

									   Optimal Feature Spaces for
Noise-Robust Speech Recognition

                 Girton College
             University of Cambridge

  Submitted for the degree of Master of Philosophy
 in Computer Speech, Text and Internet Technology
I, Rogier van Dalen, of Girton College, a candidate for the M.Phil. in Computer Speech,
Text and Internet Technology, hereby declare that this dissertation and the work described
in it are my own work, unaided except as specified, and that the dissertation
does not contain material that has already been used to any substantial extent for any
comparable purpose. I also declare that this dissertation contains       words, including
footnotes, appendices and bibliography, and that this is less than       words, as
prescribed in the Special Regulations of the M.Phil. examinations for which I am a candidate.

I would like to thank my supervisor, Mark Gales, for coming up with the idea of this
project, and for helping me along throughout with answers, advice and new ideas. I would
also like to thank Hank Liao for giving me access to his experimental set-ups, for his
comments, and for figure  .  on page   .


Automatic speech recognition
    Parameter estimation with expectation–maximisation
    Extracting features from audio

Noise robustness
    Single-pass retraining
    Parallel model combination
    Vector Taylor series
    Joint uncertainty decoding

Linear transformations
    Maximum likelihood linear regression
        …
        Covariance …
    Semi-tied covariance matrices

Linear transformations for noise robustness
    Predictive linear transforms
        Predictive …
        Predictive …
        Half-iteration predictive semi-tied covariance matrices
        Predictive semi-tied covariance matrices
    Computational complexity
        Predictive …
        Predictive …
        Predictive semi-tied covariance matrices
        Summary

    Resource Management task
    … task
    Reference systems
    Predictive semi-tied covariance matrices
    Transformations without covariance bias
    Practical considerations

    Future work


Automatic speech recognition has improved so much over the years that it is becoming
a standard feature in mobile phones, call centres, and operating systems. The world
market leader in dictation systems claims a   –    accuracy for its main product for
desktop computers. Even if this estimate is valid, this degree of reliability requires a
high-quality microphone and a noise-free environment.
    Noise is still a major stumbling block for speech recognisers. With the Resource Management
recipe distributed with the Hidden Markov Model Toolkit, the speech recogniser
obtains a   .   word error rate. When noise is added at a    dB signal-to-noise
ratio, this becomes   .  . Speech recognisers trained on clean data cannot handle noisy data.
    The goal of Joint uncertainty decoding (Liao and Gales     ) is to compensate a
speech recogniser for noise and to integrate the uncertainty about the observations into
the acoustic models without the computational cost that comes with marginalising over
the noise components. It models the uncertainty caused by the noise with a covariance
bias, which is added to the models' covariance matrices. This makes decoding with Joint
uncertainty decoding less efficient than is desirable. The first problem is that models acquire
full covariance matrices even when the original models have diagonal covariance
matrices. Full covariance matrices cause much of the computational cost of evaluating
Gaussian distributions. Liao and Gales (     ) found that simply diagonalising the resulting
matrices results in poorer performance. The second problem is the existence of
the bias at all: it makes adapting to changing noise conditions much more expensive. To
solve these two problems, this work proposes approximations of Joint transforms that
trade some accuracy for efficiency.
    For an example where efficiency is important, imagine an embedded speech recogniser
in a car, with no memory or time to spare. Given a few frames of background
noise, it would instantaneously estimate a Joint transformation and convert it to a form
that is cheap to use in decoding.
    Noise changes the nature of the resulting observation. The effect of different noise
conditions can be seen in figure  .  on the next page. The covariances are markedly different.
At the heart of the solution for the two problems with Joint uncertainty decoding

   Figure  .  Real-world speech data at various noise conditions. Depicted are the
   means and variances of the first two MFCCs. Changes in noise conditions cause
   changes in feature spaces. Figure provided by Hank Liao on data from Toshiba
   Research Europe Ltd.

lies the intuition that noise changes the feature space. To compensate the recogniser for
noise, the correlations that appear under the influence of noise should be taken into account.
This work will consider four different linear feature space transformations. Two
forms will be presented that solve problem (1) by retaining a covariance bias, but making
it diagonal. Two other forms fully eliminate the covariance bias, solving problem (2).
The latter are the most efficient.
    The linear transformations are normally estimated from audio data. In this work,
however, they are estimated in a predictive fashion from Joint transformations. They,
in turn, are estimated on stereo data (clean speech and artificially corrupted speech).
This is not realistic: the uncorrupted speech is not available to the speech recogniser in
the car. However, it does make it possible to evaluate the linear transformations against
their ideal counterparts, estimated directly on the stereo data.
    Though this work is meant for readers with a general knowledge of speech recognition
technology, chapter 2 gives a short introduction. Chapter 3 introduces methods for
noise robustness. Chapter 4 describes how linear adaptation methods work and how they
are normally estimated. Against this background, chapter 5 explains how to estimate
linear transformations from Joint transformations and details the consequences for the
computational cost. Chapter 6 brings the theory into practice and finds the resulting
recognition accuracy. Chapter 7 summarises the achievements in the light of practical
noise-robust speech recognition.

Automatic speech recognition

State-of-the-art speech recognisers are based on probabilistic models. To extract the
most likely words W from a sequence of observations O = (o(1), . . . , o(T)), Bayes'
rule is used:

    Ŵ = arg max_W P(W|O) = arg max_W P(O|W)P(W) / P(O) = arg max_W P(O|W)P(W),    ( . )

so that the recognition process can be divided into the acoustic model P(O|W), which
determines how words are realised, and the language model P(W), which determines
how likely a sequence of words is. This work focuses on the former.
    The acoustics of speech are modelled by hidden Markov models (HMMs). Hidden
Markov models assume that every observation is generated by one state of a network,
that transitions between states are probabilistic and depend only on the previous state
(the Markov assumption), and that the observations are independent except through
the states. Figure  .  contains a simple network modelling one phone (a small unit of
speech), with a beginning characterised by state 2, a middle part characterised by state 3,
and a last part characterised by state 4. States 1 and 5 are non-emitting states, used to
connect networks. Concatenating phone networks creates word networks; sequences,
or more complicated graphs, of word networks form sentence networks. The arrows
between states indicate transition probabilities a_ij, which are the probability of being in
state j at time t given that the state at time t − 1 was i. Speech recognition comes down to

   Figure  .  A standard three-state hidden Markov model that represents a phone,
   with transition probabilities a_ij on the arcs. States 1 and 5 are non-emitting
   states used to concatenate models.

finding the most likely path through the network X = (x(1), . . . , x(T)). The Viterbi
algorithm (Viterbi     ) does this.
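The Viterbi recursion can be sketched in a few lines. This is a minimal illustration of the dynamic programme, not the decoder used in this work; the transition and output probabilities in the usage example are invented for demonstration.

```python
import numpy as np

def viterbi(log_a, log_b):
    """Most likely state path through an HMM.

    log_a: (N, N) log transition probabilities a_ij
    log_b: (T, N) log output probabilities b_j(o(t))
    Returns the path x(1) ... x(T) as a list of state indices.
    """
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)   # best log score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = log_b[0]                # assume a uniform initial distribution
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a   # (N, N): from state i to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    # trace the best path back through the back-pointers
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy two-state example (probabilities are illustrative only):
log_a = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
log_b = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
best_path = viterbi(log_a, log_b)
```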
    Observations consist of feature vectors. HMM states model the observations with
Gaussian distributions, or mixtures of Gaussian distributions. The probability that
component m of state j generates the observation, given that the state is j, is the
component weight c_m^(j), with Σ_{m=1}^{M} c_m^(j) = 1. Given that component m
generated observation o(t), its distribution is modelled as a Gaussian distribution

    p(o(t)|m) = N(o(t); µ^(m), Σ^(m)),    ( . )

so that

    p(o(t)|x(t) = j) = b^(j)(o(t)) = Σ_{m=1}^{M} c_m^(j) · N(o(t); µ^(jm), Σ^(jm)),    ( . )

where o(t) is represented as a feature vector, and µ^(m) and Σ^(m) are the mean and the
covariance, respectively, of the Gaussian. It is equation ( . ) that common adaptation
schemes target. It is also the focus of this work.
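The mixture output distribution above can be evaluated as follows. This is a minimal sketch with diagonal covariances and log-domain arithmetic for numerical stability; the function names are illustrative.

```python
import numpy as np

def log_gaussian(o, mu, var):
    """Log density of a diagonal-covariance Gaussian N(o; mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def log_output_prob(o, weights, means, variances):
    """log b_j(o): log-sum-exp over the mixture components of state j."""
    comp = [np.log(w) + log_gaussian(o, mu, var)
            for w, mu, var in zip(weights, means, variances)]
    m = max(comp)                       # subtract the max to avoid underflow
    return m + np.log(sum(np.exp(c - m) for c in comp))
```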
    Depending on how “phone” is defined, there are around 44 phones in the English
language (Collins and Mees     ), but their realisations differ because of various factors
that include the context. By introducing multiple models depending on the phones left
and right, rather than one model per phone, co-articulation effects can be taken into
account. The resulting models are identified by the name of the left and right context,
and the phone itself. They are therefore called “triphones”. Because many combinations
of three phones do not occur often enough for them to be trained properly, several of
them may be clustered with a decision tree, so that they are represented by one model
(Young and Woodland     ).

 .    Parameter estimation with expectation–maximisation

It would be best to train the parameters of hidden Markov models in such a way as to
maximise the likelihood of the training data. This is called “maximum likelihood estimation”.
Hidden Markov models contain two types of parameters that must be trained:
the transition probabilities between states, and the output distributions of the states. To
find maximum likelihood estimates for both, the state sequence must be known. The
state sequence, however, is unobserved. Dempster et al. (     ) proposed an iterative
algorithm, called “expectation–maximisation”, for finding a maximum likelihood estimate
with incomplete data. First a distribution over the missing data, the state sequence
X, is found given the previous parameters λ^(k) and the observations O. The auxiliary
function Q(λ, λ^(k)) is defined as

    Q(λ, λ^(k)) = Σ_{X∈X} log P(O, X|λ) P(X|O, λ^(k)),    ( . )

with X the space of all possible state sequences. The expectation–maximisation algorithm
guarantees that an increase in the auxiliary function leads to an increase in the
likelihood of the data.
    The new parameters λ^(k+1) are then set to the maximum likelihood estimate given
the state sequence distribution and the observations:

    λ^(k+1) = arg max_λ Q(λ, λ^(k)).    ( . )

This process is repeated until convergence.
    Applied to hidden Markov models, the expectation step finds the expected value of
indicator variables that are 1 if the state at time t is j, and 0 otherwise. That is, it finds
the state posteriors

    γ^(j)(t) = P(x(t) = j | O, λ^(k)).    ( . )

    To find the state posteriors, forward probabilities α^(j)(t) and backward probabilities
β^(j)(t) need to be estimated. α^(j)(t) and β^(j)(t) can be estimated recursively from the
beginning and the end of the observation sequence, respectively. They are defined as

    α^(j)(t) = p(o(1) . . . o(t), x(t) = j | λ^(k)),    ( . )
    β^(j)(t) = p(o(t+1) . . . o(T) | x(t) = j, λ^(k)),    ( . )

so that

    α^(N)(T) = p(O | λ^(k)).    ( . )

The posterior probability is then given by

    γ^(j)(t) = P(x(t) = j | O, λ^(k)) = α^(j)(t) β^(j)(t) / p(O | λ^(k)).    ( . )

    By going through the observations and gathering statistics for every component,
weighted by γ^(m)(t), maximum likelihood estimates for the new hidden Markov model
parameters λ^(k+1) can be found as follows:

    â_ij = Σ_{t=1}^{T−1} α^(i)(t) a_ij b^(j)(o(t+1)) β^(j)(t+1) / Σ_{t=1}^{T−1} α^(i)(t) β^(i)(t),    ( . )

    µ̂^(j) = Σ_{t=1}^{T} γ^(j)(t) o(t) / Σ_{t=1}^{T} γ^(j)(t),    ( . )

    Σ̂^(j) = Σ_{t=1}^{T} γ^(j)(t) (o(t) − µ̂^(j))(o(t) − µ̂^(j))^T / Σ_{t=1}^{T} γ^(j)(t).    ( . )
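The forward–backward recursions and the weighted-average re-estimation above can be sketched compactly. This is a minimal, unscaled sketch for short sequences; real implementations rescale α and β, or work in the log domain, to avoid underflow.

```python
import numpy as np

def forward_backward(a, b):
    """State posteriors gamma_j(t) for a simple HMM.

    a: (N, N) transition matrix a_ij
    b: (T, N) output likelihoods b_j(o(t))
    Assumes a uniform initial state distribution for simplicity.
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = b[0] / N                       # forward pass from the start
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * b[t]
    beta[T - 1] = 1.0                         # backward pass from the end
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[t + 1] * beta[t + 1])
    p_O = alpha[T - 1].sum()                  # p(O | lambda)
    gamma = alpha * beta / p_O                # state posteriors
    return gamma, p_O

def reestimate_mean(gamma_j, obs):
    """Weighted-average mean update: sum_t gamma_j(t) o(t) / sum_t gamma_j(t)."""
    return gamma_j @ obs / gamma_j.sum()
```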

    To extend the parameter estimation to components, they can be seen as states. The
forward–backward algorithm applies in exactly the same way. The posterior probability
of component m is written γ^(m)(t). It is sometimes convenient to express the total
occupancy of a component directly:

    γ_m = Σ_{t=1}^{T} γ^(m)(t).    ( . )

    Setting model parameters by finding weighted averages for them in the observations
is done not only for HMM parameters, but also for some adaptation scheme parameters
that will be discussed in chapter 4. To keep the number of parameters low, model transformations
that make the model match the data better are usually tied over a group of
components, called a regression class (Leggetter and Woodland     ). Since it is not
in general known in advance which components will benefit from the same transformation,
it is often assumed that models with similar parameters will be similarly transformed.
For example, regression classes can be found by bottom-up clustering based on
the Kullback–Leibler distance between distributions. This results in the acoustic space
being divided into regions. This work will denote regression classes by r.
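As an illustration of the distance used in such clustering, the Kullback–Leibler divergence between two diagonal-covariance Gaussians has a closed form. A sketch (the function name is illustrative); since KL is asymmetric, clustering typically uses a symmetrised version.

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL(p || q) between diagonal-covariance Gaussians p and q.

    0.5 * sum_i [ log(var2/var1) + (var1 + (mu1 - mu2)^2) / var2 - 1 ]
    """
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrised distance, as might be used for bottom-up clustering."""
    return (kl_diag_gaussians(mu1, var1, mu2, var2)
            + kl_diag_gaussians(mu2, var2, mu1, var1))
```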

 .      Extracting features from audio

The nature of the observations has not been discussed yet. They consist of a vector of
“features”. Since the raw audio samples do not by themselves carry much readily usable
information, they are processed before they end up being represented as feature vectors.
One feature vector represents a segment of audio so short, usually    ms at every    ms,
that the speech signal can be assumed to be stable during this period. (This by definition
breaks the assumption that HMMs make that subsequent feature vectors are independent.)
Taking such a segment, applying a window (for example, a Hamming window),
and then applying a Fourier transform produces the spectrum for the audio segment.
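The framing and windowing step can be sketched as follows. The 25 ms frame length and 10 ms shift are typical default values assumed here, not taken from the text.

```python
import numpy as np

def frame_spectra(samples, rate, frame_ms=25, shift_ms=10):
    """Hamming-windowed magnitude spectra of short audio frames.

    samples: 1-D array of audio samples; rate: sampling rate in Hz.
    frame_ms and shift_ms are common choices, assumed for illustration.
    """
    frame = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    window = np.hamming(frame)
    frames = [samples[i:i + frame] * window
              for i in range(0, len(samples) - frame + 1, shift)]
    # real-input FFT: one spectrum per frame, frame // 2 + 1 bins each
    return np.abs(np.fft.rfft(frames, axis=1))
```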
    When the log amplitudes of the spectrum are taken, the parameters in the log-spectral
domain are found. To model the ear’s higher sensitivity to low frequencies, the
log-magnitude spectrum is mapped onto the Mel scale by applying triangular windows
to it. Then the discrete cosine transform is used, so that the Mel frequency “cepstrum”
(an anagram of “spectrum”) is obtained. The discrete cosine transform is a simplified
version of the Fourier transform: it takes advantage of the fact that the log-magnitude
spectrum is real-valued and symmetric around 0. If B is the number of filterbank channels,
then the DCT can be expressed as a matrix C with elements

    c_ij = cos( i (j − 1/2) π / B ).    ( . )

    By applying the discrete cosine transform to the Mel log-magnitude spectrum, Mel
frequency cepstral coefficients (MFCCs) are obtained. It is usual for feature vectors to
contain Mel frequency cepstral coefficients and the energy. First-order and second-order
differentials are also added to capture the direction of the coefficient changes.
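Constructing the DCT matrix above and applying it to the log Mel filterbank outputs can be sketched as follows. The √(2/B) scaling is the common (HTK-style) orthonormalisation convention, assumed here; the equation in the text omits it.

```python
import numpy as np

def dct_matrix(num_ceps, num_banks):
    """DCT matrix C with c_ij = sqrt(2/B) * cos(pi * i * (j - 1/2) / B).

    num_ceps rows (i = 1 .. num_ceps), num_banks columns (j = 1 .. B).
    """
    i = np.arange(1, num_ceps + 1)[:, None]
    j = np.arange(1, num_banks + 1)[None, :]
    return np.sqrt(2.0 / num_banks) * np.cos(np.pi * i * (j - 0.5) / num_banks)

# MFCCs: apply C to a vector of log Mel filterbank energies, e.g.
#   mfcc = dct_matrix(12, 24) @ log_mel_energies
```

With this scaling the rows of C are orthonormal (for i < B), so truncating to the first few rows simply projects onto the lowest cepstral coefficients.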
    Some methods have been proposed that increase the inherent robustness of the features
against noise. The simplest method of adapting the data is called cepstral mean
normalisation. It normalises the cepstral feature vectors of an utterance by subtracting their
mean. This compensates for a linear filter on the original signal. Its simplicity and effectiveness
have made it ubiquitous.
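Cepstral mean normalisation is simple enough to state in one line; a sketch:

```python
import numpy as np

def cepstral_mean_normalise(features):
    """Subtract the per-utterance mean from each cepstral feature vector.

    features: (T, D) array of cepstral vectors for one utterance.
    A linear filter adds a constant offset in the cepstral domain,
    so subtracting the utterance mean removes it.
    """
    return features - features.mean(axis=0)
```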
    Another feature extraction scheme is called perceptual linear predictive (PLP) analysis
(Hermansky     ). The coefficients resulting from it are slightly more noise-robust
than MFCCs.

Noise robustness

Speech recognisers are usually trained on clean speech data (recorded with a high-quality
microphone, with little background noise). They are often used on noisy data
(e.g. a mobile phone-style microphone in a noisy environment). The mismatch between
the models and the observations causes performance to plummet. This can be alleviated
in two ways. One is to make the model match the observations; the other is to make the
observations match the model. Doing either requires a model of the difference between
the clean speech (hopefully modelled by the clean acoustic models) and the observations.
    If the signal x[m] is the clean speech, h[m] is the convolutional noise, capturing
microphone and room characteristics, and n[m] is the additive noise, capturing background
noise, then the standard model of the influence of acoustic noise is (Acero     )

    y[m] = x[m] ∗ h[m] + n[m].    ( . )

This assumes that the microphone and room characteristics can be characterised by a
linear channel filter h[m]. As the dynamic Bayesian network in figure  .  shows, it also
assumes that the clean speech is independent of the noise. This is unlikely to be true,
since people counteract noise by altering their speech; this is called the “Lombard effect”,
described by Junqua and Anglade (     ), and this work ignores it.

   Figure  .  Dynamic Bayesian network of the noise model, with the emitting states
   shaded. Reproduced with permission from Liao and Gales (    b).
    Speech recognisers commonly use MFCCs (see section  .  on page   ), so ( . ) must
be reformulated as a function of the cepstral descriptions of the clean speech vector, x,
of the convolutional noise, h, and of the additive noise, n (without differentials). Using
the discrete cosine transform C (see section  .  on page   ) and its rows c_i, the elements
of the noisy speech vector y are given by

    y_i = c_i log( e^{C⁻¹(x+h)} + e^{C⁻¹n} )    ( . )
        = x_i + h_i + c_i log( 1 + e^{C⁻¹(n−x−h)} ),    ( . )

where the exponential and the logarithm are applied element-wise. This formulation
makes clear that the influence of noise in the cepstral domain is highly non-linear.
    This work adapts the models to the data to compensate for noise. Section  .  discusses
single-pass retraining, which trains new models from stereo data. Model adaptation
techniques that improve noise robustness specifically approximate ( . ). Sections
 . – .  discuss parallel model combination, vector Taylor series, and Joint uncertainty
decoding, respectively.
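The static mismatch function above can be written directly in code. A sketch, with element-wise exponential and logarithm; here C and C_inv stand for the DCT and its inverse (in practice C is truncated, so a pseudo-inverse would be used).

```python
import numpy as np

def corrupt_cepstrum(x, h, n, C, C_inv):
    """Static noisy cepstrum from the standard mismatch function:

        y = x + h + C log(1 + exp(C_inv (n - x - h)))

    x, h, n: static cepstral vectors of clean speech, channel, and noise.
    C maps log-spectral vectors to cepstra; C_inv maps them back.
    """
    return x + h + C @ np.log1p(np.exp(C_inv @ (n - x - h)))
```

With C = C_inv = I and h = 0, this reduces to y = log(e^x + e^n), i.e. power addition in the (log-)spectral domain, which is a useful sanity check.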

 .      Single-pass retraining

The models can be made to better match the data by taking a speech recogniser trained
on clean speech and retraining it on noisy speech. However, a clean speech recogniser
is not straight away going to get the posteriors right when fed noisy speech. For the
artificially corrupted corpora used in this work, however, the clean data is also available.
It is therefore possible to find the posteriors from the clean models on the clean data, but
accumulate statistics for the new parameters from the noisy data. This is called single-pass
retraining (Gales     ). The component posteriors are given by

    γ^(m)(t) = P(x(t) = m | S, λ),    ( . )

where S is the clean speech data. The means and variances are then trained with

    µ̂^(m) = Σ_{t=1}^{T} γ^(m)(t) o(t) / Σ_{t=1}^{T} γ^(m)(t),    ( . )

    Σ̂^(m) = Σ_{t=1}^{T} γ^(m)(t) (o(t) − µ̂^(m))(o(t) − µ̂^(m))^T / Σ_{t=1}^{T} γ^(m)(t).    ( . )

The component weights and transition probabilities are not changed.
    Single-pass retraining provides a ceiling for the performance of the predictive transformations.
The Joint transformations in this work have been estimated from stereo
data, and form the statistics on which the linear transformations are estimated. The ideal
predictive transforms would therefore be equal to those directly estimated with single-pass
retraining.
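Accumulating the single-pass statistics for one component can be sketched as follows: the posteriors come from the clean data, the observations from the parallel noisy data.

```python
import numpy as np

def single_pass_retrain(gamma, noisy_obs):
    """New mean and (full) covariance for one component.

    gamma: (T,) posteriors gamma_m(t), computed on the CLEAN data;
    noisy_obs: (T, D) parallel noisy observations o(t).
    """
    occ = gamma.sum()                              # component occupancy
    mu = gamma @ noisy_obs / occ                   # weighted mean
    centred = noisy_obs - mu
    sigma = (gamma[:, None] * centred).T @ centred / occ   # weighted covariance
    return mu, sigma
```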

 .      Parallel model combination

Parallel model combination (Gales and Young     ) is a model compensation technique.
It can compensate for additive noise through a noise model estimated from a few frames
containing only noise. With a small amount of adaptation data, it can also handle convolutional
noise. It operates in the cepstral domain and modifies both the means and the
variances of a model set. To do this, it uses a mismatch function that approximates the
effects of the noise on the speech parameters, which are more naturally described in the
log-spectral domain.
    A popular approximation is the log-normal approximation. It assumes that the sum of
two log-normally distributed variables (the speech and the noise) is itself log-normally
distributed. In the spectral domain, this technique therefore matches only the first two
moments of the corrupted speech distribution. At    dB, parallel model combination has
been found to restore recognition performance to that of a recogniser trained on the
corrupted speech (Gales and Young     ).

 .      Vector Taylor series

Equation ( . ) cannot be used directly to compensate the clean speech models for noise:
even if x, h, and n are all assumed to be Gaussian distributed, y will not be. By linearising
it with a truncated vector Taylor series (Moreno     ; Kim et al.     ; Acero et al.     ),
an approximation of the compensated model parameters can be derived. Only a small
amount of data is needed to find the statistics that this method needs: the means of the
convolutional and additive noise, µ_h and µ_n, and the variance of the additive noise, Σ_n.
    The notation µ_y^(m) for the Taylor series expansion point will be used to indicate that
y^(m) is evaluated at the clean speech mean µ_x^(m), the convolutional noise mean µ_h, and
the additive noise mean µ_n. The first-order vector Taylor series approximation of ( . )
for component m is

    ŷ_i^(m) = y_i|_{µ_y^(m)} + ∇_x y_i|_{µ_y^(m)} · (x − µ_x^(m))
                + ∇_n y_i|_{µ_y^(m)} · (n − µ_n) + ∇_h y_i|_{µ_y^(m)} · (h − µ_h).    ( . )

    The resulting compensated model parameters for the static features of component
m are found by approximating the first and second moments of y^(m) (Liao and Gales     ):

    µ̂_{y,i}^(m) = E[ y_i ]    ( . )
              ≈ E[ ŷ_i ] = y_i|_{µ_y^(m)}    ( . )
              = µ_{x,i}^(m) + µ_{h,i} + c_i log( 1 + e^{C⁻¹(µ_n − µ_x^(m) − µ_h)} ),    ( . )

where E[ · ] denotes the expected value for component m.

    The first-order vector Taylor series approximation is needed to find the influence of
the noise on the variance:

    Σ_y^(m) = E[ y yᵀ ] − µ_y^(m) µ_y^(m)ᵀ    ( . )
            ≈ E[ ŷ ŷᵀ ] − µ_y^(m) µ_y^(m)ᵀ    ( . )
            = ∂y/∂x|_{µ_y^(m)} Σ_x^(m) ∂y/∂x|ᵀ_{µ_y^(m)}
              + ∂y/∂h|_{µ_y^(m)} Σ_h ∂y/∂h|ᵀ_{µ_y^(m)}
              + ∂y/∂n|_{µ_y^(m)} Σ_n ∂y/∂n|ᵀ_{µ_y^(m)}    ( . )
            = ∂y/∂x|_{µ_y^(m)} Σ_x^(m) ∂y/∂x|ᵀ_{µ_y^(m)}
              + ∂y/∂n|_{µ_y^(m)} Σ_n ∂y/∂n|ᵀ_{µ_y^(m)}.    ( . )

( . ) makes the assumption that the clean speech and the noise are independent; ( . ) that
the channel noise is constant, so that Σ_h = 0. Σ_x^(m) can be found from the model; Σ_n can
be found from a few frames of noise. The resulting Σ_y^(m) is not diagonal, even if Σ_x^(m)
and Σ_n are, but it is often diagonalised to make decoding more efficient.

    Some manipulation shows that the Jacobian matrices are given by

    \left.\frac{\partial y}{\partial x}\right|_{\mu_y^{(m)}} = \left.\frac{\partial y}{\partial h}\right|_{\mu_y^{(m)}} = I - CFC^{-1} ,                ( . )

    \left.\frac{\partial y}{\partial n}\right|_{\mu_y^{(m)}} = CFC^{-1} ,                ( . )
where F is a diagonal matrix whose elements are given by

    f_{ii} = \frac{e^{c_i^{-1}(\mu_n - \mu_x^{(m)} - \mu_h)}}{1 + e^{c_i^{-1}(\mu_n - \mu_x^{(m)} - \mu_h)}} ,                ( . )

with c_i^{-1} the ith row of C^{-1}.

     It is interesting to see the effect of the signal-to-noise ratio on the variance that this model predicts. f_{ii} varies between 0 and 1 depending on the value of \mu_n - \mu_x^{(m)} - \mu_h. If the noise level \mu_n is high, f_{ii} will tend to 1, causing \partial y / \partial n to tend to the identity matrix and \partial y / \partial x to zero. The resulting variance, in equation ( . ), will therefore tend to the variance of the noise \Sigma_n, and be small. If the noise level is low, the opposite will happen, and the variance will tend to the clean speech variance \Sigma_x^{(m)}. This provides an elegant way of accounting for the changes in the covariance because of noise that were seen in figure . on page . However, compensating models with vector Taylor series is computationally expensive, since it requires the matrix multiplications in ( . ) for every component.
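This behaviour can be sketched numerically. The sketch below works directly in the log-spectral domain, so that C = I and the Jacobians reduce to I − F and F; all names are illustrative, not from the thesis.

```python
import numpy as np

# Sketch of first-order VTS model compensation in the log-spectral domain
# (C = I).  Names are illustrative assumptions, not the thesis's code.
def vts_compensate(mu_x, Sigma_x, mu_n, Sigma_n, mu_h):
    """Compensate one component's mean and covariance for additive
    noise n and convolutional noise h."""
    d = mu_x.shape[0]
    # f_ii = exp(mu_n - mu_x - mu_h) / (1 + exp(mu_n - mu_x - mu_h))
    f = 1.0 / (1.0 + np.exp(-(mu_n - mu_x - mu_h)))
    J_x = np.eye(d) - np.diag(f)   # dy/dx = dy/dh = I - F  (C = I case)
    J_n = np.diag(f)               # dy/dn = F
    # Corrupted-speech mean: y = x + h + log(1 + exp(n - x - h))
    mu_y = mu_x + mu_h + np.log1p(np.exp(mu_n - mu_x - mu_h))
    # First-order variance propagation (speech and noise independent)
    Sigma_y = J_x @ Sigma_x @ J_x.T + J_n @ Sigma_n @ J_n.T
    return mu_y, Sigma_y
```

At a very high noise level the compensated covariance approaches \Sigma_n; at a very low one it approaches \Sigma_x, as described above.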

 .  Joint uncertainty decoding
Joint uncertainty decoding (Liao and Gales    ;    b) is a model compensation technique derived from a model of the joint distribution of the clean and the noisy speech. It assumes that this distribution is Gaussian. If s(t) is a clean speech vector (including first- and second-order differentials), and o(t) is the corresponding observation, then

    \begin{bmatrix} s(t) \\ o(t) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_s \\ \mu_o \end{bmatrix} , \begin{bmatrix} \Sigma_s & \Sigma_{so} \\ \Sigma_{os} & \Sigma_o \end{bmatrix} \right) ,                ( . )

with parameters specific to the clean speech model state and the noise model state.
    If the uncertainty decoding is done in the front-end, then Joint uncertainty decoding partitions the corrupted acoustic space into regions, for each of which a conditional distribution p(o(t)|s(t)) is estimated. Model-based Joint uncertainty decoding, however, ties this conditional distribution to the model components: every component belongs to one regression class r. Components are compensated for the noise characteristics of the region of the acoustic space their means lie in.
    The distribution of the corrupted speech for component m in regression class r becomes

    p(o(t)|m) = |A^{(r)}| \cdot \mathcal{N}\left( A^{(r)} o(t) + b^{(r)} ;\ \mu^{(m)} ,\ \Sigma^{(m)} + \Sigma_b^{(r)} \right) ,                ( . )

where

    A^{(r)} = \Sigma_s^{(r)} \Sigma_{os}^{(r)-1} ,                ( . )

    b^{(r)} = \mu_s^{(r)} - A^{(r)} \mu_o^{(r)} ,                ( . )

    \Sigma_b^{(r)} = A^{(r)} \Sigma_o^{(r)} A^{(r)T} - \Sigma_s^{(r)} .                ( . )

Model-based Joint uncertainty decoding forms the basis of this work. In effect, it applies a piecewise linear transformation to the acoustic space. It also adds a bias \Sigma_b^{(r)} to the covariance, modelling the changes in the variance of noise-corrupted speech.
    The parameters of the joint distribution in ( . ) can be estimated from stereo data, with the clean speech and the noisy speech. This is relatively straightforward, and this technique is used in this work. It is also possible to find the parameters from the vector Taylor series approximation, or to use Joint adaptive training (Liao and Gales    b).
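Given the joint statistics, the transform parameters follow directly from ( . )–( . ). A minimal sketch, with illustrative names; the synthetic check below constructs a joint distribution for which the answer is known in closed form.

```python
import numpy as np

# Sketch: Joint transform parameters from per-class joint statistics.
# All argument names are illustrative assumptions.
def joint_transform(mu_s, mu_o, Sigma_s, Sigma_os, Sigma_o):
    A = Sigma_s @ np.linalg.inv(Sigma_os)   # A^(r) = Sigma_s Sigma_os^{-1}
    b = mu_s - A @ mu_o                     # b^(r)
    Sigma_b = A @ Sigma_o @ A.T - Sigma_s   # covariance bias
    return A, b, Sigma_b
```

For instance, if the noisy speech were exactly o = M s + c plus independent noise with covariance \Sigma_e, this recovers A = M^{-1} and \Sigma_b = M^{-1} \Sigma_e M^{-T}.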
    Joint uncertainty decoding with the form in ( . ) works well (Liao and Gales    a). However, the compensated models' covariances become full, increasing the computational complexity of decoding by a factor of d, the dimensionality of the features. The simple solution is to diagonalise \Sigma_b^{(r)}. However, doing that while not diagonalising A^{(r)} is mathematically wrong and leads to extremely poor accuracy (Liao and Gales    ). Diagonalising both reduces the complexity, but also reduces accuracy. This work will apply feature space transformations to reduce the computational complexity of Joint uncertainty decoding without reducing the accuracy. The next chapter will introduce these transformations.

Linear transformations

Starting out with an already trained speech recogniser makes it possible to resolve the mismatch using fewer parameters than training a recogniser from scratch would take. Chapter   discussed techniques for resolving the mismatch caused by noise specifically. This chapter discusses linear transformations. The original form of maximum likelihood linear regression (Leggetter and Woodland    ) transformed only the mean and was meant specifically to adapt the models to a speaker. Because of their generic nature, however, linear transformations can resolve mismatches not only in speaker accent, speaking style, and voice quality, but also in the noise condition. Because they use fewer parameters than full speech recognisers, they can be trained on much less data than it would take to retrain models individually, and they can be estimated on-line.
    Semi-tied covariance matrices (Gales    ) have a similar form. They provide rotation matrices that apply to observations, means, and variances. This means that the observation likelihood is calculated in a different feature space. The original objective was to allow the Gaussians to have diagonal covariance matrices. Therefore, the algorithm for estimating semi-tied covariance matrices finds feature spaces in which diagonalising the covariances is reasonable.

 .  Maximum likelihood linear regression
Maximum likelihood linear regression (MLLR) transforms, in their most general form, transform both the means and the covariances of Gaussian distributions (Gales and Woodland    ). The new mean \hat{\mu}^{(m)} and covariance \hat{\Sigma}^{(m)} of component m become

    \hat{\mu}^{(m)} = A'^{(r)} \mu^{(m)} - b'^{(r)} ,                ( . )

    \hat{\Sigma}^{(m)} = H^{(r)} \Sigma^{(m)} H^{(r)T} ,                ( . )

with component m in regression class r. Given a small amount of training data, it is possible to find maximum likelihood estimates for A'^{(r)}, b'^{(r)}, and H^{(r)}.

    Various specific forms of MLLR have been proposed. The original paper (Leggetter and Woodland    ) transformed only the means. This work will look at two other forms: one that applies the same transform to means and covariances, called constrained MLLR, and one that adapts only the covariances, called covariance MLLR.


    The special case A'^{(r)} = H^{(r)} is one of the transformations to be considered in this work. It is called constrained maximum likelihood linear regression (CMLLR). Digalakis et al. (    ) introduced the diagonal transform case; Gales (    a) extended it to full transforms. It transforms the models by

    \hat{\mu}^{(m)} = A'^{(r)} \mu^{(m)} - b'^{(r)} ,                ( . )

    \hat{\Sigma}^{(m)} = A'^{(r)} \Sigma^{(m)} A'^{(r)T} .                ( . )

Its advantages come to light, however, when it is written as a transformation of the observations,

    \hat{o}(t) = A'^{(r)-1} o(t) + A'^{(r)-1} b'^{(r)} = A^{(r)} o(t) + b^{(r)} ,                ( . )

so that the observation likelihood becomes

    L(o(t); \mu^{(m)}, \Sigma^{(m)}, A^{(r)}, b^{(r)}) = |A^{(r)}| \cdot \mathcal{N}\left( A^{(r)} o(t) + b^{(r)} ;\ \mu^{(m)} ,\ \Sigma^{(m)} \right) .                ( . )

    This means that each environment-influenced feature vector is transformed to the feature space that the models in a regression class expect. Conceptually, it is a piecewise linear transformation, because Gaussians are clustered into a regression class based on their distance to each other. Computationally, models can calculate the observation likelihood on the appropriately transformed feature vector. This makes decoding fast and adaptation non-invasive, so that changes in the environment are easily compensated for.
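This likelihood calculation can be sketched as follows, assuming diagonal model covariances; function names are illustrative.

```python
import numpy as np

# Sketch: observation likelihood under a CMLLR feature transform.
# The log |A| Jacobian term accounts for the change of variables.
def log_gaussian_diag(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def cmllr_loglik(o, A, b, mu, var):
    o_hat = A @ o + b                    # transform into the model's space
    logdet = np.linalg.slogdet(A)[1]     # log |A|, the Jacobian term
    return logdet + log_gaussian_diag(o_hat, mu, var)
```

With A = I and b = 0 this reduces to the plain diagonal Gaussian log-likelihood.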
    To obtain a maximum likelihood estimate for A^{(r)} and b^{(r)}, start out by formulating them as an extended transformation matrix W^{(r)} = [A^{(r)}\ \ b^{(r)}]. The extended observation vector is \zeta(t) = [o(t)^T\ \ 1]^T, so that

    \hat{o}(t) = A^{(r)} o(t) + b^{(r)} = W^{(r)} \zeta(t) .                ( . )

    W^{(r)} is estimated in an iterative fashion, with one row being updated at a time. The updated ith row of the transform is given by

    \hat{w}_i = \left( \alpha p_i + k^{(i)} \right) G^{(i)-1} ,                ( . )

where p_i is the extended cofactor row vector [c_{i1} \ldots c_{id}\ \ 0] (with c_{ij} = \mathrm{cof}(A_{ij})).

    The statistics needed from the adaptation data are

    G^{(i)} = \sum_{m=1}^{M} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t) \zeta(t)^T ,                ( . )

    k^{(i)} = \sum_{m=1}^{M} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t)^T ,                ( . )

and \alpha satisfies

    \alpha^2\, p_i G^{(i)-1} p_i^T + \alpha\, p_i G^{(i)-1} k^{(i)T} - \beta = 0 ,                ( . )

where

    \beta = \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma^{(m)}(t) .                ( . )
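A single row update can be sketched as follows; the choice of root for \alpha is simplified here (a full implementation would take the root that maximises the auxiliary function), and names are illustrative.

```python
import numpy as np

# Sketch of one CMLLR row update: solve the quadratic in alpha, then
# set the row to (alpha p_i + k^(i)) G^(i)^-1.
def update_row(W, i, G_inv, k, beta):
    """W: current extended transform [A b]; returns updated row i."""
    A = W[:, :-1]
    # Extended cofactor row: cofactors of A's row i, extended with a 0
    # (the bias column does not enter the determinant).
    cof = np.linalg.inv(A).T * np.linalg.det(A)
    p = np.append(cof[i], 0.0)
    a = p @ G_inv @ p          # quadratic coefficient
    b = p @ G_inv @ k          # linear coefficient
    # a*alpha^2 + b*alpha - beta = 0; take the positive root here.
    alpha = (-b + np.sqrt(b * b + 4 * a * beta)) / (2 * a)
    return (alpha * p + k) @ G_inv
```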

    The computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of G^{(i)}. The latter costs O(d^3) per matrix (with d the dimension of the feature vector). A naive implementation of the former costs O(d^4) per matrix per iteration, but using the Sherman–Morrison matrix inversion lemma this can be reduced to O(d^3) (Mark Gales, personal communication). Thus, for R transforms and I iterations, the cost of estimating the transforms is O(RId^3 + Rd^4).
    This does not take into account the cost of gathering the statistics. This work does not use statistics from data, but predicts statistics based on the models and the Joint transform. Section . . on page    shows how to generate the predicted statistics and section .  on page    shows the complexity.

 ..  Covariance MLLR
Covariance MLLR updates only the covariances, i.e. it only uses ( . ). Just like constrained MLLR, it is most efficient when used the other way around: rather than transforming the covariances, the observations and the means are transformed, yielding

    L(o(t); \mu^{(m)}, \Sigma^{(m)}, A^{(r)}) = |A^{(r)}| \cdot \mathcal{N}\left( A^{(r)} o(t) ;\ A^{(r)} \mu^{(m)} ,\ \Sigma^{(m)} \right) .                ( . )

    A^{(r)} is estimated in an iterative fashion, with one row being updated at a time. With every update the value of the auxiliary function increases. The statistics required are the occupancy-weighted summed covariances from the data,

    W^{(m)} = \sum_t \gamma^{(m)}(t) \left( o(t) - \mu^{(m)} \right) \left( o(t) - \mu^{(m)} \right)^T .                ( . )

From these, a matrix is found for every dimension i,

    G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\sigma_i^{(m)2}} ,                ( . )

with \sigma_i^{(m)2} the ith element of the leading diagonal of \Sigma^{(m)}.

    The update formula for row i of A^{(r)} is

    a_i^{(r)} = c_i G^{(i)-1} \sqrt{ \frac{ \sum_{m=1}^{M^{(r)}} \sum_t \gamma^{(m)}(t) }{ c_i G^{(i)-1} c_i^T } } ,                ( . )

where c_i is the ith row of the cofactors of A.
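This row update can be sketched as follows, with \beta denoting the total occupancy; names are illustrative.

```python
import numpy as np

# Sketch of one covariance MLLR row update: the new row lies in the
# direction c_i G^(i)^-1, scaled to maximise the auxiliary function.
def update_row_cov(A, i, G_inv, beta):
    """beta is the total occupancy  sum_m sum_t gamma^(m)(t)."""
    c = (np.linalg.inv(A).T * np.linalg.det(A))[i]   # cofactor row c_i
    scale = np.sqrt(beta / (c @ G_inv @ c))
    return scale * (c @ G_inv)
```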
    Just as for CMLLR, calculating the inverse of G^{(i)} and finding the cofactors form the main computational cost, so that estimating transforms for R regression classes in I iterations is O(RId^3 + Rd^4), with d the size of the feature vector. Covariance MLLR does need to transform all the model means, which takes O(Md^2).
    Again, this does not take into account the cost of gathering the statistics, because this work uses predicted statistics. Section . . on page    shows how to generate the predicted statistics and section .  on page    shows the complexity.

 .  Semi-tied covariance matrices
A compromise between the speed of diagonal covariance matrices and the modelling accuracy of full ones has been found earlier. Gales (    ) proposed a scheme in which diagonal covariance matrices share one rotation matrix per regression class. The algorithm finds a transformation into a feature space in which a diagonal covariance matrix is a more valid assumption than in the original feature space. The effective covariance matrix for component m is composed of a diagonal matrix \tilde{\Sigma}_{diag}^{(m)} and a transformation A^{(r)} applied to it:

    \hat{\Sigma}^{(m)} = A^{(r)-1} \tilde{\Sigma}_{diag}^{(m)} A^{(r)-T} .                ( . )

    The observation likelihood becomes

    L(o(t); \mu^{(m)}, \tilde{\Sigma}_{diag}^{(m)}, A^{(r)}) = |A^{(r)}| \cdot \mathcal{N}\left( A^{(r)} o(t) ;\ A^{(r)} \mu^{(m)} ,\ \tilde{\Sigma}_{diag}^{(m)} \right) ,                ( . )

which indicates its relation to the observation likelihood of CMLLR, in ( . ). This is also a feature-space transformation. However, CMLLR transforms just the observations; semi-tied covariance matrices transform both the observations and the model. This points to the different purposes of the two, even though they are both linear transformations. CMLLR assumes that the model is perfectly valid, but that the observations and the model are in different feature spaces. Semi-tied covariance matrices assume that the observations and the models should be in the same feature space, but recognise that the model, with its diagonal covariance matrix, is flawed, and transform the space in which the calculations are done.
    To estimate the parameters for semi-tied covariance matrices, \tilde{\Sigma}_{diag}^{(m)} and A^{(r)} must be found simultaneously. This is done in an expectation–maximisation fashion. \tilde{\Sigma}_{diag}^{(m)} is initialised to the diagonalised original covariance, diag(\Sigma^{(m)}). In the first step, the transformation A^{(r)} is updated in the same way as is done for covariance MLLR transforms, described in section . . . However, the current estimate for the covariance, which changes every iteration, is used; \tilde{\sigma}_{diag,i}^{(m)2} is the ith element of the leading diagonal of \tilde{\Sigma}_{diag}^{(m)}. In the second step, \tilde{\Sigma}_{diag}^{(m)} is set to the maximum-likelihood diagonal covariance in the feature space given by A^{(r)}. This is repeated until convergence.
    The full procedure is as follows. Repeat J times:

    1. Estimate A^{(r)} as in section . . :
       Given the current estimate for \tilde{\Sigma}_{diag}^{(m)}, iterate over the rows of A^{(r)}, updating each I times. Row i of A^{(r)} is set to

           a_i^{(r)} = c_i G^{(i)-1} \sqrt{ \frac{ \sum_{m=1}^{M^{(r)}} \sum_t \gamma^{(m)}(t) }{ c_i G^{(i)-1} c_i^T } } ,                ( . )

       where c_i is the ith row of the cofactors of A and

           G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\tilde{\sigma}_{diag,i}^{(m)2}} ,                ( . )

           W^{(m)} = \sum_t \gamma^{(m)}(t) \left( o(t) - \mu^{(m)} \right) \left( o(t) - \mu^{(m)} \right)^T .                ( . )

    2. Estimate \tilde{\Sigma}_{diag}^{(m)}. W^{(m)} is the occupancy-weighted summed covariance of the data. The maximum-likelihood estimate for \tilde{\Sigma}_{diag}^{(m)} is the covariance in the feature space expressed by A^{(r)}:

           \tilde{\Sigma}_{diag}^{(m)} = diag\left( \frac{ A^{(r)} W^{(m)} A^{(r)T} }{ \sum_t \gamma^{(m)}(t) } \right) .                ( . )

    This scheme allows diagonal covariance matrices to be used in decoding because they now work in a different feature space. This reduces the number of variables to be estimated while retaining much of the gain in recognition accuracy of full covariance matrices.
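The two-step procedure above can be sketched as follows for a single regression class, with illustrative names and without convergence checks.

```python
import numpy as np

# Sketch of the alternating estimation of semi-tied covariances.
# W[m] are occupancy-weighted scatter matrices, gamma[m] occupancies.
def estimate_semi_tied(W, gamma, d, J=10, I=5):
    A = np.eye(d)
    # Initialise to the diagonal of the per-component sample covariance
    Sigma_diag = [np.diag(Wm) / g for Wm, g in zip(W, gamma)]
    for _ in range(J):
        # Step 1: update the rows of A given the current diagonal covariances
        for _ in range(I):
            for i in range(d):
                G = sum(Wm / s[i] for Wm, s in zip(W, Sigma_diag))
                G_inv = np.linalg.inv(G)
                c = (np.linalg.inv(A).T * np.linalg.det(A))[i]
                A[i] = (c @ G_inv) * np.sqrt(sum(gamma) / (c @ G_inv @ c))
        # Step 2: maximum-likelihood diagonal covariances in the new space
        Sigma_diag = [np.diag(A @ Wm @ A.T) / g for Wm, g in zip(W, gamma)]
    return A, Sigma_diag
```

If the data covariance is already diagonal, the identity transform is a fixed point, as the small check below confirms.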
    Just like for CMLLR and covariance MLLR, the computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of G^{(i)}. The former costs O(d^2) (with d the dimension of the feature vector) per dimension per iteration. The latter costs O(d^3) per dimension. Thus, for R transforms, J outer loop iterations, and I inner loop iterations, the cost of estimating the transforms is O(RJId^3 + RJd^4).
    This does not take into account the cost of gathering the statistics. This work does not use statistics from data, but predicts statistics based on the models and the Joint transform. Section . . on page    shows how to generate the predicted statistics and section .  on page    shows the complexity.

Linear transformations for noise

Section .  on page    introduced Joint uncertainty decoding, which leads to a transformed likelihood calculation,

    p(o(t)|m) = |A_J^{(r)}| \cdot \mathcal{N}\left( A_J^{(r)} o(t) + b_J^{(r)} ;\ \mu^{(m)} ,\ \Sigma^{(m)} + \Sigma_b^{(r)} \right) ,                ( . )

in which A_J^{(r)}, b_J^{(r)}, and \Sigma_b^{(r)} are the Joint transform parameters. The problem with Joint uncertainty decoding that this work addresses is the covariance bias \Sigma_b^{(r)}. If \Sigma_b^{(r)} is full, all covariances become full, and performance decreases. Decoding with full covariance matrices costs O(TMd^2), with T the number of observations, M the number of components, and d the size of the feature vectors. Decoding with diagonal A_J^{(r)} and \Sigma_b^{(r)}, on the other hand, costs O(TMd), but loses accuracy.
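The per-component cost difference comes from the likelihood computation itself: a full-covariance Gaussian needs a d × d quadratic form (O(d^2)), a diagonal one only elementwise operations (O(d)). A small sketch, with illustrative names:

```python
import numpy as np

# Full-covariance log-likelihood: O(d^2) per component because of the
# quadratic form diff' Sigma^-1 diff.
def loglik_full(o, mu, Sigma_inv, logdet):
    diff = o - mu
    return -0.5 * (logdet + diff @ Sigma_inv @ diff
                   + len(o) * np.log(2 * np.pi))

# Diagonal-covariance log-likelihood: O(d) per component.
def loglik_diag(o, mu, var):
    diff = o - mu
    return -0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var)
```

When the full covariance happens to be diagonal, the two agree exactly; the saving is purely in the amount of arithmetic.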
    The full covariance bias models the change in feature space that the noise causes. Thus, decoding with better feature spaces may obviate the need for the full covariance bias, or, indeed, for any covariance bias at all. The linear transformations discussed in chapter    transform the feature space without the detrimental effect on decoding performance that Joint uncertainty decoding's full covariance bias has. Therefore, they will be applied to Joint transforms.
    The algorithms for estimating the linear transforms that are investigated in this work remain the same. They still find the optimal transforms in a maximum likelihood sense. The difference is in the data source. The linear transformations are normally estimated based on statistics from actual data. In this work, the statistics used are the expected values of the statistics given the models and the Joint transform. The distribution in ( . ) is assumed to be the actual distribution of the noisy data. For example, the expected covariance for component m in regression class r is \Sigma^{(m)} + \Sigma_b^{(r)}. As the linear transforms are estimated on the predicted properties of the data, they will be called "predictive" linear transforms, as in Gales (    b).
    Section .  discusses how the predictive transforms can be estimated from the models and the Joint transform. Section .  discusses the computational complexity.

 .  Predictive linear transforms
In this work the predictive transforms will be estimated from a Joint transform. Their parameters will be indicated by a P subscript. First, a short overview of the transforms will be given. Four types of linear transformations will be considered.

Predictive CMLLR (see also section . . on page    ) finds an optimal linear transformation that is applied only to the observations.

Predictive covariance MLLR (see also section . . on page    ) finds a feature space in which the model's original covariance models the actual covariance well.

Half-iteration predictive semi-tied covariance matrices (see also section .  on page    ) are similar to predictive covariance MLLR, but they do add the diagonalised covariance bias.

Predictive semi-tied covariance matrices (see also section .  ) use a per-regression-class rotation of observations and model parameters to perform the likelihood calculation in another feature space.

 ..  Predictive CMLLR
Constrained maximum likelihood linear regression finds a feature transformation that does not change the model but only transforms the observations. ( . ) is changed by taking out the covariance bias and applying the transformation in ( . ) on page    . It then becomes

    p(o(t)|m) = |A_P^{(r)}| \cdot |A_J^{(r)}| \cdot \mathcal{N}\left( A_P^{(r)} \left( A_J^{(r)} o(t) + b_J^{(r)} \right) + b_P^{(r)} ;\ \mu^{(m)} ,\ diag\left( \Sigma^{(m)} \right) \right) .                ( . )

This assumes that the mean and covariance of the original model are still correct for noisy speech, but does transform the noise-corrupted observations to another feature space.
    The procedure to estimate CMLLR parameters A_P^{(r)} and b_P^{(r)} detailed in section . . on page    can be followed. The statistics needed in that procedure, matrices G^{(i)} and vectors k^{(i)}, are replaced by their predicted values. Recall that the statistics were expressed in terms of the extended feature vector \zeta(t) = [o(t)^T\ \ 1]^T.

    Equation ( . ) on page    shows how G^{(i)} is found in the original algorithm. It sums the maximum likelihood estimates of the second moment about zero of the distribution of the extended observations, weighted by the diagonal entries of the variances of the models. Its predicted value can be found using the expected first moment \mu^{(m)} and the expected second central moment \Sigma^{(m)} + \Sigma_b^{(r)} of the components' distributions:

    G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{1}{\sigma_i^{(m)2}} E\left[ \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t) \zeta(t)^T \right]                ( . )

           = \sum_{m=1}^{M^{(r)}} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} \Sigma^{(m)} + \Sigma_b^{(r)} + \mu^{(m)} \mu^{(m)T} & \mu^{(m)} \\ \mu^{(m)T} & 1 \end{bmatrix} .                ( . )

    ( . ) on page    gives the original value of k^{(i)}. It is the sum of the observations weighted by occupancy and the model parameters. The component mean is the average of the observations weighted by occupancy, so that the predicted value for the statistics in ( . ) on page    can be found by

    k^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} E\left[ \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t)^T \right]                ( . )

           = \sum_{m=1}^{M^{(r)}} \frac{\gamma^{(m)} \mu_i^{(m)}}{\sigma_i^{(m)2}} \begin{bmatrix} \mu^{(m)T} & 1 \end{bmatrix} .                ( . )
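These predicted statistics can be assembled directly from the model parameters; a sketch with illustrative names (per-component means, covariances, and occupancies, plus the covariance bias of the regression class):

```python
import numpy as np

# Sketch: predicted G^(i) and k^(i) for predictive CMLLR, built from
# model parameters instead of adaptation data.
def predicted_stats(mus, Sigmas, gammas, Sigma_b, i):
    d = mus[0].shape[0]
    G = np.zeros((d + 1, d + 1))
    k = np.zeros(d + 1)
    for mu, Sigma, gamma in zip(mus, Sigmas, gammas):
        var_i = Sigma[i, i]
        # Expected second moment of the extended vector zeta = [o; 1]
        second = np.empty((d + 1, d + 1))
        second[:d, :d] = Sigma + Sigma_b + np.outer(mu, mu)
        second[:d, d] = mu
        second[d, :d] = mu
        second[d, d] = 1.0
        G += gamma / var_i * second
        k += gamma * mu[i] / var_i * np.append(mu, 1.0)
    return G, k
```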

 ..  Predictive covariance MLLR
Predictive covariance MLLR finds a feature space in which the model's original covariance is valid. The mean, however, is transformed to the new feature space. ( . ) is changed by taking out the covariance bias and applying the transformation in ( . ) on page    . It then becomes

    p(o(t)|m) = |A_P^{(r)}| \cdot |A_J^{(r)}| \cdot \mathcal{N}\left( A_P^{(r)} \left( A_J^{(r)} o(t) + b_J^{(r)} \right) ;\ A_P^{(r)} \mu^{(m)} ,\ diag\left( \Sigma^{(m)} \right) \right) .                ( . )

    The statistics that need to be found to estimate A^{(r)} are the W^{(m)}. From equation ( . ) it can be seen that these are the occupancy-weighted summed covariances. The predicted value for them is

    W^{(m)} = E\left[ \sum_t \gamma^{(m)}(t) \left( o(t) - \mu^{(m)} \right) \left( o(t) - \mu^{(m)} \right)^T \right]                ( . )

           = \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) \gamma^{(m)} .                ( . )

    Then G^{(i)} from equation ( . ) becomes

    G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\sigma_i^{(m)2}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma^{(m)}}{\sigma_i^{(m)2}} \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) .                ( . )
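This predicted statistic is short enough to sketch in one expression (names illustrative):

```python
import numpy as np

# Sketch: predicted G^(i) for predictive covariance MLLR.  The data
# scatter W^(m) is replaced by its expectation gamma^(m)(Sigma^(m)+Sigma_b).
def predicted_G(Sigmas, gammas, Sigma_b, i):
    return sum(g / S[i, i] * (S + Sigma_b)
               for S, g in zip(Sigmas, gammas))
```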

 ..  Half-iteration predictive semi-tied covariance matrices
The scheme laid out in section .  on page    for estimating semi-tied covariances is computationally expensive, because it alternates between updating A_P^{(r)} and \tilde{\Sigma}_{diag}^{(m)}, and updating A_P^{(r)} already requires iterating over its rows. An alternative is to update only A_P^{(r)}. Applied to ( . ), this results in a form similar to ( . ), but the covariance bias is retained in a diagonalised form:

    p(o(t)|m) = |A_P^{(r)}| \cdot |A_J^{(r)}| \cdot \mathcal{N}\left( A_P^{(r)} \left( A_J^{(r)} o(t) + b_J^{(r)} \right) ;\ A_P^{(r)} \mu^{(m)} ,\ diag\left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) \right) .                ( . )

This assumes that the diagonalised predicted covariance is valid in another feature space. The transformation A_P^{(r)} is optimised for this feature space. The first step of the procedure detailed in section .  on page    can be followed. Analogously to the predictive covariance MLLR transform, the predicted value for W^{(m)} is

    W^{(m)} = \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) \gamma^{(m)} ,                ( . )

so that G^{(i)} (see equation ( . ) on page    ) becomes

    G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\tilde{\sigma}_{diag,i}^{(m)2}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma^{(m)}}{\tilde{\sigma}_{diag,i}^{(m)2}} \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) ,                ( . )

with \tilde{\sigma}_{diag,i}^{(m)2} the ith diagonal element of diag(\Sigma^{(m)} + \Sigma_b^{(r)}).

 ..  Predictive semi-tied covariance matrices
Semi-tied covariance matrices are the most powerful linear transforms that this work discusses. They transform the calculation of the Gaussian distributions in a regression class to another feature space, to make the assumption that the covariance matrix is diagonal, which is convenient in terms of computational cost, true. The covariance bias in ( . ) is not removed, but by applying ( . ) on page    , it is transformed and diagonalised:

    p(o(t)|m) = |A_P^{(r)}| \cdot |A_J^{(r)}| \cdot \mathcal{N}\left( A_P^{(r)} \left( A_J^{(r)} o(t) + b_J^{(r)} \right) ;\ A_P^{(r)} \mu^{(m)} ,\ diag\left( A_P^{(r)} \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) A_P^{(r)T} \right) \right) .                ( . )

Since this transforms the observations, means, and variances, the distribution is essentially calculated in another feature space. The transformation A_P^{(r)} is optimised to allow diagonal covariances in this feature space.
    The full procedure detailed in section .  on page    can be followed. The values for W^{(m)} and G^{(i)} are the same as in section . . . Equation ( . ) on page    sets the diagonal covariance of the model to the diagonalised transformed covariance of the training data. The predicted value of the covariance is \Sigma^{(m)} + \Sigma_b^{(r)}, so that

    \tilde{\Sigma}_{diag}^{(m)} = diag\left( A_P^{(r)} \left( \Sigma^{(m)} + \Sigma_b^{(r)} \right) A_P^{(r)T} \right) .                ( . )
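This diagonalisation can be sketched in one line (names illustrative):

```python
import numpy as np

# Sketch: predicted diagonal covariance for predictive semi-tied
# matrices -- transform the predicted full covariance and keep the
# leading diagonal.
def predicted_sigma_diag(A_P, Sigma_m, Sigma_b):
    return np.diag(A_P @ (Sigma_m + Sigma_b) @ A_P.T)
```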

 .      Computational complexity
The current issue with using Joint uncertainty decoding in practice is the full
covariance matrices. These make decoding of T observations O(TMd ). With diagonal
covariance matrices this becomes O(TMd). This gain is acquired at the expense of extra
time spent estimating the linear transformation and transforming the models.
    The purpose of this work is to reduce the computational complexity of Joint
uncertainty decoding. The ultimate goal is to instantaneously estimate a Joint transform
from noise and estimate an approximation that allows fast decoding. For this, statistics
that depend on the model (µ^(m), Σ^(m)) but not on the Joint transform (Σ_b) should be
pre-calculated. This reduces the on-line processing time. The complexities of the schemes
proposed in this work are functions of the number of components M, the number of
regression classes R, the dimensionality of the feature vectors d, the number of observa-
tions T, the number of inner loop iterations I, and the number of outer loop iterations J.
    The next sections discuss the complexities of the algorithms from section  . . To find
the lowest possible cost, some of the statistics must be rewritten to take advantage of the
off-line availability of the models.

 . .     Predictive
Predictive       finds a feature space transformation that does not necessitate adapting
the models at all. This makes the complexity independent of the number of components
and makes it the fastest predictive linear transformation discussed in this work.
    Equation ( . ) looks like an O(Md ) operation. However, rewriting it as
  G^{(i)} = \underbrace{\sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_i^{(m)}}
      \begin{bmatrix} \Sigma^{(m)} + \mu^{(m)} \mu^{(m)\,T} & \mu^{(m)} \\ \mu^{(m)\,T} & 1 \end{bmatrix}}_{\text{cached}}
    + \begin{bmatrix} \Sigma_b^{(r)} & 0 \\ 0 & 0 \end{bmatrix}
      \underbrace{\sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_i^{(m)}}}_{\text{cached}}      ( . )

allows most of the statistics to be cached, so that the complexity becomes O(Rd ).
    As equation ( . ) shows, assembling k^(i) can be done fully off-line.
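The caching idea can be sketched in numpy as follows, assuming the extended-statistic layout shown in the equation above; the function names and array shapes are illustrative, not from the actual implementation:

```python
import numpy as np

def cache_stats(gammas, mus, Sigmas, sigma_i):
    """Off-line part: statistics that depend only on the models.
    gammas: (M,) occupancies; mus: (M, d) means; Sigmas: (M, d, d)
    covariances; sigma_i: (M,) the i-th diagonal model variances."""
    d = mus.shape[1]
    S = np.zeros((d + 1, d + 1))   # cached extended outer-product sum
    k = 0.0                        # cached weight sum
    for gamma, mu, Sigma, s in zip(gammas, mus, Sigmas, sigma_i):
        ext = np.empty((d + 1, d + 1))
        ext[:d, :d] = Sigma + np.outer(mu, mu)
        ext[:d, d] = mu
        ext[d, :d] = mu
        ext[d, d] = 1.0
        S += (gamma / s) * ext
        k += gamma / s
    return S, k

def assemble_G(S, k, Sigma_b):
    """On-line part: add the Joint covariance bias in O(d^2),
    independently of the number of components M."""
    d = Sigma_b.shape[0]
    G = S.copy()
    G[:d, :d] += k * Sigma_b
    return G
```

The on-line step touches only the cached (d+1)×(d+1) matrix and the bias, which is what makes the cost independent of the number of components.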

 . .     Predictive
Predictive       finds a transformation that does not rely on the covariance bias. It
turns out that therefore the part of G^(i) that depends on the model can be found off-line.
( . ) can be rewritten as

  G^{(i)} = \underbrace{\sum_{m} \frac{\gamma_m}{\sigma_{\mathrm{diag},i}^{(m)}} \Sigma^{(m)}}_{\text{cached}}
    + \Sigma_b^{(r)} \underbrace{\sum_{m} \frac{\gamma_m}{\sigma_{\mathrm{diag},i}^{(m)}}}_{\text{cached}}      ( . )

so that the total on-line cost of finding G^(i) is O(Md ). The time needed to find the
transformation does not depend on the number of components. However, since the

                                                            Half            Full
                                                            semi-tied       semi-tied
          Finding G^(i)       O(Rd )       O(Rd )       O(Md )          O(JMd )
          Inverting G^(i)     O(Rd )       O(Rd )       O(Rd )          O(JRd )
          Calculating c_i     O(IRd )      O(IRd )      O(IRd )         O(JIRd )
          Setting Σ̃^(m)                                 O(Md)           O(JMd )
          Setting µ̃^(m)                    O(Md )       O(Md )          O(Md )

   Table .   The complexity of estimating predictive transforms from a Joint transform.
   d is the number of dimensions of the feature vector; M is the number of components;
   R is the number of regression classes; I is the number of inner iterations; J is the
   number of outer iterations (see section  .  on page   ).

model means do need to be transformed, applying the transformation takes O(Md ).

 . .     Predictive semi-tied covariance matrices

The computational complexity of predictive semi-tied covariance matrices does not have
to be as great as it may seem from section  . . . Two observations can be made.
    The first one is that if the models' covariance matrices Σ^(m) are diagonal, it makes
sense to split the calculation of G^(i) into two parts. ( . ) suggests an O(Md ) complexity,
where M^(r) is the number of models in regression class r and d is the dimensionality
of the feature vectors. ( . ) can, however, be written as

  G^{(i)} = \underbrace{\sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} \Sigma^{(m)}}_{O(M^{(r)} d\,)}
    + \Sigma_b^{(r)} \underbrace{\sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}}}_{O(d\, + M^{(r)})}      ( . )

so that the total cost is O(Md ), assuming M ≫ Rd.
    Similarly, ( . ) suggests an O(Md ) cost. By again assuming that Σ^(m) is diagonal,

  \tilde{\Sigma}_{\mathrm{diag}}^{(m)} = \underbrace{\mathrm{diag}\!\left( A_P^{(r)} \Sigma^{(m)} A_P^{(r)\,T} \right)}_{O(M^{(r)} d\,)}
    + \underbrace{\mathrm{diag}\!\left( A_P^{(r)} \Sigma_b^{(r)} A_P^{(r)\,T} \right)}_{O(d\,)}      ( . )

making the total cost O(Md ) as well.
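The first of these splits can be sketched in numpy; this is a minimal illustration with names of my own choosing, assuming the diagonal model covariances are stored as rows of an (M, d) array:

```python
import numpy as np

def G_split(gammas, sigma_diag_i, Sigma_diags, Sigma_b):
    """G^(i) with the covariance bias pulled out of the sum.
    For diagonal model covariances the model part costs O(M d) and the
    bias part O(d^2 + M), instead of O(M d^2) for the naive sum."""
    w = gammas / sigma_diag_i                # (M,) per-component weights
    model_part = np.diag(Sigma_diags.T @ w)  # sum_m w_m * diag-matrix(Sigma^(m))
    return model_part + w.sum() * Sigma_b    # scalar-weighted bias, added once
```

The bias matrix is touched only once, after the scalar weight sum has been accumulated over the components; this is where the saving over the naive per-component sum comes from.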
    For the full scheme, the complexities are multiplied by the number of outer
iterations J. The half-iteration scheme for predictive semi-tied covariance matrices only
finds a transformation matrix A^(r) and does not adjust the covariances Σ^(m), apart from
adding the covariance bias, which is O(Md).

 . .     Summary
Table  .  on the preceding page details the time requirements for the approximations for
Joint transforms discussed in the previous sections. The naive implementation for
calculating the cofactors c_i takes O(Rd ) per iteration, but using the Sherman-Morrison
matrix-inversion lemma this can be reduced to O(Rd ) per iteration (Mark Gales,
personal communication). Inverting G^(i) takes O(Rd ) per iteration. By using the average
of the diagonal of Σ̃_diag^(m) rather than different σ̃_diag,i^(m) for 1 ≤ i ≤ d, it may be possible to
reduce this to O(Rd ) (Mark Gales, personal communication). In all cases, by allowing
for diagonal covariances on the models compensated for noise, the complexity
associated with decoding T observations with Joint uncertainty decoding is reduced by a
factor of d.
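The text does not spell out the exact cofactor computation, but the Sherman-Morrison identity it appeals to is generic: given an inverse, the inverse after a rank-one change costs O(d²) rather than the O(d³) of recomputing from scratch. A small sketch:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Inverse of (A + u v^T), given A^{-1}, in O(d^2) operations."""
    Au = A_inv @ u                                   # O(d^2)
    vA = v @ A_inv                                   # O(d^2)
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```

The update is valid whenever the scalar 1 + vᵀA⁻¹u is non-zero, i.e. whenever the updated matrix is itself invertible.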


Chapter   has discussed linear transformations for noise robustness that this chapter will
explore in practice. The experiments are conducted using the Hidden Markov Model
Toolkit (   ). An adapted version of the     .  from Liao and Gales (    a) was used to
estimate the Joint transforms on stereo data. The code for predictive linear transforma-
tions was added to      . , which supports semi-tied covariance matrices. Two extra
commands were implemented in the model editor HHEd. Both take a model, a Joint
transform, and a statistics file and estimate a linear transform and a new model. One
command estimates a predictive semi-tied covariance transform or a predictive
transform; the other estimates a predictive        transform.
    A noise-corrupted version of the Resource Management corpus and the        task
provide a testbed. Since the approximations aim at finding the best trade-off between
accuracy and speed, they will be compared with diagonal Joint distributions with similar
computational cost.

 .      Resource Management task
The naval       Resource Management corpus (Price et al.      ) is a medium vocabulary
(        words) database. It was recorded in a sound-isolated room using a head-
mounted Sennheiser         noise-cancelling microphone, yielding a high signal-to-
noise ratio of   dB.      speakers read       sentences of prompted script, varying in
length from   to   seconds. The Destroyer Operations Room noise from the        -
    - database, sampled at random intervals, was added to the recordings such that the
signal-to-noise ratio became   dB (Liao and Gales      ). Results are from three of the
four test sets: Feb  , Oct  , and Feb   (Sep   was not used). The results that will be
presented are averaged over the three sets.
    The speech recogniser is based on the Resource Management recipe distributed with
the Hidden Markov Model Toolkit (Young et al.      ). It uses  -dimensional feature
vectors consisting of        s and the log energy, and delta and delta-delta coefficients.
It is a cross-word state-clustered triphone system with six components per output
distribution and a bigram grammar. The Gaussian mixture models were trained with
iterative mixture splitting.

                                          dB          dB
                   Clean                   .           .
                   Diagonal Joint          .           .
                   Full Joint              .           .

     Table .   Reference system word error rates (in %) for Joint uncertainty decoding
     with    regression classes. Numbers are reproduced from Liao and Gales (    a),
     tables  ,  , and  .

 .            task
The        task (Hirsch and Pearce      ) is a more extreme task than the Resource
Management one, with a small vocabulary and lower signal-to-noise ratios. The clean
speech was taken from the  digits database of connected digits, and different real-world
noises were added. The clean data consists of        utterances by   male and   female
speakers.   different noise conditions are provided: four signal-to-noise ratios (  dB,
  dB,   dB, and   dB) with additive noise recorded at four places: a suburban train, a
crowd of people (babble), a car, and an exhibition hall. Each of the conditions has
sentences for matched training and       sentences for testing.
    The reference speech recogniser uses  -dimensional feature vectors consisting of
       s and the unnormalised log energy, and delta and delta-delta coefficients. The
acoustic models are whole-word digit models, each with   emitting states, three
mixtures per state, and silence and inter-word pause models.

 .      Reference systems
This work finds a trade-off between the computational cost of noise-robustness
techniques and their recognition accuracy. Table  .  contains word error rates on both the
Resource Management task with a signal-to-noise ratio of   dB and the          task
with a signal-to-noise ratio of   dB. It is felt that these conditions provide an interesting
balance between performance and difficulty of the tasks. The numbers will illustrate why
to compensate models for noise with ( ) Joint uncertainty decoding; ( ) full covariance
Joint uncertainty decoding; and ( ) feature space transformations.
    In these noise conditions the clean models perform badly. Word error rates for the
diagonal Joint transformation, estimated on stereo data, make the case for model
compensation for noise with Joint uncertainty decoding. However, the fact that the
diagonal version does not perform as well as the full-covariance-bias one indicates the
importance of full covariances in noise.
    Table  .  on the facing page shows word error rates of single-pass retrained systems
(see section  .  on page   ). These systems effectively provide upper bounds on the
performance of different forms of model compensation. The matched diagonal covariance
systems have the maximum obtainable performance with diagonal covariance matrices.
The word error rates of the matched semi-tied covariance systems form the motivation
for the subject of this work: finding optimal feature space transformations for noise
robustness.

                                                  dB          dB
                   Diagonal covariance             .           .
                   Semi-tied covariance            .           .

     Table .   Matched system word error rates (in %). Numbers for diagonal systems
     are reproduced from Liao and Gales (    a), tables   and  .

                              Iteration   Auxiliary function
                                                -  .
                                  ½             -  .
                                                -  .

     Table .   Values for the auxiliary function while training semi-tied transforma-
     tions on the Resource Management corpus. See figure  .  on the following page for
     the graphs.

 .      Predictive semi-tied covariance matrices
Section  .  on page     has shown how to estimate semi-tied covariance matrices. The
procedure alternates between finding the new transformation A_P and the new diagonal
covariance Σ̃_diag^(m). Updating both constitutes an iteration. The latter half of an iteration
is one matrix chain multiplication, but the former half iteratively updates the rows of
the transformation in an inner loop. The number of iterations for both the inner and
the outer loop in these experiments is   . Higher values did not improve the results.
    Figure  . (a) on the next page depicts the value of the auxiliary function, which gives
a lower bound on the log-likelihood, while estimating a  -component predictive semi-
tied covariance transformation. It is clear that the large leap forward is taken during
the first update of A_P, the first half iteration. It is unclear from figure  . (a), however,
whether the iterations after that add anything. Figure  . (b) therefore contains the graph
from that point only, and shows the monotonic increase that expectation-maximisation
guarantees. Table  .  quantifies the great improvement in the first half iteration, and
the small improvement after that.
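The alternating procedure can be sketched in numpy, following the standard row-by-row semi-tied update of Gales (    ); the variable names are illustrative, and W[m] stands for the full covariance statistic of component m:

```python
import numpy as np

def estimate_semi_tied(W, gammas, n_outer=10, n_inner=10):
    """Alternate between (1) re-estimating the diagonal variances under
    the current transform A and (2) updating A row by row, each row from
    its own statistic G^(i) and the cofactor row of A."""
    M, d, _ = W.shape
    beta = gammas.sum()
    A = np.eye(d)
    for _ in range(n_outer):
        # latter half-iteration: diagonal variances under the current A
        var = np.array([np.diag(A @ W[m] @ A.T) for m in range(M)])  # (M, d)
        # former half-iteration: inner loop over the rows of A
        for _ in range(n_inner):
            for i in range(d):
                G = sum(gammas[m] / var[m, i] * W[m] for m in range(M))
                c = np.linalg.inv(A).T[i] * np.linalg.det(A)  # cofactor row of A
                Gc = np.linalg.solve(G, c)                    # G^{-1} c
                A[i] = Gc * np.sqrt(beta / (c @ Gc))          # closed-form row update
    return A
```

Each row update is the exact maximiser of the auxiliary function for that row with the other rows and the variances held fixed, which is what gives the monotonic increase described above.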
    Testing the different transformations shows that the performance follows the log-
likelihood. Table  .  on page    has the results. The initial situation, with A_P = I,
results in a somewhat strange Joint transform, with a full A_J^(r) and a diagonal Σ̃_diag^(m),
which is known to lead to abysmal performance (Liao and Gales      ). Taking the first
leap in log-likelihood, by running only half an iteration to find the transformation in

     [Figure  . : graphs of the auxiliary function against the outer training iteration.
      (a) The auxiliary function over iterations  – .  (b) The auxiliary function over
      iterations ½– .]

                 Figure .   Values for the auxiliary function while training semi-tied transforma-
                 tions on the Resource Management corpus. See table  .  on the preceding page for
                 the numbers.

                                                   Word error rate (%)
                                      Diagonal               .
                                      Full                   .
                                      Initial                .
             Predictive semi-tied     Half                   .
                                      Full                   .

   Table .   Word error rates (in %) comparing Joint uncertainty decoding transformations
   with predictive semi-tied covariance matrices on the Resource Management corpus
   at a signal-to-noise ratio of   dB. Numbers in the top half are baseline numbers
   from table  .  on page   .

                                                    Signal-to-noise ratio
                                                  dB       dB       dB       dB
                                   Diagonal        .        .        .        .
                                   Full            .        .        .        .
           Predictive semi-tied    Half            .        .        .        .
                                   Full            .        .        .        .

   Table .   Word error rates (in %) comparing Joint uncertainty decoding transformations
   with predictive semi-tied covariance matrices on        . Numbers in the top half
   are reproduced from Liao and Gales (    a), table  .

( . ) on page   , also yields a large leap in recognition accuracy. Running the remaining
 ½ iterations, so that the transformation ends up as in ( . ) on page   , gives only
another  .   absolute reduction in word error rate.

    Perhaps surprisingly, estimating the predictive semi-tied covariance matrices with
the full scheme yields better performance than the full Joint transformation, the form
that was approximated in the first place. The paradox is resolved by the observation
that the semi-tied transformation can optimise the feature space in which the calcu-
lations take place, which Joint transforms cannot do. Indeed, the performance ceiling
for the predictive semi-tied system was given in table  .  on page   . With a difference
of only  .    absolute, the predictive version based on the Joint uncertainty decoding
transform performs very well.

    Table  .  has word error rates on the        corpus at varying signal-to-noise ratios.
Comparing the higher signal-to-noise ratios with the lower, more extreme, ones, where
the covariance bias should have a larger influence, two observations can be made. First,
both forms of semi-tied covariance matrices follow close on the heels of full Joint
uncertainty decoding, whereas the gap between diagonal and full Joint becomes larger.
Second, with a heavier covariance bias a transformation of the diagonalised covariance
does become more desirable. But, again, the predictive schemes are able to keep up with
the performance of the original Joint transformation.






     [Figure  . : graphs of the auxiliary function against the training iteration.
      (a) The auxiliary function over all iterations.  (b) The auxiliary function
      after iteration  .]

                 Figure .   Values of the auxiliary function while training        covariance trans-
                 formations on the Resource Management corpus.

                                                 Word error rate (%)
                                    Diagonal               .
                                    Full                   .

   Table .   Word error rates (in %) comparing Joint uncertainty decoding transformations
   with predictive        transforms on the Resource Management corpus at a
   signal-to-noise ratio of   dB. Numbers in the top half are baseline numbers from
   table  .  on page   .

                                                  Signal-to-noise ratio
                                                dB       dB       dB       dB
                                 Diagonal        .        .        .        .
                                 Full            .        .        .        .
                                                 .        .        .        .
                                                 .        .        .        .

   Table .   Word error rates (in %) comparing Joint uncertainty decoding transformations
   with predictive linear transformations on        . Numbers in the top half
   are reproduced from Liao and Gales (    a), table  .

 .      Transformations without covariance bias

The predictive semi-tied covariance schemes retain a covariance bias. Sections  . .  on
page    and  . .  on page    have introduced approximations that do not require this,
and are therefore faster (see section  .  on page   ): predictive       , and predictive
      .
    Predictive        is basically the same as the half-iteration semi-tied covariance
scheme, but without the bias. It is estimated in the same way. However, since the initial
covariance is a worse approximation of the real covariance in noise, it takes more
iterations to converge to the final transform. For the Resource Management experiments,
performance does not increase after    iterations. On the        task, however, which
has more extreme conditions, the        experiments need      iterations, unlike any
other transformations. Figure  . (a) on the preceding page shows the auxiliary function
while estimating a  -component predictive covariance transform, and figure  . (b) on
the facing page zooms in on everything but the first iteration.
    The results on the Resource Management task are in table  . . Predictive       ,
the more powerful method, performs better than predictive       . The results on the
       task are in table  . . Surprisingly enough, predictive        consistently performs
slightly better than predictive       . This may be caused by the estimation of
predictive       , which uses expectation-maximisation, getting stuck in a local maximum.

                             Half semi-tied
                             Full semi-tied

   Table .   The total number of iterations (I · J) performed to estimate the predictive
   transforms on the Resource Management and        tasks.

                                                           dB            dB
             Clean                                          .             .
                           Diagonal                         .             .
             Joint         Full                             .             .
                           Diagonal,      classes           .             .
                                                            .             .
                                                            .             .
                           Half semi-tied                   .             .
                           Full semi-tied                   .             .

   Table .   Word error rates (in %) for the salient data sets at different signal-to-
   noise ratios. Transformations use    regression classes except as noted. Numbers
   in the top half are reproduced from Liao and Gales (    a), tables  ,  , and  .

 .      Practical considerations
Section  .  on page    has discussed the computational cost of the four proposed
predictive estimation schemes; this chapter has discussed the gains. The objective of this
work is to find a way to strike a balance between computational cost and precision. Joint
uncertainty decoding with a full covariance bias is felt to be too slow; Joint uncertainty
decoding with a diagonal covariance bias is felt to lose too much accuracy. Even when
the number of regression classes is increased, the performance of the diagonal version
tends to plateau.
    Table  .  compares the relevant systems on the salient data sets. Diagonal and full
Joint with    regression classes are the variants the predictive methods strike a balance
between. In practice, diagonal Joint with      regression classes is the competitor in terms
of efficiency. On the        task with the more extreme noise condition, a signal-to-
noise ratio of   dB, all linear transformation methods effect a gain in accuracy compared
to diagonal Joint. This confirms the need for feature space transformations in more
adverse noise conditions.
    On the Resource Management corpus at a signal-to-noise ratio of   dB only the
predictive semi-tied covariance schemes can compete with the Joint uncertainty decod-
ing transforms that use a covariance bias. The two schemes without covariance biases,
predictive        and predictive       , cannot keep up with even diagonal Joint
transformations. Predictive        is unlikely to be used in practice: the means of
the models need to change anyway, and since the adaptation scheme must transform the
means, it might as well transform the variances too and gain the reduction in error rate.
Also, as table  .
shows, the total number of iterations under lower signal-to-noise ratios is higher for
predictive            .
    The efficiency of adaptation with predictive       , however, does not depend on
the number of models at all. Since it is so efficient, it may be an alternative to diagonal
Joint in embedded systems where computational speed is an important issue, or serve as
an initialisation for adapting       transforms.


This work has looked into predictive linear transformations for noise robustness. On-
line speech recognisers, say systems embedded in cars, need to cope with changing noise
conditions. Joint uncertainty decoding can estimate a noise transform from only a few
frames of noise. However, adapting the models and decoding is slow. A linear adaptation
method such as        has little impact on decoding speed, but it needs noisy speech
data to train on, so that recognition cannot start right away. This work, by estimating
a Joint transform and using the statistics predicted by it to find a linear transform, has
combined the instant availability of the one and the high throughput of the other.
    Various predictive linear transformation schemes have been presented. The out-
come is that the cost of decoding with Joint transforms can be effectively combated in
two ways.

Predictive semi-tied covariance matrices make it possible to use diagonal covariance
      matrices rather than Joint uncertainty decoding's full ones.

Predictive        completely removes the need to change the models.

    The presented schemes have been tested on artificially corrupted data. Predictive
semi-tied covariance matrices have yielded word error rates similar to the original full
Joint transformation, sometimes even better. Predictive        has shown word error
rates around those of diagonal Joint transformations, but without the cost of adapting
the models. In conclusion, Joint uncertainty decoding's trademark covariance bias has
proven to be indispensable for optimal performance; but with the right feature space
transformation it does not have to be full.

 .      Future work
Estimating predictive transforms is essentially taking the long way around. The advan-
tage is that it needs only a few frames of data to estimate the intermediate transform,
which is found by Joint uncertainty decoding in this work. Artificially corrupted data
has various advantages as a testbed, not the least of which is that it is easy to control
the circumstances. For testing predictive transforms this means that a transform of the
same form can be directly estimated on the data, providing the maximum obtainable
performance. If estimating transforms predictively from other transforms is the long
way around, then estimating them directly on stereo data is the shortcut. The difference
between the two quantifies the difference between the predictive transform and the
ideal one.
    By comparing the performance of single-pass retrained and predictively trained
systems, feature space transformations from Joint uncertainty decoding estimations have
appeared almost as good as their ideal counterparts under laboratory conditions. The
next step is to apply the techniques pioneered in this work to real-world conditions,
and to Joint transformations estimated with, for example, Joint adaptive training (Liao
and Gales      b). This will be the starting point for further research.

Alex Acero (     ). Acoustical and Environmental Robustness in Automatic Speech Recog-
  nition. Ph.D. thesis, Carnegie Mellon University.

Alex Acero, Li Deng, Trausti Kristjansson, and Jerry Zhang (       ). “HMM adapta-
  tion using vector Taylor series for noisy speech recognition.” In Proceedings of the
  International Conference on Spoken Language Processing. vol. , pp.    – .

Beverly Collins and Inger Mees (    ). The Phonetics of English and Dutch. Brill, Leiden.

A. P. Dempster, N. M. Laird, and D. B. Rubin (   ). “Maximum Likelihood from In-
  complete Data via the     Algorithm.” Journal of the Royal Statistical Society ( ),
  pp. – .

V. V. Digalakis, D. Rtischev, and L. G. Neumeyer (      ). “Speaker Adaptation Using
   Constrained Estimation of Gaussian Mixtures.”       Transactions on Speech and Au-
   dio Processing ( ), pp. – .

M. J. F. Gales and P. C. Woodland (   ). “Mean and Variance Adaptation within the
         Framework.” Computer Speech and Language , pp.     – .

M. J. F. Gales and S. J. Young ( ). “Robust speech recognition in additive and convo-
 lutional noise using parallel model combination.” Computer Speech and Language ,
 pp.       – .

M. J. F. Gales and S. J. Young (    ). “Robust continuous speech recognition using
 parallel model combination.”       Transactions on Speech and Audio Processing ( ),
 pp. – .

Mark J. F. Gales ( ). Model-Based Techniques for Noise Robust Speech Recognition.
 Ph.D. thesis, Cambridge University.

Mark J. F. Gales ( a). “Maximum Likelihood Linear Transformations for            -based
 Speech Recognition.” Computer Speech and Language ( ), pp. – .

Mark J. F. Gales ( b). “Predictive Model-Based Compensation Schemes for Robust
 Speech Recognition.” Speech Communication ( - ), pp. – .

Mark J. F. Gales (   ). “Semi-Tied Covariance Matrices for Hidden Markov Models.”
       Transactions on Speech and Audio Processing ( ), pp.   – .
Hynek Hermansky (      ). “Perceptual linear predictive (    ) analysis of speech.” The
  Journal of the Acoustical Society of America ( ), pp.       –     .

Hans-Günter Hirsch and David Pearce (    ). “The          experimental framework
  for the performance evaluation of speech recognition systems under noise condi-
  tions.” In Proceedings of -    . pp.  – .

J. Junqua and Y. Anglade (       ). “Acoustic and perceptual studies of Lombard speech:
    application to isolated-word automatic speech recognition.” In Proceedings of the In-
    ternational Conference on Acoustics, Speech, and Signal Processing. pp. – .

Do Yeong Kim, Chong Kwan Un, and Nam Soo Kim (         ). “Speech recognition in
  noisy environments using first-order vector Taylor series.” Speech Communication
    , pp.  – .

C. J. Leggetter and P. C. Woodland (   ). “Maximum likelihood linear regression for
  speaker adaptation of continuous density hidden Markov models.” Computer Speech
  and Language ( ), pp. – .

Hank Liao and Mark J. F. Gales (   ). “Uncertainty Decoding for Noise Robust Auto-
  matic Speech Recognition.” Tech. Rep.      / -       / . , Cambridge Univer-
  sity Engineering Department.

Hank Liao and Mark J. F. Gales (     ). “Uncertainty Decoding for Noise Robust Speech
  Recognition.” In Proceedings of Interspeech.

Hank Liao and Mark J. F. Gales (    a). “Issues with Uncertainty Decoding for Noise
  Robust Speech Recognition.” Tech. Rep.       / -      / . , Cambridge Univer-
  sity Engineering Department.

Hank Liao and Mark J. F. Gales (   b). “Joint Uncertainty Decoding for Robust Large
  Vocabulary Speech Recognition.” Tech. Rep.      / -       / . , Cambridge Uni-
  versity Engineering Department.

Pedro J. Moreno (    ). Speech Recognition in Noisy Environments. Ph.D. thesis,
  Carnegie Mellon University.

P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett (    ). “The                 -word
   resource management database for continuous speech recognition.” In Proceedings
   of the International Conference on Acoustics, Speech, and Signal Processing. vol.  , pp.
       – .

A. J. Viterbi (  ). “Error bounds for convolutional codes and an asymptotically optimum
  decoding algorithm.”      Transactions on Information Theory  , pp.        – .

S. J. Young and P. C. Woodland (      ). “State clustering in         -based continuous
   speech recognition.” Computer Speech and Language ( ), pp.          – .
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying
   (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho
   Valtchev, and Phil Woodland (    ). “The      Book (for      Version  . ).” URL
