Optimal Feature Spaces for Noise-Robust Speech Recognition

Girton College
University of Cambridge

Submitted for the degree of Master of Philosophy in Computer Speech, Text and Internet Technology
June

Declaration

I, Rogier van Dalen, of Girton College, a candidate for the M.Phil. in Computer Speech, Text and Internet Technology, hereby declare that this dissertation and the work described in it are my own work, unaided except as specified, and that the dissertation does not contain material that has already been used to any substantial extent for any comparable purpose. I also declare that this dissertation contains words, including footnotes, appendices and bibliography, and that this is less than words, as prescribed in the Special Regulations of the M.Phil. examinations for which I am a candidate.

Acknowledgements

I would like to thank my supervisor, Mark Gales, for coming up with the idea of this project, and helping me along throughout with answers, advice and new ideas. I would also like to thank Hank Liao for giving me access to his experimental set-ups, for his comments, and for figure 1.1.

Contents

1 Introduction
2 Automatic speech recognition
    2.1 Parameter estimation with expectation–maximisation
    2.2 Extracting features from audio
3 Noise robustness
    3.1 Single-pass retraining
    3.2 Parallel model combination
    3.3 Vector Taylor series
    3.4 Joint uncertainty decoding
4 Linear transformations
    4.1 Maximum likelihood linear regression
        4.1.1 CMLLR
        4.1.2 Covariance MLLR
    4.2 Semi-tied covariance matrices
5 Linear transformations for noise robustness
    5.1 Predictive linear transforms
        5.1.1 Predictive
        5.1.2 Predictive
        5.1.3 Half-iteration predictive semi-tied covariance matrices
        5.1.4 Predictive semi-tied covariance matrices
    5.2 Computational complexity
        5.2.1 Predictive
        5.2.2 Predictive
        5.2.3 Predictive semi-tied covariance matrices
        5.2.4 Summary
6 Experiments
    6.1 Resource Management task
    6.2 task
    6.3 Reference systems
    6.4 Predictive semi-tied covariance matrices
    6.5 Transformations without covariance bias
    6.6 Practical considerations
7 Conclusion
    7.1 Future work

Chapter 1 Introduction

Automatic speech recognition has improved so much over the years that it is becoming a standard feature in mobile phones, call centres, and operating systems. The world market leader in dictation systems claims a – accuracy for its main product for desktop computers. Even if this estimate is valid, this degree of reliability requires a high-quality microphone and a noise-free environment.

Noise is still a major stumbling block for speech recognisers. With the Resource Management recipe distributed with the Hidden Markov Model Toolkit, the speech recogniser obtains a . word error rate.
When noise is added at a dB signal-to-noise ratio, this becomes . . Speech recognisers trained on clean data cannot handle noisy data.

The goal of Joint uncertainty decoding (Liao and Gales ) is to compensate a speech recogniser for noise and integrate the uncertainty about the observations into the acoustic models without the computational cost that comes with marginalising over the noise components. It models the uncertainty caused by the noise with a covariance bias, which is added to the models' covariance matrices. This makes decoding with Joint uncertainty decoding less efficient than is desirable. The first problem is that models acquire full covariance matrices even when the original models have diagonal covariance matrices. Full covariance matrices cause much of the computational cost of evaluating Gaussian distributions. Liao and Gales ( ) found that simply diagonalising the resulting matrices results in poorer performance. The second problem is the existence of the bias at all. It makes adapting to changing noise conditions much more expensive. To solve these two problems, this work proposes approximations of Joint transforms that trade some accuracy for efficiency.

For an example where efficiency is important, imagine an embedded speech recogniser in a car, with no memory or time to spare. Given a few frames of background noise, it would instantaneously estimate a Joint transformation and convert it to a form that is cheap to use in decoding.

Noise changes the nature of the resulting observation. The effect of different noise conditions can be seen in figure 1.1. The covariances are markedly different. At the heart of the solution for the two problems with Joint uncertainty decoding lies the intuition that noise changes the feature space. To compensate the recogniser for noise, the correlations that appear under the influence of noise should be taken into account. This work will consider four different linear feature space transformations. Two forms will be presented that solve problem (1) by retaining a covariance bias, but making it diagonal. Two other forms fully eliminate the covariance bias, solving problem (2). The latter are the most efficient.

Figure 1.1 Real-world speech data at various noise conditions. Depicted are the means and variances of the first two MFCCs. Changes in noise conditions cause changes in feature spaces. Figure provided by Hank Liao on data from Toshiba Research Europe Ltd.

The linear transformations are normally estimated from audio data. In this work, however, they are estimated in a predictive fashion from Joint transformations. The Joint transformations, in turn, are estimated on stereo data (clean speech and artificially corrupted speech). This is not realistic: the uncorrupted speech is not available to the speech recogniser in the car. However, it does make it possible to evaluate the linear transformations against their ideal counterparts, estimated directly on the stereo data.

Though this work is meant for readers with general knowledge of speech recognition technology, chapter 2 gives a short introduction. Chapter 3 introduces methods for noise robustness. Chapter 4 tells how linear adaptation methods work and how they are normally estimated. Against this background chapter 5 explains how to estimate linear transformations from Joint transformations and details the consequences for the computational cost. Chapter 6 brings the theory into practice and finds the resulting recognition accuracy. Chapter 7 summarises the achievements in the light of practical noise-robust speech recognition.

Chapter 2 Automatic speech recognition

State-of-the-art speech recognisers are based on probabilistic models. To extract the most likely words $\hat{W}$ from a sequence of observations $O = \{o(1), \ldots, o(T)\}$, Bayes' rule is used:

$$\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_W P(O \mid W)\,P(W),$$

so that the recognition process can be divided into the acoustic model $P(O \mid W)$, which determines how words are realised, and the language model $P(W)$, which determines how likely a sequence of words is. This work focuses on the former.

The acoustics of speech are modelled by hidden Markov models (HMMs). Hidden Markov models assume that every observation is generated by one state of a network, that transitions between states are probabilistic and depend only on the previous state (the Markov assumption), and that the observations are independent except through the states. Figure 2.1 contains a simple network modelling one phone (a small unit of speech), with a beginning, a middle bit, and a last bit, each characterised by its own emitting state. The entry and exit states are non-emitting states, used to connect networks. Concatenating phone networks creates word networks; sequences, or more complicated graphs, of word networks form sentence networks. The arrows between states indicate transition probabilities $a_{ij}$, which are the probability of being in state $j$ at time $t$ given that the state at time $t-1$ was $i$. Speech recognition comes down to finding the most likely path through the network $X = \{x(1), \ldots, x(T)\}$. The Viterbi algorithm (Viterbi ) does this.

Figure 2.1 A standard three-state hidden Markov model that represents a phone. The entry and exit states are non-emitting states used to concatenate models.

Observations consist of feature vectors. HMM states model the observations with Gaussian distributions, or mixtures of Gaussian distributions. The probability of component $m$ of state $j$ generating the observation given that the state is $j$ is given by $c_m^{(j)}$, with $\sum_{m=1}^{M^{(j)}} c_m^{(j)} = 1$. Given that component $m$ generated observation $o(t)$, its distribution is modelled as a Gaussian distribution

$$p(o(t) \mid m) = \mathcal{N}\big(o(t);\, \mu^{(m)}, \Sigma^{(m)}\big),$$

so that

$$p(o(t) \mid x(t) = j) = b^{(j)}(o(t)) = \sum_{m=1}^{M^{(j)}} c_m^{(j)} \cdot \mathcal{N}\big(o(t);\, \mu^{(jm)}, \Sigma^{(jm)}\big),$$

where $o(t)$ is represented as a feature vector, and $\mu^{(m)}$ and $\Sigma^{(m)}$ are the mean and the covariance, respectively, of the Gaussian. It is these Gaussian output distributions that common adaptation schemes target. They are also the focus of this work.

Depending on how "phone" is defined, there are around phones in the English language (Collins and Mees ), but their realisations differ because of various factors that include the context. By introducing multiple models depending on the phones left and right, rather than one model per phone, co-articulation effects can be taken into account. The resulting models are identified by the name of the left and right context, and the phone itself. They are therefore called "triphones". Because many combinations of three phones do not occur often enough for them to be trained properly, several of them may be clustered with a decision tree, so that they are represented by one model (Young and Woodland ).

2.1 Parameter estimation with expectation–maximisation

It would be best to train the parameters of hidden Markov models in such a way as to maximise the likelihood of the training data. This is called "maximum likelihood estimation". Hidden Markov models contain two types of parameters that must be trained: the transition probabilities between states, and the output distributions for states. To find maximum likelihood estimates for both, the state sequence must be known. The state sequence, however, is unobserved. Dempster et al. ( ) proposed an iterative algorithm for finding a maximum-likelihood estimate with incomplete data called "expectation–maximisation". First a distribution of the missing data, the state sequence $X$, is found given the previous parameters $\lambda^{(k)}$ and the observations $O$. The auxiliary function $Q(\lambda, \lambda^{(k)})$ is defined as

$$Q\big(\lambda, \lambda^{(k)}\big) = \sum_{X \in \mathcal{X}} P\big(X \mid O, \lambda^{(k)}\big)\, \log P(O, X \mid \lambda),$$

with $\mathcal{X}$ the space of all possible state sequences.
The expectation–maximisation algorithm guarantees that an increase in the auxiliary function leads to an increase in the likelihood of the data. The new parameters $\lambda^{(k+1)}$ are then set to the maximum likelihood estimate given the state sequence distribution and the observation:

$$\lambda^{(k+1)} = \arg\max_{\lambda} Q\big(\lambda, \lambda^{(k)}\big).$$

This process is repeated until convergence.

Applied to hidden Markov models, the expectation step finds the expected value of indicator variables that are 1 if the state at time $t$ is $j$, and 0 otherwise. That is, it finds the state posteriors

$$\gamma^{(j)}(t) = P\big(x(t) = j \mid O, \lambda^{(k)}\big).$$

To find the state posteriors, forward probabilities $\alpha^{(j)}(t)$ and backward probabilities $\beta^{(j)}(t)$ need to be computed. They can be computed recursively from the beginning and the end of the observation sequence, respectively. They are defined as

$$\alpha^{(j)}(t) = p\big(o(1) \ldots o(t),\, x(t) = j \mid \lambda^{(k)}\big),$$

$$\beta^{(j)}(t) = p\big(o(t+1) \ldots o(T) \mid x(t) = j, \lambda^{(k)}\big),$$

so that

$$\alpha^{(N)}(T) = p\big(O \mid \lambda^{(k)}\big).$$

The posterior probability is then given by

$$\gamma^{(j)}(t) = P\big(x(t) = j \mid O, \lambda^{(k)}\big) = \frac{\alpha^{(j)}(t)\,\beta^{(j)}(t)}{p\big(O \mid \lambda^{(k)}\big)}.$$

By going through the observations and gathering statistics for every component, weighting by $\gamma^{(m)}(t)$, maximum-likelihood estimates for the new hidden Markov model parameters $\lambda^{(k+1)}$ can be found as follows:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \alpha^{(i)}(t)\, a_{ij}\, b^{(j)}(o(t+1))\, \beta^{(j)}(t+1)}{\sum_{t=1}^{T} \alpha^{(i)}(t)\, \beta^{(i)}(t)},$$

$$\hat{\mu}^{(j)} = \frac{\sum_{t=1}^{T} \gamma^{(j)}(t)\, o(t)}{\sum_{t=1}^{T} \gamma^{(j)}(t)},$$

$$\hat{\Sigma}^{(j)} = \frac{\sum_{t=1}^{T} \gamma^{(j)}(t)\, \big(o(t) - \mu^{(j)}\big)\big(o(t) - \mu^{(j)}\big)^{\mathsf T}}{\sum_{t=1}^{T} \gamma^{(j)}(t)}.$$

To extend the parameter estimation to components, they can be seen as states. The forward–backward algorithm applies in exactly the same way. The posterior probability of component $m$ is written $\gamma^{(m)}(t)$. It is sometimes convenient to express the total occupancy of a component directly:

$$\gamma_m = \sum_{t=1}^{T} \gamma^{(m)}(t).$$

Setting model parameters by finding weighted averages for them in the observations is done not only for HMM parameters, but also for some adaptation scheme parameters that will be discussed in chapter 4. To keep the number of parameters low, model transformations that make the model match the data better are usually tied over a group of components, called a regression class (Leggetter and Woodland ). Since it is not in general known in advance which components will benefit from the same transformation, it is often assumed that models with similar parameters will be similarly transformed. For example, regression classes can be found by bottom-up clustering based on the Kullback–Leibler distance between distributions. This results in the acoustic space being divided into regions. This work will denote regression classes by $r$.

2.2 Extracting features from audio

The nature of the observations has not been discussed yet. They consist of a vector of "features". Since the raw audio samples do not ostensibly carry much information, the samples are processed before they end up being represented as feature vectors. One feature vector represents a segment of audio so short, usually ms at every ms, that the speech signal can be assumed to be stable during this period. (Because consecutive segments overlap, this by definition breaks the assumption that HMMs make that subsequent feature vectors are independent.) Taking such a segment, applying a window (for example, a Hamming window), and then applying a Fourier transform produces the spectrum for the audio segment.

When the log amplitudes of the spectrum are taken, the parameters in the log-spectral domain are found. To model the ear's higher sensitivity to low frequencies, the log-magnitude spectrum is mapped onto the Mel scale by applying triangular windows to it. Then the discrete cosine transform is used, so that the Mel frequency "cepstrum" (an anagram of "spectrum") is obtained. The discrete cosine transform is a simplified version of the Fourier transform.
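The first steps of this pipeline (framing, windowing, Fourier transform, log amplitude) can be sketched as follows. This is only a minimal illustration: the function name, the frame length and shift passed in by the caller, and the small floor applied before the logarithm are assumptions made for the example, not values taken from this work.

```python
import numpy as np

def log_magnitude_spectrum(samples, frame_length, frame_shift):
    """Return one log-magnitude spectrum per frame (rows are frames).

    frame_length and frame_shift are given in samples; both are
    hypothetical parameters chosen by the caller.
    """
    samples = np.asarray(samples, dtype=float)
    n_frames = 1 + (len(samples) - frame_length) // frame_shift
    window = np.hamming(frame_length)
    spectra = []
    for k in range(n_frames):
        start = k * frame_shift
        # Overlapping segment, tapered by a Hamming window.
        frame = samples[start:start + frame_length] * window
        magnitude = np.abs(np.fft.rfft(frame))
        # Floor the magnitude so that silent frames do not give log(0)
        # (an assumption of this sketch).
        spectra.append(np.log(np.maximum(magnitude, 1e-10)))
    return np.array(spectra)
```

The Mel filterbank and the discrete cosine transform would then be applied to each row of the result.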
It takes advantage of the fact that the log magnitude spectrum is real-valued and symmetric around 0. If $B$ is the number of filterbank channels, then the DCT can be expressed as a matrix $C$ with elements

$$c_{ij} = \cos\left(\frac{i\,\big(j - \tfrac{1}{2}\big)\,\pi}{B}\right).$$

By applying the discrete cosine transform to the Mel log-magnitude spectrum, Mel frequency cepstral coefficients (MFCCs) are obtained. It is usual for feature vectors to contain Mel frequency cepstral coefficients and the energy. First-order and second-order differentials are also added to capture the direction of the coefficient changes.

Some methods have been proposed that increase the inherent robustness of the features against noise. The simplest method of adapting the data is called cepstral mean normalisation. It normalises cepstral feature vectors of an utterance by subtracting the mean. This compensates for a linear filter on the original signal. Its simplicity and effectiveness have made it ubiquitous.

Another feature extraction scheme is called perceptual linear predictive (PLP) analysis (Hermansky ). The coefficients resulting from it are slightly more noise-robust than MFCCs.

Chapter 3 Noise robustness

Speech recognisers are usually trained on clean speech data (recorded with a high-quality microphone, with little background noise). They are often used on noisy data (e.g. a mobile phone-style microphone in a noisy environment). The mismatch between the models and the observation causes performance to plummet. This can be alleviated in two ways. One is to make the model match the observations; the other to make the observations match the model. Doing either requires a model of the difference between the clean speech (hopefully modelled by the clean acoustic models) and the observations.
If the signal $x[m]$ is the clean speech, $h[m]$ is the convolutional noise, capturing microphone and room characteristics, and $n[m]$ is the additive noise, capturing background noise, then the standard model of the influence of acoustic noise is (Acero )

$$y[m] = x[m] * h[m] + n[m].$$

This assumes that the microphone and room characteristics can be characterised by a linear channel filter $h[m]$. As the dynamic Bayesian network in figure 3.1 shows, it also assumes that the clean speech is independent of the noise, which is unlikely to be true since people counteract noise by altering their speech. This is called the "Lombard effect"; it is described by Junqua and Anglade ( ), and this work ignores it.

Figure 3.1 Dynamic Bayesian network of the noise model, with the emitting states shaded. Reproduced with permission from Liao and Gales ( b).

Speech recognisers commonly use MFCCs (see section 2.2), so the noise model must be reformulated as a function of the cepstral descriptions of the clean speech vector, $x$, of the convolutional noise, $h$, and of the additive noise, $n$ (without differentials). Using the discrete cosine transform $C$ (see section 2.2) and its rows $c_i$, the elements of the noisy speech vector $y$ are given by

$$y_i = c_i \log\left(e^{C^{-1}(x+h)} + e^{C^{-1} n}\right) = x_i + h_i + c_i \log\left(1 + e^{C^{-1}(n - x - h)}\right).$$

This formulation makes clear that the influence of noise in the cepstral domain is highly non-linear.

This work adapts the models to the data to compensate for noise. Section 3.1 discusses single-pass retraining, which trains new models from stereo data. Model adaptation techniques that improve noise robustness specifically approximate the cepstral mismatch function above. Sections 3.2–3.4 discuss parallel model combination, vector Taylor series, and Joint uncertainty decoding, respectively.

3.1 Single-pass retraining

The models can be made to better match the data by taking a speech recogniser trained on clean speech and retraining it on noisy speech.
A clean speech recogniser, however, will not immediately get the posteriors right when fed noisy speech. For the artificially corrupted corpora used in this work, the clean data is also available. It is therefore possible to find the posteriors from the clean models on the clean data, but accumulate statistics for the new parameters from the noisy data. This is called single-pass retraining (Gales ). The component posteriors are given by

$$\gamma^{(m)}(t) = P(x(t) = m \mid S, \lambda),$$

where $S$ is the clean speech data. The means and variances are then trained with

$$\mu^{(m)} = \frac{\sum_{t=1}^{T} \gamma^{(m)}(t)\, o(t)}{\sum_{t=1}^{T} \gamma^{(m)}(t)},$$

$$\Sigma^{(m)} = \frac{\sum_{t=1}^{T} \gamma^{(m)}(t)\, \big(o(t) - \mu^{(m)}\big)\big(o(t) - \mu^{(m)}\big)^{\mathsf T}}{\sum_{t=1}^{T} \gamma^{(m)}(t)}.$$

The component weights and transition probabilities are not changed.

Single-pass retraining provides a ceiling for the performance of the predictive transformations. The Joint transformations in this work have been estimated from stereo data, and form the statistics on which the linear transformations are estimated. The ideal predictive transforms would therefore be equal to those directly estimated with single-pass retraining.

3.2 Parallel model combination

Parallel model combination (Gales and Young ) is a model compensation technique. It can compensate for additive noise through a noise model estimated from a few frames with only noise. With a small amount of adaptation data, it can also handle convolutional noise. It operates in the cepstral domain and modifies both the means and the variances of a model set. To do this, it uses a mismatch function that approximates the effects of the noise on the speech parameters, which are more naturally described in the log-spectral domain. The log-normal approximation is popular. It assumes that the sum of two log-normally distributed variables (the speech and the noise) is also log-normally distributed. In the spectral domain, this technique therefore matches only the first two moments of the corrupted speech distribution.
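As a rough illustration of the log-normal approximation, the sketch below combines a speech Gaussian and a noise Gaussian that are both given in the log-spectral domain, treating every dimension independently. It relies on the standard moment relations of the log-normal distribution, which are not spelled out in the text above; the mapping to and from the cepstral domain and any gain term are left out, and all function names are hypothetical. It should be read as a sketch under these assumptions, not as the procedure used in this work.

```python
import numpy as np

def log_to_linear(mu, var):
    """Mean and variance of exp(z) for z ~ N(mu, var), per dimension."""
    mean = np.exp(mu + 0.5 * var)
    return mean, mean ** 2 * (np.exp(var) - 1.0)

def linear_to_log(mean, var):
    """Log-domain Gaussian whose exponential has the given two moments."""
    log_var = np.log(var / mean ** 2 + 1.0)
    return np.log(mean) - 0.5 * log_var, log_var

def combine_log_normal(mu_x, var_x, mu_n, var_n):
    """Log-normal approximation of the sum of speech and noise."""
    mean_x, lin_var_x = log_to_linear(np.asarray(mu_x, float),
                                      np.asarray(var_x, float))
    mean_n, lin_var_n = log_to_linear(np.asarray(mu_n, float),
                                      np.asarray(var_n, float))
    # Speech and noise are assumed independent and additive in the
    # linear spectral domain, so means and variances simply add.
    return linear_to_log(mean_x + mean_n, lin_var_x + lin_var_n)
```

Only the first two moments in the linear spectral domain are matched; the true distribution of the sum is not log-normal.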
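At dB, parallel model combination has been found to restore recognition performance to that of a recogniser trained on the corrupted speech (Gales and Young ).

3.3 Vector Taylor series

The cepstral mismatch function cannot be used directly to compensate the clean speech models for noise: even if $x$, $h$, and $n$ are all assumed to be Gaussian distributed, $y$ will not be. By linearising it with a truncated vector Taylor series (Moreno ; Kim et al. ; Acero et al. ), an approximation of the compensated model parameters can be derived. Only a small amount of data is needed to find the statistics that this method needs: the means of the convolutional and additive noise, $\mu_h$ and $\mu_n$, and the variance of the additive noise $\Sigma_n$.

The notation $\big|_{\mu_y^{(m)}}$ for the Taylor series expansion point will be used to indicate that $y$ is evaluated at clean speech mean $\mu_x^{(m)}$, convolutional noise mean $\mu_h$, and additive noise mean $\mu_n$. The first-order vector Taylor series approximation of the mismatch function for component $m$ is

$$\hat{y}_i^{(m)} = y_i\big|_{\mu_y^{(m)}} + \nabla_x y_i\big|_{\mu_y^{(m)}} \cdot \big(x - \mu_x^{(m)}\big) + \nabla_n y_i\big|_{\mu_y^{(m)}} \cdot (n - \mu_n) + \nabla_h y_i\big|_{\mu_y^{(m)}} \cdot (h - \mu_h).$$

The resulting compensated model parameters for the static features of component $m$ are found by approximating the first and second moment of $\hat{y}^{(m)}$ (Liao and Gales b):

$$\begin{aligned}
\mu_{y,i}^{(m)} &= \mathcal{E}^{(m)}\{y_i\} \\
&\approx \mathcal{E}^{(m)}\big\{\hat{y}_i^{(m)}\big\} = y_i\big|_{\mu_y^{(m)}} \\
&= \mu_{x,i}^{(m)} + \mu_{h,i} + c_i \log\left(1 + e^{C^{-1}\left(\mu_n - \mu_x^{(m)} - \mu_h\right)}\right),
\end{aligned}$$

where $\mathcal{E}^{(m)}\{\cdot\}$ is the expected value for component $m$.

The first-order vector Taylor series approximation is needed to find the influence of the noise on the variance:

$$\begin{aligned}
\Sigma_y^{(m)} &= \mathcal{E}^{(m)}\big\{y\,y^{\mathsf T}\big\} - \mu_y^{(m)} \mu_y^{(m)\mathsf T} \\
&\approx \mathcal{E}^{(m)}\big\{\hat{y}\,\hat{y}^{\mathsf T}\big\} - \mu_y^{(m)} \mu_y^{(m)\mathsf T} \\
&\approx \frac{\partial y}{\partial x}\bigg|_{\mu_y^{(m)}} \Sigma_x^{(m)} \frac{\partial y}{\partial x}\bigg|_{\mu_y^{(m)}}^{\mathsf T}
 + \frac{\partial y}{\partial h}\bigg|_{\mu_y^{(m)}} \Sigma_h \frac{\partial y}{\partial h}\bigg|_{\mu_y^{(m)}}^{\mathsf T}
 + \frac{\partial y}{\partial n}\bigg|_{\mu_y^{(m)}} \Sigma_n \frac{\partial y}{\partial n}\bigg|_{\mu_y^{(m)}}^{\mathsf T} \\
&\approx \frac{\partial y}{\partial x}\bigg|_{\mu_y^{(m)}} \Sigma_x^{(m)} \frac{\partial y}{\partial x}\bigg|_{\mu_y^{(m)}}^{\mathsf T}
 + \frac{\partial y}{\partial n}\bigg|_{\mu_y^{(m)}} \Sigma_n \frac{\partial y}{\partial n}\bigg|_{\mu_y^{(m)}}^{\mathsf T}.
\end{aligned}$$

The third line makes the assumption that the clean speech and noise are independent; the last line that the channel noise is constant so that $\Sigma_h = 0$.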
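The compensation of one component can be sketched numerically as below. The sketch assumes a square, invertible transform `C` (so that $C^{-1}$ exists as in the mismatch function), writes the Jacobians in the closed form that follows from differentiating the mismatch function, and uses hypothetical function names. It is meant as an illustration of the equations under these assumptions rather than as the implementation used in this work.

```python
import numpy as np

def mismatch(x, h, n, C, C_inv):
    """Cepstral mismatch function y = x + h + C log(1 + exp(C^-1 (n - x - h)))."""
    return x + h + C @ np.log1p(np.exp(C_inv @ (n - x - h)))

def vts_compensate(mu_x, sigma_x, mu_h, mu_n, sigma_n, C):
    """First-order vector Taylor series compensation of one Gaussian.

    Assumes speech and noise are independent and the channel is
    constant (zero channel covariance).
    """
    C_inv = np.linalg.inv(C)
    u = C_inv @ (mu_n - mu_x - mu_h)
    # Diagonal of e^u / (1 + e^u), written in a numerically stable form.
    F = np.diag(1.0 / (1.0 + np.exp(-u)))
    J_n = C @ F @ C_inv                  # dy/dn at the expansion point
    J_x = np.eye(len(mu_x)) - J_n        # dy/dx (and dy/dh)
    mu_y = mismatch(mu_x, mu_h, mu_n, C, C_inv)
    sigma_y = J_x @ sigma_x @ J_x.T + J_n @ sigma_n @ J_n.T
    return mu_y, sigma_y, J_x, J_n
```

The two matrix products per component in the last step are what makes this compensation expensive when it has to be repeated for every Gaussian in the system.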
$\Sigma_x^{(m)}$ can be found from the model; $\Sigma_n$ can be found from a few frames of noise. The resulting $\Sigma_y^{(m)}$ is not diagonal, even if $\Sigma_x^{(m)}$ and $\Sigma_n$ are, but is often diagonalised to make decoding more efficient. Some manipulation shows that the Jacobian matrices are given by

$$\frac{\partial y}{\partial x}\bigg|_{\mu_y^{(m)}} = \frac{\partial y}{\partial h}\bigg|_{\mu_y^{(m)}} = I - C F C^{-1},$$

$$\frac{\partial y}{\partial n}\bigg|_{\mu_y^{(m)}} = C F C^{-1},$$

where $F$ is a diagonal matrix whose elements are given by

$$f_{ii} = \frac{e^{c_i^{-1}\left(\mu_n - \mu_x^{(m)} - \mu_h\right)}}{1 + e^{c_i^{-1}\left(\mu_n - \mu_x^{(m)} - \mu_h\right)}},$$

with $c_i^{-1}$ the $i$th row of $C^{-1}$.

It is interesting to see the effect of the signal-to-noise ratio on the variance that this model predicts. $f_{ii}$ varies between 0 and 1 depending on the value of $\mu_n - \mu_x^{(m)} - \mu_h$. If the noise level $\mu_n$ is high, $f_{ii}$ will tend to 1, causing $\partial y / \partial n\,\big|_{\mu_y^{(m)}}$ to tend to the identity matrix and $\partial y / \partial x\,\big|_{\mu_y^{(m)}}$ to zero. The resulting variance, given by the last line of the covariance approximation, therefore will tend to the variance of the noise $\Sigma_n$, and be small. If the noise level is low, the opposite will happen, and the variance will tend to the clean speech variance $\Sigma_x^{(m)}$. This provides an elegant way of accounting for the changes in the covariance because of noise that were seen in figure 1.1. However, compensating models with vector Taylor series is computationally expensive, since it requires the matrix multiplications of the covariance approximation for every component.

3.4 Joint uncertainty decoding

Joint uncertainty decoding (Liao and Gales ; b) is a model compensation technique derived from a model of the joint distribution of the clean and the noisy speech. It assumes that this distribution is Gaussian. If $s(t)$ is a clean speech vector (including first- and second-order differentials), and $o(t)$ is the corresponding observation, then

$$\begin{bmatrix} s(t) \\ o(t) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_s \\ \mu_o \end{bmatrix},\; \begin{bmatrix} \Sigma_s & \Sigma_{so} \\ \Sigma_{os} & \Sigma_o \end{bmatrix} \right),$$

with parameters specific to the clean speech model state and the noise model state.
If the uncertainty decoding is done in the front-end, then Joint uncertainty decoding partitions the corrupted acoustic space into regions, for each of which a conditional distribution $p(o(t) \mid s(t))$ is estimated. Model-based Joint uncertainty decoding, however, ties this conditional distribution to the model components: every component belongs to one regression class $r$. Components are compensated for the noise characteristics of the region of the acoustic space their means are in.

The distribution of the corrupted speech for component $m$ in regression class $r$ becomes

$$p(o(t) \mid m) = \big|A^{(r)}\big| \cdot \mathcal{N}\big(A^{(r)} o(t) + b^{(r)};\; \mu^{(m)},\; \Sigma^{(m)} + \Sigma_b^{(r)}\big),$$

where

$$A^{(r)} = \Sigma_s^{(r)}\, \Sigma_{os}^{(r)-1},$$

$$b^{(r)} = \mu_s^{(r)} - A^{(r)} \mu_o^{(r)},$$

$$\Sigma_b^{(r)} = A^{(r)}\, \Sigma_o^{(r)}\, A^{(r)\mathsf T} - \Sigma_s^{(r)}.$$

Model-based Joint uncertainty decoding forms the basis of this work. In effect, it applies a piecewise linear transformation to the acoustic space. It also adds a bias $\Sigma_b^{(r)}$ to the covariance, modelling the changes in the variance of noise-corrupted speech.

The parameters of the joint distribution can be estimated from stereo data, with the clean speech and the noisy speech. This is relatively straightforward, and this technique is used in this work. It is also possible to find the parameters from the vector Taylor series approximation, or to use Joint adaptive training (Liao and Gales b).

Joint uncertainty decoding in the form given above works well (Liao and Gales a). However, the compensated models' covariances become full, increasing the computational complexity of decoding by a factor of $d$, the dimensionality of the features. The simple solution is to diagonalise $\Sigma_b^{(r)}$. However, doing that while not diagonalising $A^{(r)}$ is mathematically wrong and leads to extremely bad accuracy (Liao and Gales ). Diagonalising both, reducing the complexity, also reduces accuracy.
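The estimation of a Joint transform from stereo data, and its use in the compensated likelihood, can be sketched as follows for a single regression class. The function names and the array layout (one frame per row) are assumptions made for this illustration; the sketch is not the implementation used for the experiments.

```python
import numpy as np

def joint_statistics(clean, noisy):
    """Blocks of the joint Gaussian over [s; o] from stereo data (rows are frames)."""
    d = clean.shape[1]
    joint = np.hstack([clean, noisy])
    mu = joint.mean(axis=0)
    sigma = np.cov(joint, rowvar=False, bias=True)
    # Returns mu_s, mu_o, Sigma_s, Sigma_os, Sigma_o.
    return mu[:d], mu[d:], sigma[:d, :d], sigma[d:, :d], sigma[d:, d:]

def joint_transform(mu_s, mu_o, sigma_s, sigma_os, sigma_o):
    """Joint transform parameters A, b and covariance bias Sigma_b."""
    A = sigma_s @ np.linalg.inv(sigma_os)
    b = mu_s - A @ mu_o
    sigma_b = A @ sigma_o @ A.T - sigma_s
    return A, b, sigma_b

def jud_log_likelihood(o, mu_m, sigma_m, A, b, sigma_b):
    """log of |A| N(A o + b; mu_m, sigma_m + sigma_b) for one component."""
    diff = A @ o + b - mu_m
    cov = sigma_m + sigma_b          # full matrix, even if sigma_m is diagonal
    _, logdet_cov = np.linalg.slogdet(cov)
    _, logdet_A = np.linalg.slogdet(A)
    return logdet_A - 0.5 * (len(o) * np.log(2 * np.pi) + logdet_cov
                             + diff @ np.linalg.solve(cov, diff))
```

Because `cov` is a full matrix for every component, each likelihood evaluation needs a full quadratic form, which is the source of the extra decoding cost.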
This work will apply feature space transformations to reduce the computational complexity of Joint uncertainty decoding without reducing the accuracy. The next chapter will introduce these transformations.

Chapter 4 Linear transformations

Starting out with an already trained speech recogniser makes it possible to resolve the mismatch using fewer parameters than training a recogniser from scratch would take. Chapter 3 has discussed techniques for resolving the mismatch because of noise specifically. This chapter discusses linear transformations.

The original form of maximum likelihood linear regression (Leggetter and Woodland ) transformed only the mean and was meant specifically to adapt the models to a speaker. Because of their generic nature, however, linear transformations can solve not only mismatches in the speaker accent, speaking style, and voice quality, but also in the noise condition. Because they use fewer parameters than full speech recognisers, they can be trained on much less data than it would take to retrain models individually, and they can be estimated on-line.

Semi-tied covariance matrices (Gales ) have a similar form. They provide rotation matrices that apply to observations, means, and variances. This means that the observation likelihood is calculated in a different feature space. The original objective was to allow the Gaussians to have diagonal covariance matrices. Therefore, the algorithm for estimating semi-tied covariance matrices finds feature spaces in which diagonalising the covariance bias is reasonable.

4.1 Maximum likelihood linear regression

Maximum likelihood linear regression (MLLR) transforms in their most general form transform both means and covariances of Gaussian distributions (Gales and Woodland ). The new mean $\hat{\mu}^{(m)}$ and covariance $\hat{\Sigma}^{(m)}$ of component $m$ become

$$\hat{\mu}^{(m)} = A^{(r)\prime} \mu^{(m)} - b^{(r)\prime}$$

and

$$\hat{\Sigma}^{(m)} = H^{(r)}\, \Sigma^{(m)}\, H^{(r)\mathsf T},$$

with component $m$ in regression class $r$.
Given a small amount of training data, it is possible to find maximum likelihood estimates for $A^{(r)\prime}$, $b^{(r)\prime}$, and $H^{(r)}$.

Various specific forms of MLLR have been proposed. The original paper (Leggetter and Woodland ) transformed only the means. This work will look at two other forms: one that applies the same transform to means and covariances, called constrained MLLR, and one that adapts only the covariances, called covariance MLLR.

4.1.1 CMLLR

The special case $A^{(r)\prime} = H^{(r)}$ is one of the transformations to be considered in this work. It is called constrained maximum likelihood linear regression (CMLLR). Digalakis et al. ( ) introduced the diagonal transform case; Gales ( a) extended it to full transforms. It transforms the models by

$$\hat{\mu}^{(m)} = A^{(r)\prime} \mu^{(m)} - b^{(r)\prime},$$

$$\hat{\Sigma}^{(m)} = A^{(r)\prime}\, \Sigma^{(m)}\, A^{(r)\prime\mathsf T}.$$

Its advantages come to light, however, when it is written as a transformation of the observations:

$$\hat{o}(t) = A^{(r)\prime -1} o(t) + A^{(r)\prime -1} b^{(r)\prime} = A^{(r)} o(t) + b^{(r)},$$

so that the observation likelihood becomes

$$\mathcal{L}\big(o(t);\, \mu^{(m)}, \Sigma^{(m)}, A^{(r)}, b^{(r)}\big) = \big|A^{(r)}\big| \cdot \mathcal{N}\big(A^{(r)} o(t) + b^{(r)};\; \mu^{(m)}, \Sigma^{(m)}\big).$$

This means that each environment-influenced feature vector is transformed to the feature space that the models in a regression class expect. Conceptually, it is a piecewise linear transformation, because Gaussians are clustered into a regression class based on their distance to each other. Computationally, models can calculate the observation likelihood on the appropriately transformed feature vector. This makes decoding fast and adaptation non-invasive, so that changes in the environment are easily compensated for.

To obtain a maximum likelihood estimate for $A^{(r)}$ and $b^{(r)}$, start out by formulating them as an extended transformation matrix $W^{(r)} = \big[A^{(r)}\; b^{(r)}\big]$. The extended observation vector is $\zeta(t) = \big[o(t)^{\mathsf T}\; 1\big]^{\mathsf T}$, so that

$$\hat{o}(t) = A^{(r)} o(t) + b^{(r)} = W^{(r)} \zeta(t).$$

$W^{(r)}$ is estimated in an iterative fashion, with one row being updated at a time.
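The equivalence between the model-space and the feature-space form of the constrained transform can be checked numerically with a small sketch such as the one below. The function names are hypothetical and the transform is simply passed in rather than estimated.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a full-covariance Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def cmllr_model_space(o, mu, sigma, A_prime, b_prime):
    """Likelihood with transformed model: N(o; A' mu - b', A' Sigma A'^T)."""
    return log_gaussian(o, A_prime @ mu - b_prime,
                        A_prime @ sigma @ A_prime.T)

def cmllr_feature_space(o, mu, sigma, A_prime, b_prime):
    """Same likelihood with transformed observation: |A| N(A o + b; mu, Sigma)."""
    A = np.linalg.inv(A_prime)
    b = A @ b_prime
    _, logdet_A = np.linalg.slogdet(A)   # log of the absolute determinant
    return logdet_A + log_gaussian(A @ o + b, mu, sigma)
```

In the feature-space form the transformed observation and the determinant are computed once per regression class and frame, after which every component in the class uses its original mean and covariance unchanged.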
The updated $i$th row of the transform is given by

$$w_i = \big(\alpha\, p_i + k^{(i)}\big)\, G^{(i)-1},$$

where $p_i$ is the extended cofactor row vector $[c_{i1} \ldots c_{id}\; 0]$ (with $c_{ij} = \mathrm{cof}(A_{ij})$). The statistics from the adaptation data that are needed are

$$G^{(i)} = \sum_{m=1}^{M} \frac{1}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t)\, \zeta(t)^{\mathsf T},$$

and

$$k^{(i)} = \sum_{m=1}^{M} \frac{\mu_i^{(m)}}{\sigma_i^{(m)2}} \sum_{t=1}^{T} \gamma^{(m)}(t)\, \zeta(t)^{\mathsf T},$$

and $\alpha$ satisfies

$$\alpha^2\, p_i\, G^{(i)-1} p_i^{\mathsf T} + \alpha\, p_i\, G^{(i)-1} k^{(i)\mathsf T} - \beta = 0,$$

with

$$\beta = \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma^{(m)}(t).$$

The computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of $G^{(i)}$. The latter costs $O(d^3)$ per matrix (with $d$ the dimension of the feature vector). A naive implementation of the former costs $O(d^3)$ per matrix per iteration, but using the Sherman–Morrison matrix inversion lemma this can be reduced to $O(d^2)$ (Mark Gales, personal communication). Thus, for $R$ transforms and $I$ iterations, the cost of estimating the transforms is $O(RId^3 + Rd^4)$. This does not take into account the cost of gathering the statistics. This work does not use statistics from data, but predicts statistics based on the models and the Joint transform. Section 5.1.1 shows how to generate the predicted statistics and section 5.2 shows the complexity.

4.1.2 Covariance MLLR

Covariance MLLR updates only the covariances, i.e. it only uses the covariance transformation $\hat{\Sigma}^{(m)} = H^{(r)} \Sigma^{(m)} H^{(r)\mathsf T}$. Just like constrained MLLR it is most efficient when used the other way around. Rather than transforming the covariances, the observations and the means are transformed, yielding

$$\mathcal{L}\big(o(t);\, \mu^{(m)}, \Sigma^{(m)}, A^{(r)}\big) = \big|A^{(r)}\big| \cdot \mathcal{N}\big(A^{(r)} o(t);\; A^{(r)} \mu^{(m)}, \Sigma^{(m)}\big).$$

$A^{(r)}$ is estimated in an iterative fashion, with one row being updated at a time. With every update the value of the auxiliary function increases. The statistics required are the occupation-weighted summed covariance from the data,

$$W^{(m)} = \sum_{t} \gamma^{(m)}(t)\, \big(o(t) - \mu^{(m)}\big)\big(o(t) - \mu^{(m)}\big)^{\mathsf T}.$$

From that, a matrix is found for every dimension $i$,

$$G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\sigma_i^{(m)2}},$$

with $\sigma_i^{(m)2}$ the $i$th element of the leading diagonal of $\Sigma^{(m)}$. The update formula for row $i$ of $A^{(r)}$ is

$$a_i^{(r)} = c_i\, G^{(i)-1} \sqrt{\frac{\sum_{m=1}^{M^{(r)}} \sum_{t} \gamma^{(m)}(t)}{c_i\, G^{(i)-1} c_i^{\mathsf T}}},$$

where $c_i$ is the $i$th row of the cofactors of $A^{(r)}$.

Just as for CMLLR, calculating the inverse of $G^{(i)}$ and finding the cofactors form the main computational cost, so that estimating transforms for $R$ regression classes in $I$ iterations is $O(RId^3 + Rd^4)$, with $d$ the size of the feature vector. Covariance MLLR does need to transform all model means, which takes $O(Md^2)$. Again, this does not take into account the cost of gathering the statistics, because this work uses predicted statistics. Section 5.1.2 shows how to generate the predicted statistics and section 5.2 shows the complexity.

4.2 Semi-tied covariance matrices

A compromise between the speed of diagonal covariance matrices and the modelling accuracy of full ones has been found earlier. Gales ( ) proposed a scheme in which diagonal covariance matrices share one rotation matrix per regression class. The algorithm finds a transformation into a feature space in which a diagonal covariance matrix is a more valid assumption than in the original feature space. The effective covariance matrix for component $m$ is composed of a diagonal matrix $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ and the transformation $A^{(r)}$ that maps into the feature space in which it is diagonal:

$$\hat{\Sigma}^{(m)} = A^{(r)-1}\, \tilde{\Sigma}_{\mathrm{diag}}^{(m)}\, A^{(r)-\mathsf T}.$$

The observation likelihood becomes

$$\mathcal{L}\big(o(t);\, \mu^{(m)}, \tilde{\Sigma}_{\mathrm{diag}}^{(m)}, A^{(r)}\big) = \big|A^{(r)}\big| \cdot \mathcal{N}\big(A^{(r)} o(t);\; A^{(r)} \mu^{(m)},\; \tilde{\Sigma}_{\mathrm{diag}}^{(m)}\big),$$

which indicates its relation to the observation likelihood of CMLLR. This is also a feature-space transformation. However, CMLLR transforms just the models; semi-tied covariance matrices transform both the observations and the model. This points to the different purposes of the transformations, even though they are both linear transformations. CMLLR assumes that the model is perfectly valid, but the observations and the model are in different feature spaces.
Semi-tied covariance matrices assume that the observations and the models should be in the same feature space, but recognise that the model, with its diagonal covariance matrix, is flawed, and transform the space in which the calculations are done.

To estimate the parameters for semi-tied covariance matrices, $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ and $A^{(r)}$ must be found simultaneously. This is done in an expectation-maximisation fashion. $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ is initialised to the diagonalised original covariance, $\mathrm{diag}\big(\Sigma^{(m)}\big)$. In the first step, the transformation $A^{(r)}$ is updated in the same way as is done for covariance MLLR transforms, described in the section on covariance MLLR. However, the current estimate for the covariance, which changes every iteration, is used; $\tilde{\sigma}_{\mathrm{diag},i}^{(m)}$ is the $i$th element of the leading diagonal of $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$. In the second step, $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ is set to the maximum-likelihood diagonal covariance in the feature space given by $A^{(r)}$. This is repeated until convergence.

The full procedure is as follows. Repeat $J$ times:

1. Estimate $A^{(r)}$ as in the section on covariance MLLR. Given the current estimate for $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$, iterate over the rows of $A^{(r)}$, updating each $I$ times. Row $i$ of $A^{(r)}$ is set to

\[ a_i^{(r)} = c_i G^{(i)-1} \sqrt{ \frac{ \sum_{m=1}^{M^{(r)}} \sum_{t} \gamma^{(m)}(t) }{ c_i G^{(i)-1} c_i^{\mathsf{T}} } } , \]

where $c_i$ is the $i$th row of the cofactors of $A$ and

\[ G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} \]
\[ W^{(m)} = \sum_{t} \gamma^{(m)}(t) \big(o(t) - \mu^{(m)}\big)\big(o(t) - \mu^{(m)}\big)^{\mathsf{T}} . \]

2. Estimate $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$. $W^{(m)}$ is the occupancy-weighted summed covariance of the data. The maximum-likelihood estimate for $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ is the covariance in the feature space expressed by $A^{(r)}$:

\[ \tilde{\Sigma}_{\mathrm{diag}}^{(m)} = \mathrm{diag}\left( \frac{ A^{(r)} W^{(m)} A^{(r)\mathsf{T}} }{ \sum_{t} \gamma^{(m)}(t) } \right) . \]

This scheme allows diagonal matrices to be used in decoding, because they now work in a different feature space. This reduces the number of variables to be estimated while retaining much of the gain in recognition accuracy of full covariance matrices.
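The two alternating steps above can be sketched in plain Python for a single regression class. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: all function names are hypothetical, matrices are lists of rows, the inner loop is read as `n_inner` sweeps over the rows, and the cofactor row is computed from a full inverse (the naive variant, without the Sherman-Morrison speed-up).

```python
import math


def inverse_and_det(matrix):
    """Gauss-Jordan inverse and determinant with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det
        scale = aug[col][col]
        det *= scale
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def update_row(a, i, g_i, beta):
    """Step 1 for one row: c_i G^-1 sqrt(beta / (c_i G^-1 c_i^T))."""
    d = len(a)
    a_inv, det = inverse_and_det(a)
    cof = [det * a_inv[j][i] for j in range(d)]   # row i of the cofactor matrix
    g_inv, _ = inverse_and_det(g_i)
    v = [sum(cof[j] * g_inv[j][k] for j in range(d)) for k in range(d)]
    scale = math.sqrt(beta / sum(v[k] * cof[k] for k in range(d)))
    return [scale * x for x in v]


def diag_covariance(a, w, occupancy):
    """Step 2 for one component: diag(A W A^T) / occupancy."""
    d = len(a)
    return [sum(a[i][j] * w[j][k] * a[i][k] for j in range(d) for k in range(d))
            / occupancy for i in range(d)]


def estimate_semi_tied(w_list, occupancies, n_outer=2, n_inner=2):
    """Alternate row updates of A with re-estimation of the diagonal variances."""
    d = len(w_list[0])
    a = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    beta = sum(occupancies)
    variances = [diag_covariance(a, w, g) for w, g in zip(w_list, occupancies)]
    for _outer in range(n_outer):
        for _inner in range(n_inner):
            for i in range(d):
                # G^(i) = sum_m W^(m) / sigma_diag,i^(m)
                g_i = [[sum(w[j][k] / s[i] for w, s in zip(w_list, variances))
                        for k in range(d)] for j in range(d)]
                a[i] = update_row(a, i, g_i, beta)
        variances = [diag_covariance(a, w, g) for w, g in zip(w_list, occupancies)]
    return a, variances
```

With a single component the transform can diagonalise the covariance exactly; with several components sharing one transform it can only do so approximately, which is the compromise the scheme makes.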
Just like for CMLLR and covariance MLLR, the computational complexity of this algorithm is dominated by the cost of calculating the cofactors and the inverse of $G^{(i)}$. The former costs $O(d^2)$ (with $d$ the dimension of the feature vector) per dimension per iteration. The latter costs $O(d^3)$ per dimension. Thus, for $R$ transforms, $J$ outer loop iterations, and $I$ inner loop iterations, the cost of estimating the transforms is $O(RJId^3 + RJd^4)$. This does not take into account the cost of gathering the statistics. This work does not use statistics from data, but predicts statistics based on the models and the Joint transform. The section on predictive semi-tied covariance matrices shows how to generate the predicted statistics, and the section on computational complexity discusses the cost.

Linear transformations for noise robustness

The section on Joint uncertainty decoding has introduced a transformation of the form

\[ p(o(t) \mid m) = \big|A_J^{(r)}\big| \, \mathcal{N}\big(A_J^{(r)} o(t) + b_J^{(r)}; \mu^{(m)}, \Sigma^{(m)} + \Sigma_b^{(r)}\big) , \]

in which $A_J^{(r)}$, $b_J^{(r)}$, and $\Sigma_b^{(r)}$ are the Joint transform parameters. The problem with Joint uncertainty decoding that this work addresses is the covariance bias $\Sigma_b^{(r)}$. If $\Sigma_b^{(r)}$ is full, all covariances become full, and decoding slows down. Decoding with full covariance matrices costs $O(TMd^2)$, with $T$ the number of observations, $M$ the number of components, and $d$ the size of the feature vectors. Decoding with diagonal $A_J^{(r)}$ and $\Sigma_b^{(r)}$, on the other hand, costs $O(TMd)$, but loses accuracy.

The full covariance bias models the change in feature space that the noise causes. Thus, decoding with better feature spaces may obviate the need for the full covariance bias, or, indeed, for any covariance bias at all. The linear transformations discussed in the chapter on linear transformations transform the feature space without the detrimental effect on decoding speed that Joint uncertainty decoding's full covariance bias has. Therefore, they will be applied to Joint transforms.

The algorithms for estimating the linear transforms that are investigated in this work remain the same.
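To make the $O(TMd)$ figure for the diagonal case concrete, the sketch below evaluates the Joint uncertainty decoding log-likelihood for one component and one frame when $A_J^{(r)}$, $\Sigma^{(m)}$ and $\Sigma_b^{(r)}$ are all diagonal. It is an illustration only: the function name and the representation of diagonal matrices as plain lists are assumptions, not taken from this work.

```python
import math


def jud_diag_log_likelihood(o, a_j, b_j, mean, var, var_bias):
    """log(|A_J| N(A_J o + b_J; mean, var + var_bias)) with everything diagonal.

    Every argument is a list of length d; the loop does a constant amount of
    work per dimension, so one component costs O(d) per frame.
    """
    total = 0.0
    for o_i, a_i, b_i, mu_i, s_i, sb_i in zip(o, a_j, b_j, mean, var, var_bias):
        x_i = a_i * o_i + b_i          # transformed observation
        v_i = s_i + sb_i               # model variance plus covariance bias
        total += math.log(abs(a_i))    # contribution to log |A_J|
        total -= 0.5 * (math.log(2.0 * math.pi * v_i) + (x_i - mu_i) ** 2 / v_i)
    return total
```

With a full covariance bias the quadratic term needs a matrix-vector product per component instead, which is where the extra factor of $d$ comes from.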
These algorithms still find the optimal transforms in a maximum likelihood sense. The difference is in the data source. The linear transformations are normally estimated based on statistics from actual data. In this work, the statistics used are the expected values of the statistics given the models and the Joint transform. The Joint uncertainty decoding distribution is assumed to be the actual distribution of the noisy data. For example, the expected covariance for component $m$ in regression class $r$ is $\Sigma^{(m)} + \Sigma_b^{(r)}$. As the linear transforms are estimated on the predicted properties of the data, they will be called "predictive" linear transforms, as in Gales (b).

The next section discusses how the predictive transforms can be estimated from the models and the Joint transform. The section after that discusses the computational complexity.

Predictive linear transforms

In this work the predictive transforms will be estimated from a Joint transform. Their parameters will be indicated by a P subscript. First, a short overview of the transforms will be given.

Four types of linear transformations will be considered. Predictive CMLLR finds an optimal linear transformation that is applied only to the observations. Predictive covariance MLLR finds a feature space in which the model's original covariance models the actual covariance well. Half-iteration predictive semi-tied covariance matrices is similar to predictive covariance MLLR, but it does add the diagonalised covariance bias. Predictive semi-tied covariance matrices use a per-regression-class rotation of observations and model parameters to perform the likelihood calculation in another feature space.

Predictive CMLLR

Constrained maximum likelihood linear regression finds a feature transformation that does not change the model but only transforms the observations. The Joint uncertainty decoding likelihood is changed by taking out the covariance bias and applying the CMLLR transformation of the observations.
It then becomes

\[ p(o(t) \mid m) = \big|A_P^{(r)}\big| \cdot \big|A_J^{(r)}\big| \cdot \mathcal{N}\Big(A_P^{(r)} \big(A_J^{(r)} o(t) + b_J^{(r)}\big) + b_P^{(r)}; \mu^{(m)}, \mathrm{diag}\big(\Sigma^{(m)}\big)\Big) . \]

This assumes that the mean and covariance of the original model are still correct for noisy speech, but does transform the noise-corrupted observations to another feature space.

The procedure to estimate the parameters $A_P^{(r)}$ and $b_P^{(r)}$ detailed in the section on constrained MLLR can be followed. The statistics needed in that procedure, matrices $G^{(i)}$ and vectors $k^{(i)}$, are replaced by their predicted values. Recall that the statistics were expressed in terms of the extended feature vector $\zeta(t) = [o(t)^{\mathsf{T}} \; 1]^{\mathsf{T}}$.

In the original algorithm, $G^{(i)}$ sums the maximum likelihood estimates of the second moment about zero of the distribution of the extended observations, weighted by the diagonal entries of the variances of the models. Its predicted value can be found using the expected first moment $\mu^{(m)}$ and the expected second central moment $\Sigma^{(m)} + \Sigma_b^{(r)}$ of the components' distributions:

\[ G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{1}{\sigma_i^{(m)}} \, E\left\{ \sum_{t=1}^{T} \gamma^{(m)}(t) \, \zeta(t) \zeta(t)^{\mathsf{T}} \right\} \]
\[ \phantom{G^{(i)}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_i^{(m)}} \begin{bmatrix} \Sigma^{(m)} + \Sigma_b^{(r)} + \mu^{(m)} \mu^{(m)\mathsf{T}} & \mu^{(m)} \\ \mu^{(m)\mathsf{T}} & 1 \end{bmatrix} . \]

In the original algorithm, $k^{(i)}$ is the sum of the observations weighted by occupancy and the model parameters. The component mean is the average of the observations weighted by occupancy, so that the predicted value for these statistics can be found by

\[ k^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{\mu_i^{(m)}}{\sigma_i^{(m)}} \, E\left\{ \sum_{t=1}^{T} \gamma^{(m)}(t) \, \zeta(t)^{\mathsf{T}} \right\} \]
\[ \phantom{k^{(i)}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma_m \mu_i^{(m)}}{\sigma_i^{(m)}} \begin{bmatrix} \mu^{(m)\mathsf{T}} & 1 \end{bmatrix} . \]

Predictive covariance MLLR

Predictive covariance MLLR finds a feature space in which the model's original covariance is valid. The mean, however, is transformed to the new feature space. The Joint uncertainty decoding likelihood is changed by taking out the covariance bias and applying the covariance MLLR transformation. It then becomes

\[ p(o(t) \mid m) = \big|A_P^{(r)}\big| \cdot \big|A_J^{(r)}\big| \cdot \mathcal{N}\Big(A_P^{(r)} \big(A_J^{(r)} o(t) + b_J^{(r)}\big); A_P^{(r)} \mu^{(m)}, \mathrm{diag}\big(\Sigma^{(m)}\big)\Big) . \]

The statistics that need to be found to estimate $A_P^{(r)}$ are the $W^{(m)}$. From their definition it can be seen that these are the occupancy-weighted summed covariances. The predicted value for them is

\[ W^{(m)} = E\left\{ \sum_{t} \gamma^{(m)}(t) \big(o(t) - \mu^{(m)}\big)\big(o(t) - \mu^{(m)}\big)^{\mathsf{T}} \right\} = \gamma_m \big( \Sigma^{(m)} + \Sigma_b^{(r)} \big) . \]

Then $G^{(i)}$ becomes

\[ G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\sigma_{\mathrm{diag},i}^{(m)}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_{\mathrm{diag},i}^{(m)}} \big( \Sigma^{(m)} + \Sigma_b^{(r)} \big) . \]

Half-iteration predictive semi-tied covariance matrices

The scheme laid out in the section on semi-tied covariance matrices is computationally expensive, because it alternates between updating $A_P^{(r)}$ and $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$, and updating $A_P^{(r)}$ already requires iterating over its rows. An alternative is to update only $A_P^{(r)}$. Applied to the Joint uncertainty decoding likelihood, this results in a form similar to that of predictive covariance MLLR, but the covariance bias is retained in a diagonalised form:

\[ p(o(t) \mid m) = \big|A_P^{(r)}\big| \cdot \big|A_J^{(r)}\big| \cdot \mathcal{N}\Big(A_P^{(r)} \big(A_J^{(r)} o(t) + b_J^{(r)}\big); A_P^{(r)} \mu^{(m)}, \mathrm{diag}\big(\Sigma^{(m)} + \Sigma_b^{(r)}\big)\Big) . \]

This assumes that the diagonalised predicted covariance is valid in another feature space. The transformation $A_P^{(r)}$ is optimised for this feature space. The first step of the procedure detailed in the section on semi-tied covariance matrices can be followed. Analogously to the covariance MLLR transform, the predicted value for $W^{(m)}$ is

\[ W^{(m)} = \gamma_m \big( \Sigma^{(m)} + \Sigma_b^{(r)} \big) , \]

so that $G^{(i)}$ becomes

\[ G^{(i)} = \sum_{m=1}^{M^{(r)}} \frac{W^{(m)}}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} = \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} \big( \Sigma^{(m)} + \Sigma_b^{(r)} \big) . \]

Predictive semi-tied covariance matrices

Semi-tied covariance matrices are the most powerful linear transforms that this work discusses. They move the calculation of the Gaussian distributions in a regression class to another feature space, so that the computationally convenient assumption of a diagonal covariance matrix holds. The covariance bias in the Joint uncertainty decoding likelihood is not removed, but by applying the semi-tied transformation it is transformed and diagonalised:

\[ p(o(t) \mid m) = \big|A_P^{(r)}\big| \cdot \big|A_J^{(r)}\big| \cdot \mathcal{N}\Big(A_P^{(r)} \big(A_J^{(r)} o(t) + b_J^{(r)}\big); A_P^{(r)} \mu^{(m)}, \mathrm{diag}\big(A_P^{(r)} \big(\Sigma^{(m)} + \Sigma_b^{(r)}\big) A_P^{(r)\mathsf{T}}\big)\Big) . \]

Since this transforms the observations, means, and variances, the distribution is essentially calculated in another feature space. The transformation $A_P^{(r)}$ is optimised to allow diagonal covariances in this feature space.

The full procedure detailed in the section on semi-tied covariance matrices can be followed. The values for $W^{(m)}$ and $G^{(i)}$ are the same as for the half-iteration scheme. The second step of the procedure sets the diagonal covariance of the model to the diagonalised transformed covariance of the training data. The predicted value of the covariance is $\Sigma^{(m)} + \Sigma_b^{(r)}$, so that

\[ \tilde{\Sigma}_{\mathrm{diag}}^{(m)} = \mathrm{diag}\Big( A_P^{(r)} \big(\Sigma^{(m)} + \Sigma_b^{(r)}\big) A_P^{(r)\mathsf{T}} \Big) . \]

Computational complexity

The current issue with using Joint uncertainty decoding in practice is the full covariance matrices. These make decoding of $T$ observations $O(TMd^2)$. With diagonal covariance matrices this becomes $O(TMd)$. This gain is acquired at the expense of extra time spent to estimate the linear transformation and transform the models.

The purpose of this work is to reduce the computational complexity of Joint uncertainty decoding. The ultimate goal is to instantaneously estimate a Joint transform from noise and estimate an approximation that allows fast decoding. For this, statistics that depend on the model ($\mu^{(m)}$, $\Sigma^{(m)}$) but not on the Joint transform ($\Sigma_b^{(r)}$) should be pre-calculated. This reduces the on-line processing time. The complexities of the schemes proposed in this work are functions of the number of components $M$, the number of regression classes $R$, the dimensionality of the feature vectors $d$, the number of observations $T$, the number of inner loop iterations $I$, and the number of outer loop iterations $J$.

The next sections discuss the complexities of the algorithms from the section on predictive linear transforms. To find the lowest possible cost, some of the statistics must be rewritten to take advantage of the off-line availability of the models.

Predictive CMLLR

Predictive CMLLR finds a feature space transformation that does not necessitate adapting the models at all.
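As an illustration of the predicted statistics for predictive CMLLR introduced earlier, the sketch below accumulates $G^{(i)}$ and $k^{(i)}$ directly from model parameters and a covariance bias for one regression class. It is a hedged sketch, not the code used in this work: the function and argument names are hypothetical, matrices are lists of rows, and $\sigma_i^{(m)}$ is taken to be the $i$th diagonal element of the model covariance.

```python
def predicted_cmllr_statistics(occupancies, means, covariances, cov_bias, i):
    """Predicted G^(i) and k^(i) for one regression class and dimension i.

    G^(i) = sum_m gamma_m / sigma_i^(m) * [[S + S_b + mu mu^T, mu], [mu^T, 1]]
    k^(i) = sum_m gamma_m * mu_i^(m) / sigma_i^(m) * [mu^T, 1]
    """
    d = len(cov_bias)
    g = [[0.0] * (d + 1) for _ in range(d + 1)]
    k = [0.0] * (d + 1)
    for gamma, mu, cov in zip(occupancies, means, covariances):
        weight = gamma / cov[i][i]               # gamma_m / sigma_i^(m)
        for r in range(d):
            for c in range(d):
                # expected second moment about zero of the observations
                g[r][c] += weight * (cov[r][c] + cov_bias[r][c] + mu[r] * mu[c])
            g[r][d] += weight * mu[r]            # last column: the mean
            g[d][r] += weight * mu[r]            # last row: the mean, transposed
            k[r] += weight * mu[i] * mu[r]
        g[d][d] += weight                        # bottom-right entry: 1
        k[d] += weight * mu[i]
    return g, k
```

Written this way, every call loops over all components of the class, which is the direct form whose cost the discussion that follows seeks to avoid.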
Because the models are left untouched, the complexity is independent of the number of components, which makes predictive CMLLR the fastest predictive linear transformation discussed in this work. The expression for the predicted $G^{(i)}$ looks like an $O(Md^3)$ operation. However, rewriting it as

\[ G^{(i)} = \underbrace{ \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_i^{(m)}} \begin{bmatrix} \Sigma^{(m)} + \mu^{(m)} \mu^{(m)\mathsf{T}} & \mu^{(m)} \\ \mu^{(m)\mathsf{T}} & 1 \end{bmatrix} }_{\text{cached}} + \begin{bmatrix} \Sigma_b^{(r)} & 0 \\ 0 & 0 \end{bmatrix} \underbrace{ \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\sigma_i^{(m)}} }_{\text{cached}} \]

allows most of the statistics to be cached, so that the complexity becomes $O(Rd^3)$. As the expression for the predicted $k^{(i)}$ shows, assembling $k^{(i)}$ can be done fully off-line.

Predictive covariance MLLR

Predictive covariance MLLR finds a transformation that does not rely on the covariance bias. It turns out that the part of $G^{(i)}$ that depends on the model can therefore be found off-line. The expression for $G^{(i)}$ can be rewritten as

\[ G^{(i)} = \underbrace{ \sum_{m} \frac{\gamma_m}{\sigma_{\mathrm{diag},i}^{(m)}} \Sigma^{(m)} }_{\text{cached}} + \Sigma_b^{(r)} \underbrace{ \sum_{m} \frac{\gamma_m}{\sigma_{\mathrm{diag},i}^{(m)}} }_{\text{cached}} , \]

so that the total on-line cost of finding $G^{(i)}$ is $O(Rd^3)$. The time needed to find the transformation does not depend on the number of components. However, since the model means do need to be transformed, applying the transformation takes $O(Md^2)$.

                          CMLLR       Covariance MLLR   Half semi-tied   Full semi-tied
    Finding G^(i)         O(Rd^3)     O(Rd^3)           O(Md^2)          O(JMd^2)
    Inverting G^(i)       O(Rd^4)     O(Rd^4)           O(Rd^4)          O(JRd^4)
    Calculating c_i       O(IRd^3)    O(IRd^3)          O(IRd^3)         O(JIRd^3)
    Setting Σ̃^(m)                                       O(Md)            O(JMd^2)
    Setting μ̃^(m)                     O(Md^2)           O(Md^2)          O(Md^2)

Table: The complexity of estimating predictive transforms from a Joint transform. $d$ is the number of dimensions of the feature vector; $M$ is the number of components; $R$ is the number of regression classes; $I$ is the number of inner iterations; $J$ is the number of outer iterations (see the section on semi-tied covariance matrices).

Predictive semi-tied covariance matrices

The computational complexity of predictive semi-tied covariance matrices does not have to be as great as it may seem from their definition. Two observations can be made.

The first one is that if the models' covariance matrices $\Sigma^{(m)}$ are diagonal, it makes sense to split the calculation of $G^{(i)}$ up into two parts. The expression for $G^{(i)}$ suggests an $O(M^{(r)} d^3)$ complexity, where $M^{(r)}$ is the number of models in regression class $r$ and $d$ is the dimensionality of the feature vectors. It can, however, be written as

\[ G^{(i)} = \underbrace{ \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} \Sigma^{(m)} }_{O(M^{(r)} d^2)} + \underbrace{ \Sigma_b^{(r)} \sum_{m=1}^{M^{(r)}} \frac{\gamma_m}{\tilde{\sigma}_{\mathrm{diag},i}^{(m)}} }_{O(d^3 + M^{(r)})} , \]

so that the total cost is $O(Md^2)$, assuming $M > Rd$.

Similarly, the update of $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ suggests an $O(Md^3)$ cost. By again assuming that $\Sigma^{(m)}$ is diagonal,

\[ \tilde{\Sigma}_{\mathrm{diag}}^{(m)} = \underbrace{ \mathrm{diag}\big( A_P^{(r)} \Sigma^{(m)} A_P^{(r)\mathsf{T}} \big) }_{O(M^{(r)} d^2)} + \underbrace{ \mathrm{diag}\big( A_P^{(r)} \Sigma_b^{(r)} A_P^{(r)\mathsf{T}} \big) }_{O(d^3)} , \]

making the total cost $O(Md^2)$ as well.

For the full scheme, the complexities are multiplied by the number of outer iterations $J$. The half-iteration scheme for predictive semi-tied covariance matrices only finds a transformation matrix $A_P^{(r)}$ and does not adjust the covariances $\tilde{\Sigma}^{(m)}$, apart from adding the covariance bias, which is $O(Md)$.

Summary

The table above details the time requirements for the approximations for Joint transforms discussed in the previous sections. The naive implementation for calculating the cofactors $c_i$ takes $O(Rd^4)$ per iteration, but using the Sherman-Morrison matrix-inversion lemma this can be reduced to $O(Rd^3)$ per iteration (Mark Gales, personal communication). Inverting $G^{(i)}$ takes $O(Rd^4)$. By using the average of the diagonal of $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$ rather than a different $\tilde{\sigma}_{\mathrm{diag},i}^{(m)}$ for each $1 \le i \le d$, it may be possible to reduce this to $O(Rd^3)$ (Mark Gales, personal communication). In all cases, by allowing for diagonal covariances on the models compensated for noise, the complexity associated with decoding $T$ observations with Joint uncertainty decoding is reduced by a factor of $d$.

Experiments

The previous chapter has discussed linear transformations for noise robustness, which this chapter will explore in practice. The experiments are conducted using the Hidden Markov Model Toolkit (HTK). An adapted version of HTK from Liao and Gales (a) was used to estimate the Joint transforms on stereo data.
The code for predictive linear transformations was added to HTK, which supports semi-tied covariance matrices. Two extra commands were implemented in the model editor HHEd. Both take a model, a Joint transform, and a statistics file and estimate a linear transform and a new model. One estimates a predictive semi-tied covariance transform or a predictive covariance MLLR transform; the other estimates a predictive CMLLR transform.

A noise-corrupted version of the Resource Management corpus and the Aurora task provide a testbed. Since the approximations aim at finding the best trade-off between accuracy and speed, they will be compared with diagonal Joint distributions with similar complexities.

Resource Management task

The naval Resource Management corpus (Price et al.) is a medium-vocabulary database. It was recorded in a sound-isolated room using a head-mounted Sennheiser noise-cancelling microphone, yielding a high signal-to-noise ratio. Speakers read sentences of prompted script of varying length. Destroyer Operations Room noise, sampled at random intervals, was added to the recordings to reach the desired signal-to-noise ratio (Liao and Gales). Results are from three of the four test sets: the two February sets and the October set (the September set was not used). The results that will be presented are averaged over the three sets.

Table: Reference system word error rates (in %) for Joint uncertainty decoding with regression classes, for the clean, diagonal Joint, and full Joint systems at the two signal-to-noise ratios considered. Numbers are reproduced from Liao and Gales (a).

The speech recogniser is based on the Resource Management recipe distributed with the Hidden Markov Model Toolkit (Young et al.). It uses feature vectors consisting of MFCCs and the log energy, with delta and delta-delta coefficients. It is a cross-word state-clustered triphone system with six components per output distribution and a bigram grammar.
The Gaussian mixture models were trained with iterative mixture splitting.

Aurora task

The Aurora task (Hirsch and Pearce) is a more extreme task than the Resource Management one, with a small vocabulary and lower signal-to-noise ratios. The clean speech was taken from the TIDIGITS database of connected digits, and different real-world noises were added. The clean data consists of utterances by male and female speakers. Different noise conditions are provided, combining four signal-to-noise ratios with additive noise recorded at four places: suburban train, crowd of people (babble), car, and an exhibition hall. Each of the conditions has its own sentences for matched training and for testing.

The reference speech recogniser uses feature vectors consisting of MFCCs and the unnormalised log energy, with delta and delta-delta coefficients. The acoustic models are whole-word digit models with three mixtures per state, plus silence and inter-word pause models.

Reference systems

This work seeks a trade-off between the computational cost of noise-robustness techniques and their recognition accuracy. The reference system table contains word error rates on both the Resource Management task and the Aurora task, each at one signal-to-noise ratio. It is felt that these conditions provide an interesting balance between performance and difficulty of the tasks. The numbers will illustrate why to compensate models for noise with (1) Joint uncertainty decoding; (2) full covariance Joint uncertainty decoding; and (3) feature space transformations.

In these noise conditions the clean models perform badly. Word error rates for the diagonal Joint transformation, estimated on stereo data, make the case for model compensation for noise with Joint uncertainty decoding. However, the fact that the diagonal version does not perform as well as the one with a full covariance bias indicates the importance of full covariances in noise.

The matched system table shows word error rates of single-pass retrained systems (see the section on single-pass retraining). These systems effectively provide upper bounds on the performance of different forms of model compensation. The matched diagonal covariance systems have the maximum obtainable performance with diagonal covariance matrices. The word error rates of the matched semi-tied covariance systems form the motivation for the subject of this work: finding optimal feature space transformations for noise robustness.

Table: Matched system word error rates (in %) for diagonal covariance and semi-tied covariance systems at the two signal-to-noise ratios considered. Numbers for diagonal systems are reproduced from Liao and Gales (a).

Table: Values for the auxiliary function at the start, after half an iteration, and at the end of training semi-tied transformations on the Resource Management corpus. See the corresponding figure for the graphs.

Predictive semi-tied covariance matrices

The section on semi-tied covariance matrices has shown how to estimate them. The procedure alternates between finding the new transformation $A_P^{(r)}$ and the new diagonal covariance $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$. Updating both constitutes an iteration. The latter half of an iteration is one matrix chain multiplication, but the former half iteratively updates the rows of the transformation in an inner loop. The same number of iterations is used for the inner and the outer loop in these experiments; higher values did not improve the results noticeably.

Panel (a) of the figure depicts the value of the auxiliary function, which gives a lower bound on the log-likelihood, while estimating a predictive semi-tied covariance transformation. It is clear that the large leap forward is taken during the first update of $A_P^{(r)}$, the first half iteration. It is unclear from panel (a), however, whether the iterations after that add anything. Panel (b) therefore contains the graph from that point only, and shows the monotonic increase that expectation-maximisation guarantees. The auxiliary function table has numbers on the great improvement in the first half iteration, and the small improvement after that.

Figure: Values for the auxiliary function (log-likelihood against outer training iteration) while training semi-tied transformations on the Resource Management corpus. (a) The auxiliary function over all iterations, from the initial value. (b) The auxiliary function from the first half iteration onwards. See the auxiliary function table for the numbers.

Testing the different transformations shows that the performance follows the log-likelihood. The Resource Management results table below has the results. The initial situation, with $A_P^{(r)} = I$, results in a somewhat strange Joint transform, with a full $A_J^{(r)}$ and a diagonal $\tilde{\Sigma}_{\mathrm{diag}}^{(m)}$, which is known to lead to abysmal performance (Liao and Gales). Taking the first leap in log-likelihood, by running only half an iteration to find the transformation of the half-iteration scheme, also yields a large leap in recognition accuracy. Running the remaining iterations, so that the transformation turns out as in the full scheme, gives only a small further absolute reduction in word error rate.

Table: Word error rates (in %) comparing Joint uncertainty decoding transformations (diagonal and full) with predictive semi-tied covariance matrices (initial, half, and full) on the Resource Management corpus. Numbers in the top half are baseline numbers from the reference system table.

Table: Word error rates (in %) comparing Joint uncertainty decoding transformations (diagonal and full) with predictive semi-tied covariance matrices (half and full) on Aurora at four signal-to-noise ratios. Numbers in the top half are reproduced from Liao and Gales (a).

Perhaps surprisingly, estimating the predictive semi-tied covariance matrices with the full scheme yields a better performance than the full Joint transformation, the form that was approximated in the first place. The paradox can be resolved by the observation that the semi-tied transformation can optimise the feature space in which the calculations take place, which Joint transforms cannot do. Indeed, the performance ceiling for the predictive semi-tied system was given in the matched system table. With only a small absolute difference from that ceiling, the predictive version based on the Joint uncertainty decoding transform performs very well.

The Aurora results table has word error rates at varying signal-to-noise ratios. Comparing the higher signal-to-noise ratios with the lower, more extreme, ones, where the covariance bias should have a larger influence, two observations can be made. First, both forms of semi-tied covariance matrices follow full Joint uncertainty decoding closely, whereas the gap between diagonal and full Joint becomes larger. Second, with a heavier covariance bias a transformation of the diagonalised covariance does become more desirable. But, again, predictive schemes are able to keep up with the performance of the original Joint transformation.

Figure: Values of the auxiliary function (log-likelihood against training iteration) while training covariance MLLR transformations on the Resource Management corpus. (a) The auxiliary function over all iterations. (b) The auxiliary function after the first iteration.

Table: Word error rates (in %) comparing Joint uncertainty decoding transformations (diagonal and full) with predictive MLLR transforms on the Resource Management corpus. Numbers in the top half are baseline numbers from the reference system table.
Table: Word error rates (in %) comparing Joint uncertainty decoding transformations (diagonal and full) with predictive linear transformations on Aurora at four signal-to-noise ratios. Numbers in the top half are reproduced from Liao and Gales (a).

Transformations without covariance bias

The predictive semi-tied covariance schemes retain a covariance bias. The sections on predictive CMLLR and predictive covariance MLLR have introduced approximations that do not require this, and are therefore faster (see the section on computational complexity).

Predictive covariance MLLR is basically the same as the half-iteration semi-tied covariance scheme, but without the bias. It is estimated in the same way. However, since the initial covariance is a worse approximation of the real covariance in noise, it takes more iterations to converge to the final transform. For the Resource Management experiments, performance stops increasing after a limited number of iterations. On the Aurora task, however, which has more extreme conditions, the experiments need more iterations than any other transformation. Panel (a) of the covariance MLLR figure shows the auxiliary function while estimating a predictive covariance MLLR transform, and panel (b) zooms in on everything but the first iteration.

The results on the Resource Management task are in the corresponding table. The more powerful of the two methods performs better than the other. The results on the Aurora task are in the table above. Surprisingly enough, the order is reversed there: the less powerful method consistently performs slightly better. This may be caused by the estimation of the more powerful transform, which uses expectation-maximisation, getting stuck in a local maximum.

Table: The total number of iterations ($I \times J$) performed to estimate the predictive transforms (CMLLR, covariance MLLR, half semi-tied, and full semi-tied) on the Resource Management and Aurora tasks.

Table: Word error rates (in %) for the salient data sets at different signal-to-noise ratios, for the clean system, the Joint systems (diagonal, full, and diagonal with a different number of regression classes), and the predictive systems (CMLLR, covariance MLLR, half semi-tied, and full semi-tied). Transformations use the same number of regression classes except as noted. Numbers in the top half are reproduced from Liao and Gales (a).

Practical considerations

The section on computational complexity has discussed the computational cost of the four proposed predictive estimation schemes; this chapter has discussed the gains. The objective of this work is to find a way to strike a balance between computational cost and precision. Joint uncertainty decoding with a full covariance bias is felt to be too slow; Joint uncertainty decoding with a diagonal covariance bias is felt to lose too much accuracy. Even when the number of regression classes is increased, the performance of the diagonal version tends to plateau.

The summary table compares the relevant systems on the salient data sets. Diagonal and full Joint with the standard number of regression classes are the variants between which the predictive methods strike a balance. In practice, diagonal Joint with a larger number of regression classes is the competitor in terms of efficiency. On the Aurora task with the more extreme noise condition, all linear transformation methods effect a gain in accuracy compared to diagonal Joint. This confirms the need for feature space transformations in more adverse noise conditions.

On the Resource Management corpus, only the predictive semi-tied covariance schemes can compete with the Joint uncertainty decoding transforms that use a covariance bias. The two schemes without covariance biases, predictive CMLLR and predictive covariance MLLR, cannot keep up with even diagonal Joint transformations.

Predictive covariance MLLR is unlikely to be used in practice: the means of the models need to change anyway. While the adaptation scheme is at it, it might as well transform the variances too and get the reduction in error rate. Also, as the table of iteration counts shows, the total number of iterations under lower signal-to-noise ratios is higher for predictive covariance MLLR.
The efficiency of adaptation with predictive CMLLR, however, does not depend on the number of models at all. Since it is so efficient, it may be an alternative to diagonal Joint in embedded systems where computational speed is an important issue, or as an initialisation for adapting transforms.

Conclusion

This work has looked into predictive linear transformations for noise robustness. On-line speech recognisers, say systems embedded in cars, need to cope with changing noise conditions. Joint uncertainty decoding can estimate a noise transform from only a few frames of noise. However, adapting the models and decoding is slow. A linear adaptation method such as CMLLR has little impact on decoding speed, but it needs noisy speech data to train on, so that recognition cannot start right away. This work, by estimating a Joint transform and using the statistics predicted by it to find a linear transform, has combined the instant availability of the one with the high throughput of the other.

Various predictive linear transformation schemes have been presented. The outcome is that the cost of decoding with Joint transforms can be effectively combated in two ways. Predictive semi-tied covariance matrices make it possible to use diagonal covariance matrices rather than Joint uncertainty decoding's full ones. Predictive CMLLR completely removes the need to change the models.

The presented schemes have been tested on artificially corrupted data. Predictive semi-tied covariance matrices have yielded word error rates similar to the original full Joint transformation, sometimes even better. Predictive CMLLR has shown word error rates around those of diagonal Joint transformations, but without the cost of adapting the models. In conclusion, Joint uncertainty decoding's trademark covariance bias has proven to be indispensable for optimal performance; but with the right feature space transformation it does not have to be full.
Future work

Estimating predictive transforms is essentially taking the long way around. The advantage is that it needs only a few frames of data to estimate the intermediate transform, which is found by Joint uncertainty decoding in this work.

Artificially corrupted data has various advantages as a testbed, not the least of which is that it is easy to control the circumstances. For testing predictive transforms this means that a transform of the same form can be directly estimated on the data, providing the maximum obtainable performance. If estimating transforms predictively from other transforms is the long way around, then estimating them directly on stereo data is the shortcut. The difference between the two quantifies the difference between the predictive transform and the ideal transform.

A comparison of single-pass retrained and predictively trained systems shows that feature space transformations estimated from Joint uncertainty decoding are almost as good as their ideal counterparts under laboratory conditions. The next step is to apply the techniques pioneered in this work to real-world conditions, and to Joint transformations estimated with, for example, Joint adaptive training (Liao and Gales b). This will be the starting point for further research.

Bibliography

Alex Acero ( ). Acoustical and Environmental Robustness in Automatic Speech Recognition. Ph.D. thesis, Carnegie Mellon University.

Alex Acero, Li Deng, Trausti Kristjansson, and Jerry Zhang ( ). “HMM adaptation using vector Taylor series for noisy speech recognition.” In Proceedings of the International Conference on Spoken Language Processing. vol. , pp. – .

Beverly Collins and Inger Mees ( ). The Phonetics of English and Dutch. Brill, Leiden.

A. P. Dempster, N. M. Laird, and D. B. Rubin ( ). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society ( ), pp. – .

V. V. Digalakis, D. Rtischev, and L. G. Neumeyer ( ). “Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures.” IEEE Transactions on Speech and Audio Processing ( ), pp. – .

M. J. F. Gales and P. C. Woodland ( ). “Mean and Variance Adaptation within the MLLR Framework.” Computer Speech and Language , pp. – .

M. J. F. Gales and S. J. Young ( ). “Robust speech recognition in additive and convolutional noise using parallel model combination.” Computer Speech and Language , pp. – .

M. J. F. Gales and S. J. Young ( ). “Robust continuous speech recognition using parallel model combination.” IEEE Transactions on Speech and Audio Processing ( ), pp. – .

Mark J. F. Gales ( ). Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. thesis, Cambridge University.

Mark J. F. Gales ( a). “Maximum Likelihood Linear Transformations for HMM-based Speech Recognition.” Computer Speech and Language ( ), pp. – .

Mark J. F. Gales ( b). “Predictive Model-Based Compensation Schemes for Robust Speech Recognition.” Speech Communication ( - ), pp. – .

Mark J. F. Gales ( ). “Semi-Tied Covariance Matrices for Hidden Markov Models.” IEEE Transactions on Speech and Audio Processing ( ), pp. – .

Hynek Hermansky ( ). “Perceptual linear predictive (PLP) analysis of speech.” The Journal of the Acoustical Society of America ( ), pp. – .

Hans-Günter Hirsch and David Pearce ( ). “The AURORA experimental framework for the performance evaluation of speech recognition systems under noise conditions.” In Proceedings of - . pp. – .

J. Junqua and Y. Anglade ( ). “Acoustic and perceptual studies of Lombard speech: application to isolated-word automatic speech recognition.” In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. pp. – .

Do Yeong Kim, Chong Kwan Un, and Nam Soo Kim ( ). “Speech recognition in noisy environments using first-order vector Taylor series.” Speech Communication , pp. – .

C. J. Leggetter and P. C. Woodland ( ). “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models.” Computer Speech and Language ( ), pp. – .

Hank Liao and Mark J. F. Gales ( ). “Uncertainty Decoding for Noise Robust Automatic Speech Recognition.” Tech. Rep. CUED/F-INFENG/TR. , Cambridge University Engineering Department.

Hank Liao and Mark J. F. Gales ( ). “Uncertainty Decoding for Noise Robust Speech Recognition.” In Proceedings of Interspeech.

Hank Liao and Mark J. F. Gales ( a). “Issues with Uncertainty Decoding for Noise Robust Speech Recognition.” Tech. Rep. CUED/F-INFENG/TR. , Cambridge University Engineering Department.

Hank Liao and Mark J. F. Gales ( b). “Joint Uncertainty Decoding for Robust Large Vocabulary Speech Recognition.” Tech. Rep. CUED/F-INFENG/TR. , Cambridge University Engineering Department.

Pedro J. Moreno ( ). Speech Recognition in Noisy Environments. Ph.D. thesis, Carnegie Mellon University.

P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett ( ). “The DARPA 1000-word resource management database for continuous speech recognition.” In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. vol. , pp. – .

A. J. Viterbi ( ). “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm.” IEEE Transactions on Information Theory , pp. – .

S. J. Young and P. C. Woodland ( ). “State clustering in HMM-based continuous speech recognition.” Computer Speech and Language ( ), pp. – .

Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland ( ). “The HTK book (for HTK Version . ).” URL http://htk.eng.cam.ac.uk/docs/docs.shtml.