
Evaluation of state of the art speaker adaptation algorithms


O. Noah, D. Mashao
Speech Technology Research Group, Department of Electrical Engineering, University of Cape Town, Rondebosch, 7701, South Africa

Abstract: This work-in-progress paper evaluates the performance of three speaker adaptation algorithms for speech recognition systems. While speaker-independent (SI) systems perform well, speaker-dependent (SD) systems model speaker-specific features and therefore perform better for their target speaker, so it is desirable to have an SD system for a specific speaker. Training a large-vocabulary SD system from scratch is difficult, but speaker adaptation offers a quicker route to an SD system: it requires comparatively little data. The algorithms to be evaluated are MAP, MLLR and DLLR. MAP and MLLR are used in many systems, while DLLR is new. Speaker adaptation, as mentioned in [4], offers commercial and time benefits, since it avoids the large amounts of data otherwise needed to train an SD system for a new environment (a new speaker, or new surroundings). The main aim of this evaluation is to compare the time the systems take to adjust to the adaptation data.

1. INTRODUCTION

This paper describes work done on the evaluation of three state-of-the-art speaker adaptation algorithms. Speech recognition systems can be grouped into speaker-dependent (SD) systems and speaker-independent (SI) systems. An SD system models speaker-specific data, which makes it suitable for use by one specific person. An SI system models many different speakers, and thus models the variation between them; it is suitable for use by many different people. For SI systems, a database consisting of many different people's utterances, such as Switchboard, is used for training, so the system is trained on many different speakers. Having different people means there will be variation in the speech [5], and this variability can cause difficulties for the recognition process.
Such a system can still yield good results, judging from the 7.2% word error rate (WER) obtained by [9] in a test run after mixture incrementing, as explained in [1]. Further improvements that reduce variability [6],[7] can also be applied. An SD system, by contrast, is trained on a database from a single speaker; with less variability in the data, results of about 1.8% WER have been found [4].
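WER figures like the ones above are computed from the word-level edit distance between a reference transcript and the recogniser's hypothesis. A minimal sketch, with illustrative test strings (not from this paper):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    via the standard Levenshtein dynamic programme over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()    # one word deleted
print(wer(ref, hyp))                  # 1 error over 6 reference words
```

Scoring tools used with HTK-style systems compute essentially this quantity over whole test sets.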

To train a large-vocabulary SI or SD system, a large amount of data is needed. Problems arise from the availability of time [8], the availability of people for an SI system, and insufficiency of data, and such problems affect the choice of system. More importantly, an SD system performs better than an SI system for a given speaker [4], so to get better performance an SD system can be built in place of an SI system, depending on circumstances. Given the data requirements, however, this is very time-consuming, and it can be expensive if training has to be done over a communication medium such as a telephone line [2]. If speech recognition is to be done where the speaker is not known in advance, an SI system may be suitable; but this is not ideal, since an SI system does not model a particular speaker's acoustics as accurately as an SD system would [4], due to the speaker variability mentioned above.

Speaker adaptation overcomes these difficulties: it requires relatively little data, compared with full training, to generate an improved speech recognition system.

This paper is divided into four main sections. Section 1 introduces speaker adaptation. Section 2 briefly describes the three algorithms under evaluation. Section 3 explains the experimental set-up used. Section 4 presents future work.

2. ADAPTATION ALGORITHMS

In this evaluation we explore three speaker adaptation algorithms. Speaker adaptation adapts the HMM parameters of one system to the current environment; in our case, we adapt the parameters of an SI system towards those of an SD system. The adaptation process rests on the basic assumption that what chiefly differentiates speakers is the means of the HMM Gaussians [5]; hence much research focuses on adapting the means.
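As a toy illustration of this mean-only view of adaptation, the sketch below nudges one SI Gaussian mean towards the sample mean of a new speaker's frames. The interpolation weight `tau` and all the numbers are made up for illustration; this is a MAP-flavoured update, not the exact formulation used by any of the evaluated systems:

```python
def adapt_mean(mu_si, speaker_frames, tau=10.0):
    """Shift one SI Gaussian mean towards the sample mean of the
    adaptation frames; tau controls how much the SI prior is trusted."""
    n = len(speaker_frames)
    dim = len(mu_si)
    xbar = [sum(f[d] for f in speaker_frames) / n for d in range(dim)]
    # Weighted interpolation between the SI mean and the speaker's mean:
    # with little data (small n) the SI mean dominates; with lots of
    # data the speaker's own statistics take over.
    return [(tau * mu_si[d] + n * xbar[d]) / (tau + n) for d in range(dim)]

mu_si = [0.0, 0.0]                     # toy 2-dimensional SI mean
frames = [[1.0, 2.0], [3.0, 2.0]]      # two adaptation frames
print(adapt_mean(mu_si, frames, tau=2.0))
```

With only two frames and `tau=2.0`, the adapted mean lands halfway between the SI mean and the speaker's sample mean, which matches the intuition that sparse data should move the model only cautiously.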
We look at maximum a posteriori (MAP) adaptation, maximum likelihood linear regression (MLLR), and discounted likelihood linear regression (DLLR).

MAP adaptation, also known as a form of Bayesian adaptation, is used to adapt SI HMM parameters to suit a newly enrolled speaker. It relies on prior information about the HMM model parameters being available before adaptation happens, and uses the parameters associated with the maximum a posteriori probability as the Bayes estimate [5]. It can happen that prior information is not available, as [5] has shown. When there is little data, the adaptation will not change the means by much; and since the MAP algorithm updates every single mean in the system, it needs large amounts of data to work well [1].

MLLR is a transformation-based adaptation algorithm [5]: transformations are generated depending on how much data is available from the enrolling speaker. If there is enough data, the algorithm generates enough transformations to transform the means individually; if not, a global transformation is applied to all components [1] in the hope of improving the system. Since adaptation data is usually scarce, this behaviour makes MLLR effective in practice, and the algorithm also exploits spectral differences between speakers [5]. Should the data be severely limited, however, the algorithm over-trains and gives poor transformations.

Finally, there is DLLR, which targets cases where there is insufficient data for MLLR adaptation [3]. Since MLLR over-trains when data is insufficient, DLLR introduces discounting as a means of stopping the algorithm before over-training occurs, adding a discounted likelihood term to the MLLR transformation equation as described in [3]. The main benefit of DLLR is that, unlike MLLR, which assigns zero mass to unseen data, DLLR assigns non-zero weight to unseen data, which is partly why it avoids over-training. There are two flavours of DLLR: one based on moment interpolation and one based on sample counts of the data.

3. EXPERIMENTAL SETUP

Adaptation experiments are to be performed within the HTK framework, and the TIMIT database is used for training.
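The global-transform case of MLLR described in Section 2 rewrites every Gaussian mean as mu' = A*mu + b, usually written as one matrix W applied to the extended mean [1, mu]. The sketch below only applies a given W to toy 2-dimensional means; the maximum-likelihood estimation of W from adaptation data is omitted, and W itself is made up for illustration:

```python
def apply_mllr(W, mu):
    """Apply a global MLLR transform W (d x (d+1)) to one mean vector:
    mu' = W @ [1, mu], i.e. an affine map A @ mu + b with the bias b
    stored in W's first column."""
    xi = [1.0] + list(mu)                       # extended mean vector
    return [sum(W[r][c] * xi[c] for c in range(len(xi)))
            for r in range(len(W))]

# Illustrative transform: scale both dimensions by 1.1,
# then shift by (0.5, -0.2).
W = [[0.5, 1.1, 0.0],
     [-0.2, 0.0, 1.1]]
means = [[1.0, 2.0], [3.0, -1.0]]
adapted = [apply_mllr(W, mu) for mu in means]
print(adapted)
```

Because a single W is shared by all components, even means that were never observed in the adaptation data get moved, which is why the global transform helps when enrolment data is scarce.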
Testing uses a variety of custom databases. In the test set-up it is important to first have a trained system, which is tested in the un-adapted condition; the system is then adapted and tested again with unseen data. This evaluates the true potential of adaptation, as well as how valuable it is in modern speech recognition. Results obtained so far are shown below.

Table 1: Un-adapted SI system
Acc: 95.8


Table 2: Adapted SI system
Acc: 97.5

The aim is to see how the system performs on unseen data after adaptation. The above tables show the difference in word-level performance between the un-adapted and adapted systems; both systems are tested with unseen data.

4. FUTURE WORK

Future work will look at improving speaker adaptation. The smoothing of parameter distributions prior to adaptation will be examined, speaker normalization techniques will be applied, and robustness over various communication channels will be investigated.

5. REFERENCES

[1] Woodland, P.C., Young, S.J. (1993). The HTK Tied-State Continuous Speech Recogniser. Proc. Eurospeech '93, Berlin.
[2] Power, K.J. Speech Technology for Telecommunications.
[3] Gunawardana, A., Byrne, W. (2000). Discounted Likelihood Linear Regression for Rapid Speaker Adaptation.
[4] Leggetter, C.J., Woodland, P.C. (1995). Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models. Computer Speech and Language.
[5] Nieuwoudt, C., Botha, E.C. (2001). Cross-language Use of Acoustic Information for Automatic Speech Recognition. Speech Communication.
[6] Westphal, M. The Use of Cepstral Means in Conversational Speech Recognition.
[7] Woodland, P.C., Pye, D. Experiments in Speaker Normalisation and Adaptation for Large Vocabulary Speech Recognition.
[8] Weber, K., Ikbal, S., Bengio, S., Bourlard, H. (2003). Robust Speech Recognition and Feature Extraction Using HMM2. Computer Speech and Language.
[9] Mahlanyane, N.S., Mashao, D.J. (2003). Using a Low-bit Speech Enhancement Adaptive Filter to Improve Speech Recognition Performance. Proc. PRASA 2003.
