The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
Rio Akasaka ’09, Youngmoo Kim, Ph.D*
Department of Linguistics/Engineering, Swarthmore College *Drexel University
Introduction

Mel-Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are derived by obtaining the Fourier transform of a signal and mapping the result onto the mel scale, which is an auditory perception-based scale of pitch differences. With these unique labels on speech files, the similarity between two files can be determined by the Kullback–Leibler (KL) distance, which is based on probability distributions, and, given a training set upon which to base one’s decisions, the corresponding speaker can be identified.

The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:

1) using segments of a whole speech file,
2) varying the number of speech files used, and
3) splicing together the vowel phones of a speech segment.

The TIMIT Corpus

The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and consists of time-aligned orthographic, phonetic and word transcriptions for each of the 6300 16-bit, 16 kHz speech files. 630 speakers from the 8 major dialects of American English each read from 10 ‘phonetically rich’ texts, among which 2 are common across all speakers.

PHONETIC
10160 10733 y
10733 11880 axr
10160 11880 your
ORTHOGRAPHIC
0 57140 She had your dark suit in greasy wash water all year.

[Figure: distribution of the phonemes in TIMIT. Values shown: 1.676, 592]

In order to investigate the distribution of the phonemes in TIMIT, the plot shown above was generated. The average sample length is

The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is

Results

The following nomenclature is adopted in this poster:
F: Full (complete) speech file
H: Speech file segmented at middle
V: File consisting of vowel phones only

Train: 5F, Test: 5F (not including SA)
Evaluated: 570 Correct: 493, Percentage: 0.864912

Difference in number

Reducing the numbers of training and testing files so that they are consistent results in an optimal (~84.2%) success rate, but only up to 3 files.

Train: 3F, Test: 5F
Evaluated: 570 Correct: 440, Percentage: 0.771930
Train: 5F, Test: 3F
Evaluated: 342 Correct: 292, Percentage: 0.853801
Train: 3F, Test: 3F
Evaluated: 342 Correct: 290, Percentage: 0.847953

Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.

Difference in size

Performance is considerably better with full speech files during testing, regardless of which half of the file we use.

Train: 5F, Test: 5H
Evaluated: 570 Correct: 327, Percentage: 0.573684
Train: 5H, Test: 5F
Evaluated: 570 Correct: 442, Percentage: 0.775439
Train: 5H, Test: 3F
Evaluated: 342 Correct: 277, Percentage: 0.809942

Vowels only

Individual phonemes are exceedingly difficult to use when predicting speaker based on entire wav files.

Train: 5V, Test: 3F
Evaluated: 342 Correct: 242, Percentage: 0.707602
Train: 5F, Test: 5V
Evaluated: 570 Correct: 53, Percentage: 0.092982

Vowels only, cont.

If individual phoneme files are used for training and testing, the results are impressive.

Train: 5V, Test: 5V
Evaluated: 570 Correct: 533, Percentage: 0.935088
Train: 3V, Test: 3V
Evaluated: 342 Correct: 278, Percentage: 0.812865

Figure 1. Speaker recognition based on 144 vowel-based files.

In order to extract more information about the role that individual phones play in speaker recognition, the same algorithm was applied to test recognition based on individual phonemes that are extracted from each speaker. The training set consists of files containing only file segments for a particular phoneme, which are then later tested individually.

Figure 2. Speaker prediction based on individual phonemes. The results show that while speaker recognition based on individual phonemes is quite low (μ=3.60%, σ=2.34), the diagonal does show slightly higher recognition rates, as would be expected.

Figure 3. To test the possibility that one speaker is consistently retrieved as the ideal candidate for a particular phoneme, the above plot was generated to plot the predicted speaker vs the actual speaker based on speaker ID. Speaker 183 is selected most often in the above scenario.

Conclusions

While optimal performance in speaker recognition is expected with a larger training set, the availability of testing material did not seem to affect performance if at least three files are used and if the number of training files is equal to or greater than the number of testing files. Though this might be expected to extend to the length of the wav files, it was not necessarily the case, because using half a file to test consistently demonstrated poor results.

Most importantly, testing and training with vowel phones only provided impressive recognition rates at approximately 93%, meriting further study.

With regard to individual phone contributions to recognition, it was found that a single phoneme does not predict a speaker more effectively when using the same phoneme to train, as compared to any other phoneme.

However, two phonemes consistently outperform the others in predicting a speaker: 'ae' and 'ay'. Of the five trials, 'ae' was ranked most highly recognized 3 times, 'ay' was highest twice, and both were among the top two in four of the trials. More tests are being done to obtain a statistically significant conclusion.

Literature cited

Cole, Ronald A., et al. 1996. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech.
Van Heerden, C.J., E. Barnard. 2008. Speaker-specific variability of Phoneme Durations.
Fattah, Mohamed, Fuji Ren, Shingo Kuroiwa. 2006. Phoneme Based Speaker Modeling to Improve Speaker Identification.

Acknowledgments

Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.

For further information

Please contact email@example.com. Further details about the methodology may be read online at wiki.rioleo.org
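
The recognition procedure summarized in this poster (describe each file by its MFCC frames, compare files by KL distance, pick the closest training speaker, and tally Evaluated/Correct/Percentage) might be sketched as follows. The poster does not say how the probability distributions are modeled, so this sketch assumes a single diagonal-covariance Gaussian per file and a symmetrized KL divergence. Random numbers stand in for real MFCC frames, and every function and speaker name here is hypothetical.

```python
import math
import random
from statistics import fmean, pvariance

def fit_diag_gaussian(frames):
    """Return per-coefficient (means, variances) for a list of feature frames."""
    dims = list(zip(*frames))  # transpose: one tuple per coefficient
    means = [fmean(d) for d in dims]
    # A small floor keeps every variance strictly positive.
    variances = [max(pvariance(d), 1e-6) for d in dims]
    return means, variances

def kl_diag(p, q):
    """KL(p || q) for two diagonal-covariance Gaussians (assumed model)."""
    total = 0.0
    for mp, vp, mq, vq in zip(p[0], p[1], q[0], q[1]):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def symmetric_kl(p, q):
    """Symmetrized KL, so the distance does not depend on argument order."""
    return kl_diag(p, q) + kl_diag(q, p)

def identify(test_model, train_models):
    """Return the speaker whose training model is closest to test_model."""
    return min(train_models,
               key=lambda spk: symmetric_kl(test_model, train_models[spk]))

def evaluate(test_items, train_models):
    """test_items: list of (true_speaker, model) pairs."""
    correct = sum(identify(m, train_models) == spk for spk, m in test_items)
    return len(test_items), correct, correct / len(test_items)

def fake_frames(rng, centre, n=200, dim=13):
    """Stand-in for MFCC frames: Gaussian noise around a speaker-specific centre."""
    return [[rng.gauss(centre + k, 1.0) for k in range(dim)] for _ in range(n)]

rng = random.Random(0)
centres = {"spk_a": 0.0, "spk_b": 5.0, "spk_c": 10.0}  # hypothetical speakers
train_models = {s: fit_diag_gaussian(fake_frames(rng, c)) for s, c in centres.items()}
test_items = [(s, fit_diag_gaussian(fake_frames(rng, c))) for s, c in centres.items()]

evaluated, correct, fraction = evaluate(test_items, train_models)
print(f"Evaluated: {evaluated} Correct: {correct}, Percentage: {fraction:.6f}")
```

The "Percentage" values on this poster are plain fractions of correct over evaluated, for example 533/570 for the 5V/5V condition.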
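
TIMIT phonetic transcriptions give a start sample, an end sample, and a label on each line, as in the "10160 10733 y" excerpt on this poster. One way a vowel-only (V) file could be assembled is to concatenate the sample ranges whose labels are vowels. The sketch below assumes that approach. The vowel set is an assumed subset of TIMIT labels, since the poster does not list the one it used, and the function names and the placeholder waveform are hypothetical.

```python
# Assumed vowel labels; the poster does not state which labels counted as vowels.
VOWELS = {"iy", "ih", "eh", "ey", "ae", "aa", "aw", "ay", "ah", "ao",
          "oy", "ow", "uh", "uw", "er", "ax", "ix", "axr"}

def parse_segments(text):
    """Parse 'start end label' lines into (start, end, label) tuples."""
    segments = []
    for line in text.strip().splitlines():
        start, end, label = line.split(maxsplit=2)
        segments.append((int(start), int(end), label))
    return segments

def splice_vowels(samples, segments, vowels=VOWELS):
    """Concatenate the sample ranges whose label is in `vowels`."""
    spliced = []
    for start, end, label in segments:
        if label in vowels:
            spliced.extend(samples[start:end])
    return spliced

# The two phone lines shown in the poster's transcription excerpt.
phn_text = """
10160 10733 y
10733 11880 axr
"""
segments = parse_segments(phn_text)

# Placeholder waveform: sample i simply has value i (57140 is the excerpt's end sample).
samples = list(range(57140))
vowel_only = splice_vowels(samples, segments)
print(len(vowel_only), "samples kept;", len(vowel_only) / 16000, "seconds at 16 kHz")
```

Here only the "axr" segment is kept, because "y" is a semivowel and is not in the assumed vowel set.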