The Robustness of MFCCs in Phoneme-Based Speaker Recognition using TIMIT
Rio Akasaka ’09, Youngmoo Kim, Ph.D.*
Department of Linguistics/Engineering, Swarthmore College
*Drexel University


Introduction

Mel-Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are derived by obtaining the Fourier transform of a signal and mapping the result on the mel-scale, which is an auditory perception-based scale of pitch differences. With these unique labels on speech files, the similarity between two files can be determined by the Kullback–Leibler (KL) distance, which is based on probability distributions, and, given a training set upon which to base one’s decisions, the corresponding speaker can be identified.

The goal of this research is to test the robustness of MFCCs in speaker detection by varying the testing and training parameters with the following methods:

1) using segments of a whole speech file
2) varying the number of speech files used, and
3) splicing together the vowel phones of a speech segment


The TIMIT Corpus

The TIMIT corpus was created as a joint effort between Texas Instruments (TI) and MIT and consists of time-aligned orthographic, phonetic and word transcriptions for each of the 6300 16-bit, 16 kHz speech files. 630 speakers from the 8 major dialects of American English each read from 10 ‘phonetically rich’ texts, among which 2 are common across all speakers.

PHONETIC
10160 10733 y
10733 11880 axr

WORD
10160 11880 your

ORTHOGRAPHIC
0 57140 She had your dark suit in greasy wash water all year.

In order to investigate the distribution of the phonemes in TIMIT, the plot shown above was generated. The average sample length is
  1.676
  592

The individual texts may be phonetically rich, but taken as a whole the distribution of the phonemes is unbalanced.


Results

The following nomenclature is adopted in this poster:
F: Full (complete) speech file
H: Speech file segmented at middle
V: File consisting of vowel phones only

Control
Train: 5F, Test: 5F (not including SA)
Evaluated: 570 Correct: 493, Percentage: 0.864912

Difference in number
Reducing both the number of training and testing files to be consistent results in an optimal (~84.2%) success rate, but only up to 3 files.
Train: 3F, Test: 5F
Evaluated: 570 Correct: 440, Percentage: 0.771930
Train: 5F, Test: 3F
Evaluated: 342 Correct: 292, Percentage: 0.853801
Train: 3F, Test: 3F
Evaluated: 342 Correct: 290, Percentage: 0.847953

Figure 1. Speaker recognition based on 144 vowel-based files
Figure 1. The predicted speaker ID plotted against the actual speaker, for 144 full speech files.

Difference in size
Performance is considerably better with full speech files during testing, regardless of which half of the file we use.
Train: 5F, Test: 5H
Evaluated: 570 Correct: 327, Percentage: 0.573684
Train: 5H, Test: 5F
Evaluated: 570 Correct: 442, Percentage: 0.775439
Train: 5H, Test: 3F
Evaluated: 342 Correct: 277, Percentage: 0.809942

Vowels only
Individual phonemes are exceedingly difficult to use when predicting speaker based on entire wav files.
Train: 5V, Test: 3F
Evaluated: 342 Correct: 242, Percentage: 0.707602
Train: 5F, Test: 5V
Evaluated: 570 Correct: 53, Percentage: 0.092982

Vowels only, cont.
If individual phoneme files are used for training and testing, the results are impressive.
Train: 5V, Test: 5V
Evaluated: 570 Correct: 533, Percentage: 0.935088
Train: 3V, Test: 3V
Evaluated: 342 Correct: 278, Percentage: 0.812865


Further examination

In order to extract more information about the role that individual phones play in speaker recognition, the same algorithm was applied to test recognition based on individual phonemes that are extracted from each speaker. The training set consists of files containing only file segments for a particular phoneme, which are then later tested individually.

Figure 2. Speaker prediction based on individual phonemes. The results show that while speaker recognition based on individual phoneme is considerably low (μ=3.60%, σ=2.34), the diagonal does show slightly higher recognition rates, as would be expected.

Figure 3. To test the possibility that one speaker is consistently retrieved as the ideal candidate for a particular phoneme, the above plot was generated to plot the predicted speaker vs the actual speaker based on speaker ID. Speaker 183 is selected most often in the above scenario.


Conclusions

While optimal performance in speaker recognition is expected with a larger training set, the availability of testing material did not seem to affect performance if at least three files are used and if the number of training files is equal to or greater than the number of testing files. Though this might be expected to extend to the length of the wav files, it was not necessarily the case because using half a file to test consistently demonstrated poor results.

Most importantly, testing and training with vowel phones only provided impressive recognition rates at approximately 93%, meriting further study.

With regards to individual phone contributions to recognition, it was found that a single phoneme does not predict a speaker more effectively when using the same phoneme to train, as compared to any other phoneme.

However, two phonemes consistently outperform the others in predicting a speaker: 'ae' and 'ay'. Of the five trials, 'ae' was ranked most highly recognized 3 times, 'ay' was highest twice, and both were among the top two in four of the trials. More tests are being done to obtain a statistically significant conclusion.


Literature cited

Cole, Ronald A., et al. 1996. The Contribution of Consonants Versus Vowels to Word Recognition in Fluent Speech.
Van Heerden, C.J., E. Bernard. 2008. Speaker-specific variability of Phoneme Durations.
Fattah, Mohamed, Ren Fuji, Shingo Kuroiwa. 2006. Phoneme Based Speaker Modeling to Improve Speaker Identification.


Acknowledgments

Grateful acknowledgement is made to Youngmoo Kim for providing insight and direction throughout my research and to Jiahong Yuan for encouraging my pursuit of corpus phonetics.


For further information

Please contact rakasak1@swarthmore.edu.
Further details about the methodology may be read online at wiki.rioleo.org
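
Illustrative sketch of the comparison step

The Introduction describes comparing speech files by the KL distance between probability distributions of their MFCCs and picking the corresponding speaker from a training set. The poster does not say how those distributions were modeled, so the sketch below makes several assumptions that are not from the original work: each file is summarized by a single multivariate Gaussian over its MFCC frames, the distance is the symmetrized KL divergence, and random numbers stand in for real MFCC matrices. The names fit_gaussian, kl_distance, identify_speaker and fake_mfcc are hypothetical.

```python
import numpy as np


def fit_gaussian(frames):
    """Summarize a (n_frames, n_coeffs) feature matrix by its mean and covariance."""
    mean = frames.mean(axis=0)
    # Assumption: a small diagonal term keeps the covariance invertible.
    cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mean, cov


def kl_gaussian(p, q):
    """KL(p || q) for two multivariate Gaussians given as (mean, cov) pairs."""
    mu_p, cov_p = p
    mu_q, cov_q = q
    k = mu_p.shape[0]
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - k
                  + logdet_q - logdet_p)


def kl_distance(p, q):
    """Symmetrized KL divergence, used here as the 'KL distance' (an assumption)."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)


def identify_speaker(test_model, training_models):
    """Return the training speaker whose model is closest to the test model."""
    return min(training_models,
               key=lambda spk: kl_distance(test_model, training_models[spk]))


# Synthetic stand-in for MFCC matrices: 200 frames of 13 coefficients each.
rng = np.random.default_rng(0)


def fake_mfcc(center, n_frames=200, n_coeffs=13):
    return rng.normal(loc=center, scale=1.0, size=(n_frames, n_coeffs))


training_models = {
    "spk_a": fit_gaussian(fake_mfcc(0.0)),
    "spk_b": fit_gaussian(fake_mfcc(3.0)),
}
test_model = fit_gaussian(fake_mfcc(3.0))
predicted = identify_speaker(test_model, training_models)
print("Predicted speaker:", predicted)
```

With real data, each fake_mfcc call would be replaced by the MFCC matrix of one training or testing file, and counting how often the predicted speaker matches the true one yields figures of the same form as the "Evaluated / Correct / Percentage" lines reported in the Results.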