                  Phoneme Recognition using
                      Temporal Patterns

                 Petr Schwarz, Pavel Matějka

            Brno University of Technology, Czech Republic
         OGI School of Science and Engineering at OHSU, USA

            E-mail: matejkap@feec.vutbr.cz, schwarzp@fit.vutbr.cz

September 23-24 2003           M4 meeting Delft

                       Outline
•   The goal
•   Experimental setup and system
•   Baseline experiment with MFCC and MFCC multi-frame
•   Comparison of conventional MFCC and novel TempoRAl
    Patterns (TRAPs) features under well matched and
    mismatched conditions
•   Optimization of TRAPs for our task
•   New three-band TRAPs system
•   Implementation and distribution of the SW
•   Conclusions and future work…

                         The goal
• For many applications, speech needs to be transcribed into
  discrete symbols.
• Goal: a very reliable phoneme recognizer, not only for the meeting
  domain
• No language constraints
• Suitable as a front end to LVCSR, keyword spotting,
  speaker recognition, language recognition or recognition of
  out-of-vocabulary words
            Comparison of several techniques for automatic
            recognition of unconstrained context-independent
            phonemes
                  Experimental setup
• Two databases: TIMIT and NTIMIT
       - all SA recordings are removed
       - both databases are down-sampled to 8000 Hz
       - 412 speakers for training, 50 for cross-validation (CV), 168 for testing
• The phoneme set contains 39 phonemes
       - very similar to the CMU/MIT phoneme set
       - closures are merged with the following burst (bcl b → b)
• The experimental system is an NN/HMM hybrid
       - the phoneme insertion penalty is tuned so that the numbers of
         inserted and deleted phonemes are equal
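
Tuning the insertion penalty requires counting inserted and deleted phonemes in the recognizer output. The sketch below shows one way such counts and the PER could be obtained. It assumes a plain minimum-edit-distance alignment and the usual definition PER = 100 · (S + D + I) / N. The function names are illustrative and are not taken from the authors' tools.

```python
def align_counts(ref, hyp):
    """Count (substitutions, deletions, insertions) in a minimum-edit-distance
    alignment of the hypothesis `hyp` against the reference `ref`."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, k)        # match, no cost
            else:
                diag = (e + 1, s + 1, d, k)
            e, s, d, k = cost[i - 1][j]
            delete = (e + 1, s, d + 1, k)
            e, s, d, k = cost[i][j - 1]
            insert = (e + 1, s, d, k + 1)
            # tuples compare on the total edit count first
            cost[i][j] = min(diag, delete, insert)
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def phoneme_error_rate(ref, hyp):
    """PER in percent, relative to the number of reference phonemes."""
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


ref = "pau hh ae l ow pau".split()
hyp = "pau hh ae ow pau".split()           # the "l" is missing
print(align_counts(ref, hyp), phoneme_error_rate(ref, hyp))
```

With counts like these accumulated over a tuning set, the penalty can be raised while insertions outnumber deletions and lowered in the opposite case.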

               Experimental system
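[Figure: block diagram. Feature extraction (MFCC, TRAPS) → posterior
 probability estimator (which classifier?) → Viterbi decoder. The
 estimator outputs one posterior per phoneme (aa, ae, …, z); the decoder
 outputs a phoneme string such as: pau hh ae l ow pau]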
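
The last stage of the system turns frame-level phoneme posteriors into a phoneme string. The following is a minimal sketch of such a Viterbi decoder, not the authors' implementation. It assumes one state per phoneme, a free self-loop, a fixed non-negative log-domain penalty for every switch to another phoneme, and raw posteriors used as scores without division by priors. The name `viterbi_decode` and its parameters are hypothetical.

```python
import numpy as np


def viterbi_decode(posteriors, phonemes, insertion_penalty=0.0):
    """Decode a (frames x phonemes) posterior matrix into a phoneme string.

    Assumption: one state per phoneme; staying in a phoneme is free,
    switching to a different one costs `insertion_penalty` (>= 0, log domain).
    """
    log_post = np.log(np.maximum(posteriors, 1e-10))
    n_frames, n_phn = log_post.shape
    score = log_post[0].copy()
    backptr = np.zeros((n_frames, n_phn), dtype=int)
    states = np.arange(n_phn)
    for t in range(1, n_frames):
        best_prev = int(np.argmax(score))
        switch = score[best_prev] - insertion_penalty
        stay = score >= switch             # is staying at least as good?
        backptr[t] = np.where(stay, states, best_prev)
        score = np.where(stay, score, switch) + log_post[t]
    # Trace the best state sequence back from the last frame.
    state = int(np.argmax(score))
    path = [state]
    for t in range(n_frames - 1, 0, -1):
        state = int(backptr[t, state])
        path.append(state)
    path.reverse()
    # Collapse runs of identical states into single phonemes.
    out = [phonemes[path[0]]]
    for prev, cur in zip(path, path[1:]):
        if cur != prev:
            out.append(phonemes[cur])
    return out


phonemes = ["pau", "hh", "ae"]
post = np.array([[0.8, 0.1, 0.1],
                 [0.4, 0.5, 0.1],          # one-frame "hh" blip
                 [0.8, 0.1, 0.1]])
print(viterbi_decode(post, phonemes, insertion_penalty=0.0))
print(viterbi_decode(post, phonemes, insertion_penalty=5.0))
```

A larger penalty suppresses short spurious phonemes, which reduces insertions at the cost of more deletions.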




    Which classifier, GMM or NN?
• HMM-GMM and HMM-NN systems with one-state models are compared
• Features: MFCC + Δ + ΔΔ
• The number of parameters is increased until the decrease
  in phoneme error rate (PER) becomes negligible (<0.5 %)

           System           PER [%]            Parameters
           GMM                 42.0             788736
           NN                  41.6              31200

     The NN does not degrade performance compared to the GMM
+ 2 % absolute by merging
           Single frame and multi-frame input
                with MFCC – FeatureNet
• Subsequent frames are joined together
• The size of the context is increased to find the minimum PER
• Hidden layers of 300, 400 and 500 neurons were tested: the
  differences are minimal, but 400 is the best
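
Joining subsequent frames can be sketched as follows. This is an illustration under stated assumptions: the context is symmetric around the current frame, and utterance edges are padded by repeating the first and last frame. The slides specify neither, and the function name `stack_frames` is hypothetical.

```python
import numpy as np


def stack_frames(features, context):
    """Join each frame with its neighbours: (T, D) -> (T, context * D).

    Assumptions: `context` is odd (symmetric window) and edges are padded
    by repeating the first/last frame.
    """
    if context % 2 != 1:
        raise ValueError("context must be an odd number of frames")
    half = context // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    # Column block k holds the frame at offset (k - half) from the centre.
    return np.hstack([padded[k:k + n_frames] for k in range(context)])


mfcc = np.random.randn(100, 39)            # stand-in for MFCC + Δ + ΔΔ
print(stack_frames(mfcc, 5).shape)         # 5 frames of 39 coefficients each
```

With 39 coefficients per frame, a 5-frame context gives the network a 195-dimensional input.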
[Plot: PER [%] (37 to 42) versus number of input frames (0 to 15);
 marked point: PER = 37.5 %]

           frames   PER [%]
           1         41.6
           5         37.5

                  TempoRAl Patterns
1. Frequency-localized posterior probabilities of phonemes
   are estimated from the temporal evolution of critical-band
   energies within a single critical band.
2. These estimates are fed to another class-posterior
   estimator, which estimates the overall phoneme probability
   from the probabilities in the individual critical bands.
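
The input to each band classifier in step 1 is the temporal trajectory of one critical-band energy around the current frame. The sketch below shows how such temporal vectors could be cut out of a matrix of critical-band energies. It assumes a 10 ms frame step and edge padding by repetition, neither of which is stated on the slides, and the function name is hypothetical. The classifiers themselves are not shown.

```python
import numpy as np


def extract_traps(band_energies, length_ms=1000, frame_step_ms=10):
    """Cut temporal patterns out of critical-band energies.

    band_energies: (T, B) array, one row per frame, one column per band.
    Returns a (T, B, L) array: for every frame and band, the trajectory of
    L frames centred on that frame.
    Assumptions: 10 ms frame step, edges padded by repeating the end frames.
    """
    half = (length_ms // frame_step_ms) // 2
    length = 2 * half + 1                  # odd, so the frame is centred
    padded = np.pad(band_energies, ((half, half), (0, 0)), mode="edge")
    n_frames = band_energies.shape[0]
    # traps[t, b, k] = padded[t + k, b]
    return np.stack([padded[k:k + n_frames] for k in range(length)], axis=-1)


energies = np.random.randn(200, 15)        # 200 frames, 15 bands (made up)
print(extract_traps(energies).shape)       # 1 s patterns
print(extract_traps(energies, length_ms=300).shape)
```

Under these assumptions a 1 s pattern spans 101 frames and a 300 ms pattern spans 31 frames.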


[Figure: band classifiers 1, 2, …, N, one per critical band]
             TRAP system scheme

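[Figure: TRAP system scheme. The temporal vector of each band is
 normalized (Norm) and classified; the band outputs are combined into
 phoneme posteriors (aa, ae, …, z) and decoded into a phoneme string
 such as: pau hh ae l ow pau]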




                MFCC and TRAP on
               well-matched conditions
• Training and testing data are from the same database
• MFCC multi-frame and 1 s long TRAPs perform similarly
• Improvement can be obtained when the length of the TRAP is optimized



 PER [%]                      TIMIT            NTIMIT
 MFCC39                         41.6             55.6
 MFCC39, 5 frames               37.5             49.0
 TRAP, 1 s                      37.9             49.6
    MFCC and TRAP on mismatched
             conditions
• Training and testing data are from different databases
• The TRAP system yields the best results in both mismatched
  conditions
• It is better to train the system on corrupted speech than
  on clean speech

 PER [%] (train/test)     TIMIT/NTIMIT         NTIMIT/TIMIT
 MFCC39                           80.9               63.4
 MFCC39, 5 frames                 80.1               75.7
 TRAP, 1 s                        75.0               56.6
             Effect of length of TRAP
• The original TRAP length was kept at 1 second to make sure that
  it covers all information about a phoneme in the critical band, but
  this length is not optimal
• A 300 ms long context is the best for the TIMIT database

[Plot: PER [%] (35 to 42) versus TRAP length, time [ms] (0 to 1000);
 marked point: PER = 36.1 %]
          Effect of mean and variance
                  normalization
• The experiment was performed on the original 1-second-long TRAPs
• Both normalizations cause significant degradation in
  well-matched conditions
• Mean normalization always helps in mismatched conditions;
  the benefit of variance normalization is less clear
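
The two normalizations can be sketched as below. This assumes that the statistics are computed over each temporal vector separately, that is, over the time axis of one TRAP. The slides do not say over which span the mean and variance are estimated, so treat this as one plausible reading. The function name is hypothetical.

```python
import numpy as np


def normalize_trap(trap, mean=True, variance=False, eps=1e-8):
    """Mean and/or variance normalization along the last (time) axis.

    Assumption: statistics are computed per temporal vector.
    """
    out = np.asarray(trap, dtype=float)
    if mean:
        out = out - out.mean(axis=-1, keepdims=True)
    if variance:
        # eps guards against division by zero for constant trajectories
        out = out / (out.std(axis=-1, keepdims=True) + eps)
    return out


trap = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(normalize_trap(trap))                         # mean only
print(normalize_trap(trap, variance=True))          # mean and variance
```

Subtracting the mean of a log-energy trajectory removes a constant offset in that band, which may explain why it helps when training and test channels differ.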

Normalization /
PER [%]                 TIMIT      NTIMIT    TIMIT/NTIMIT   NTIMIT/TIMIT
None                     37.9          49.6     75.0           56.6
Mean                     40.5          51.8     73.5           54.7
Mean & variance          42.6          53.2     74.8           54.1
   TRAP with more than one critical
               band
• The temporal vectors of three neighboring critical bands were
  merged together and sent to one classifier
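
Merging three neighboring temporal vectors can be sketched as a concatenation. The example assumes that the triples overlap, so bands 1-3, 2-4, 3-5 and so on each feed one classifier. The slides do not state whether the triples overlap, and the function name is hypothetical.

```python
import numpy as np


def three_band_vectors(traps):
    """Concatenate the temporal vectors of three neighbouring bands.

    traps: (T, B, L) array of per-band temporal patterns.
    Returns (T, B - 2, 3 * L): one merged vector per triple of bands.
    Assumption: triples overlap, shifting by one band at a time.
    """
    n_frames, n_bands, length = traps.shape
    if n_bands < 3:
        raise ValueError("need at least three bands")
    triples = [traps[:, b:b + 3, :].reshape(n_frames, 3 * length)
               for b in range(n_bands - 2)]
    return np.stack(triples, axis=1)


traps = np.random.randn(200, 15, 31)       # made-up sizes
print(three_band_vectors(traps).shape)
```

Each classifier input is three times longer than in the single-band system, which is one reason the three-band version is more expensive to run.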


[Figure: posterior probabilities of phonemes are estimated for
 each triple of bands]

                                    System         PER [%]
                                    TRAPS          36.1
                                    3-band TRAPS   33.7
  Implementation and distribution of
         the SW: phnrec

 • Early experiments were performed with a set of scripts
   interconnecting executables (trapper, QuickNet, HTK, …); these are
   still used for training.
 • Phoneme recognition is implemented in phnrec, which contains:
      – feature extraction (HTK-compatible MFCC, FeatureNet, TRAPS),
        from files or from a microphone
      – posterior-probability estimator (NN, compatible with QuickNet
        nets)
      – Viterbi decoder, which can also work on-line with a fixed delay
 • Very good as a black box for people who want to treat
   speech-to-phoneme transcription as a front end
                       phnrec (2)
• Source code for Linux and an EXE for Windows are available
  free of charge for research.
• Available with nets trained on US English (TIMIT) and
  Czech (SpeechDat-E).
• More languages to come (some language ID
  experiments are also running in Brno)
• Works on-line


http://www.fit.vutbr.cz/speech/sw/phnrec.html


                       Conclusion
• A TRAP-based phoneme recognizer was built and compared
  with MFCC.
• Properties of TRAPs were studied and TRAPs were
  optimized for phoneme recognition.
• A new multi-band TRAPs approach was tested and its
  benefit was demonstrated.
• The recognizer was successfully evaluated in a language
  identification task.
• Easy-to-use software was written and is available to the
  research community.

                       But …
• Adaptation to meeting data is necessary (training on clean
  TIMIT is not good at all), followed by an update of the
  distribution on the web.
• Tests on ICSI, IDIAP and Brno data (which phonemes are
  going to work best for our "CzEnglish"?)
• Applications: LID already tested; keyword spotting and
  LVCSR remain (some papers at Eurospeech make use of
  phoneme strings).
• Phoneme lattices
• Real-time issues (the 1-band version runs fine on a
  reasonable machine, the 3-band version does not): pruning of NN weights?


                       THE END
• A demo during the break.
• Please download phnrec, test it and
  comment!
• Questions?



