					Pitch Prediction for Glottal Spectrum Estimation
    with Applications in Speaker Recognition



                  Nengheng Zheng
           Supervised by Professor P.C. Ching




                       Nov. 26, 2004
                           Outline

• Speech production and glottal pulse excitation in detail

• Linear prediction: short-term and long-term

• Glottal spectrum estimated with long-term prediction, and
  acoustic features

• Application to speaker recognition
                                           Speech Production

Discrete time model for speech production

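  [Block diagram: an impulse train generator drives the glottal pulse
   model G(z), whose output is scaled by the gain A_V; a random noise
   generator is scaled by the gain A_N. The excitation u(n) passes
   through the vocal tract model V(z) and the radiation model R(z) to
   give the speech signal s(n).
   Inset: glottal pulses -> vocal tract -> speech signal.]

  A combined transfer function

    $H(z) = G(z)\,V(z)\,R(z)$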
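
A minimal sketch of the voiced branch of this model, H(z) = G(z)V(z)R(z), is given below. The slides do not specify the filters, so everything concrete here is an assumption made for illustration: G(z) is a double real pole (roughly -12 dB/octave), V(z) is a single 500 Hz resonance, R(z) is a first difference, and the sampling rate is 8 kHz. The function names are hypothetical.

```python
import math

def iir_filter(b, a, x):
    """Direct-form IIR filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def synthesize_voiced(n_samples, pitch_period, gain=1.0, fs=8000):
    """Impulse train -> G(z) -> V(z) -> R(z); all filter choices are assumptions."""
    # Impulse train generator scaled by the voicing gain A_V.
    e = [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    # G(z): assumed double real pole, giving roughly -12 dB/octave decay.
    g_pole = 0.95
    u = iir_filter([1.0], [1.0, -2.0 * g_pole, g_pole ** 2], e)
    # V(z): assumed single formant at 500 Hz with 80 Hz bandwidth.
    r = math.exp(-math.pi * 80.0 / fs)
    theta = 2.0 * math.pi * 500.0 / fs
    s_v = iir_filter([1.0], [1.0, -2.0 * r * math.cos(theta), r * r], u)
    # R(z): first difference, a common approximation of lip radiation.
    return iir_filter([1.0, -1.0], [1.0], s_v)
```

Calling `synthesize_voiced(1200, 80)` gives a waveform that becomes periodic with an 80-sample pitch period once the filter transients have died out.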
          Acoustic Features of Glottal Pulse

• Time domain
   –   pitch period
   –   pitch period perturbation (jitter)
   –   pulse amplitude perturbation (shimmer)
   –   glottal pulse width
   –   abruptness of closure of the glottal flow
   – aspiration noise
• Frequency domain
   – fundamental frequency (F0)
   – spectral tilt (slope)
   – harmonic richness
            Glottal Pulse and Voice Quality
• Glottal pulse shape plays an important role in the quality of
  natural or synthesized vowels [Rosenberg 1971]
    – The shape and periodicity of vocal cord excitation are subject to
      large variation
    – Such variations are significant for preserving the naturalness
      of speech
    – A typical glottal pulse: asymmetric, with a shorter falling phase;
      spectrum with -12 dB/octave decay



• More variation among different speakers than among different
  utterances of the same speaker [Mathews 1963]

• Such variations have little significance for speech
  intelligibility but affect the perceived vocal quality [Childers
  1991]
                   Various Glottal Pulses
• Some other vocal types: breathy, falsetto, vocal fry

• Temporal and spectral characteristics
                     Some Comments

• Generally, studying the glottal pulse characteristics requires
  reconstructing the glottal pulse waveform by inverse filtering

• Automatic and exact reconstruction of the glottal waveform
  from real speech is almost impossible, especially during the
  transient phases of articulation or for high-pitched speakers

• Fortunately, it is possible to estimate the glottal spectrum
  from the residual signal with pitch prediction
                        Linear Prediction

• Speech waveform: the current sample is correlated with past
  samples and is therefore predictable

• Short-term correlation (P: prediction order):

      $s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$

       • Occurs within one pitch period
       • Formant modulation
       • Classical linear prediction analysis (short-term prediction)


• Long-term correlation (p: pitch period in samples):

      $u(n) \approx b\, u(n-p)$

       • Occurs across consecutive pitch periods
       • Vocal cord vibration
       • Long-term/pitch prediction
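
The short-term predictor coefficients a_k can be estimated in several ways; the slides do not say which one was used. The sketch below assumes the autocorrelation method with the Levinson-Durbin recursion, a common choice, and then computes the residual u(n) = s(n) - sum_k a_k s(n-k). The function names are hypothetical.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a frame (no windowing, for brevity)."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_P in s(n) ~ sum_k a_k s(n-k)."""
    a = [0.0] * (order + 1)      # a[0] is unused; a[k] is the k-th coefficient
    err = r[0]                   # prediction error energy
    for i in range(1, order + 1):
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err

def short_term_residual(s, a):
    """u(n) = s(n) - sum_{k=1}^{P} a_k s(n-k), taking s(n) = 0 for n < 0."""
    order = len(a)
    return [s[n] - sum(a[k - 1] * s[n - k] for k in range(1, order + 1) if n - k >= 0)
            for n in range(len(s))]
```

For a decaying exponential s(n) = 0.9^n, a first-order analysis recovers a_1 close to 0.9 and the residual is essentially a single impulse at n = 0.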
                           Linear Prediction
• Short-term predictor <classical linear prediction>

      $A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$

   – Removes the short-term correlation, leaving a glottal excitation signal

      $u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$

• Long-term predictor <pitch prediction>

      $P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}$

   – Removes the correlation across consecutive pitch periods

      $v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$

   [Block diagram: s(n) -> short-term predictor, which subtracts
    $\sum_{i=1}^{M} a_i z^{-i}$ (order written M in the diagram) -> u(n)
    -> long-term predictor, which subtracts
    $\sum_{k=-1}^{1} b_k z^{-(p+k)}$ -> v(n)]
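
As an illustration of the long-term predictor, the sketch below fits the three taps b_{-1}, b_0, b_1 by least squares for a given pitch lag p and then forms v(n) = u(n) - sum_k b_k u(n-p-k). The least-squares fit over a whole frame and the assumption that p is already known are choices made here, not details taken from the slides; the function names are hypothetical.

```python
def solve_linear(A, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def pitch_predictor(u, p):
    """Least-squares [b_-1, b_0, b_1] minimising sum_n v(n)^2 for lag p."""
    taps = (-1, 0, 1)
    ns = range(p + 1, len(u))    # samples with a full history u(n-p-1)
    phi = [[sum(u[n - p - j] * u[n - p - k] for n in ns) for k in taps] for j in taps]
    c = [sum(u[n] * u[n - p - j] for n in ns) for j in taps]
    return solve_linear(phi, c)

def pitch_residual(u, p, b):
    """v(n) = u(n) - b_-1 u(n-p+1) - b_0 u(n-p) - b_1 u(n-p-1)."""
    v = list(u[:p + 1])          # not enough history yet: pass samples through
    for n in range(p + 1, len(u)):
        v.append(u[n] - (b[0] * u[n - p + 1] + b[1] * u[n - p] + b[2] * u[n - p - 1]))
    return v
```

If u(n) repeats with period p and each period is scaled by 0.8, the fit returns b_0 close to 0.8 with the side taps close to zero, and v(n) vanishes after the first period.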
                 Linear Prediction: An Example

[Figure: waveforms of s(n), u(n) and v(n) over samples 0-800, shown with
 their spectra: intensity (dB) vs. frequency (0-4000 Hz) and power
 spectrum magnitude (dB) vs. normalized frequency (0-1).]
Examples of glottal spectra estimated by pitch prediction

[Figure: four example spectra, each plotted over a horizontal axis from 0 to 500.]
                   Harmonic Structure of Glottal Spectrum

         • Two parameters describing the harmonic structure
            – Harmonic richness factor and Noise-to-harmonic ratio


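         • Harmonic richness factor (HRF)

             $\mathrm{HRF}_n = 10 \log \dfrac{\sum_{H_i \in B_n} H_i}{H_1}$

         • Noise-to-harmonic ratio (NHR)

             $\mathrm{NHR}_n = 10 \log \dfrac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$

[Figure: glottal spectrum over 0-2000 Hz with the harmonic peaks H_i and
 the noise points N_i (marked o) labelled.]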
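
The two ratios can be sketched as follows, assuming the harmonic peaks H_i and noise points N_i have already been picked from the glottal spectrum. The rectangular band test, the base-10 logarithm (the usual dB convention), and the function name are assumptions for illustration; the slides use Mel-scale band-pass filtering for the bands B_n.

```python
import math

def hrf_nhr(harmonics, noise, band):
    """HRF_n and NHR_n for one band B_n.

    harmonics, noise: lists of (frequency_hz, magnitude) pairs;
    harmonics[0] is the first harmonic H_1.
    band: (low_hz, high_hz), treated as a rectangular band (an assumption).
    """
    lo, hi = band
    h1 = harmonics[0][1]
    h_sum = sum(m for f, m in harmonics if lo <= f < hi)
    n_sum = sum(m for f, m in noise if lo <= f < hi)
    hrf = 10.0 * math.log10(h_sum / h1)
    nhr = 10.0 * math.log10(n_sum / h_sum)
    return hrf, nhr
```

A band that contains no harmonic or no noise point makes a ratio zero or undefined, so a real implementation would need to guard those cases.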
                                 Feature Generation

• Acoustic features include the following:
   – Fundamental frequency F0
   – Pitch prediction gain g

         $g = 10 \log \dfrac{\sum_n u^2(n)}{\sum_n v^2(n)}$

   – Pitch prediction coefficients b_{-1}, b_0, b_1
   – HRF_n and NHR_n <n=1:10>
        • 10-band Mel-scale filter bank


• Feature generation process

     s(n) -> S-T prediction -> u(n) -> L-T prediction on every pitch period
          -> p, g, b_i -> G(z) -> G(f) -> Mel-scale band-pass filtering
          -> HRF_n, NHR_n, n = 1, 2, …
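
Two of these quantities are easy to sketch: the pitch prediction gain g as an energy ratio in dB, and the edges of a 10-band Mel-scale bank. The mel formula below (2595 log10(1 + f/700)), the 4000 Hz upper limit, and the function names are assumptions; the slides do not give the filter bank design.

```python
import math

def prediction_gain_db(u, v):
    """g = 10 log10( sum_n u(n)^2 / sum_n v(n)^2 ) over one analysis frame."""
    e_u = sum(x * x for x in u)
    e_v = sum(x * x for x in v)
    return 10.0 * math.log10(e_u / e_v)

def hz_to_mel(f):
    # One widely used mel-scale formula (an assumption here).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands=10, f_max=4000.0):
    """Edges in Hz of n_bands bands of equal width on the mel scale."""
    m_max = hz_to_mel(f_max)
    return [mel_to_hz(m_max * i / n_bands) for i in range(n_bands + 1)]
```

A larger g means the pitch predictor removed more energy from u(n), which is what a strongly periodic (voiced) frame produces.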
                Experimental Conditions

• Speech quality: telephone speech

• Subjects: 49 male speakers

• Training condition:
   – 3 training sessions, about 90 s of speech in total, over 3~6 weeks
   – GMM with 128 mixture components

• Testing condition:
   – 12 testing sessions over 4~6 months
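
GMM-based identification is usually carried out by scoring the test features against each speaker's model and picking the best score. The sketch below assumes diagonal covariances and maximum total log-likelihood as the decision rule; neither detail is stated in the slides, and the function names and model layout are hypothetical.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(ll)
        m = max(comp)            # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total

def identify(frames, speaker_models):
    """Return the speaker whose GMM gives the highest log-likelihood.

    speaker_models maps a speaker id to a (weights, means, variances) tuple.
    """
    return max(speaker_models,
               key=lambda spk: gmm_log_likelihood(frames, *speaker_models[spk]))
```

Training the models (for example with the EM algorithm) is outside the scope of this sketch.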
           Speaker Recognition Experiments

• Identification results with long-term-prediction-related features

        Feature        F0     g      [b-1 b0 b1]    HRF    NHR
        Iden. rate     18%    11%    14%            32%    17%

• Comparison of glottal source features with classical features

        Features                 Identification error rate (%)
        Fgs: F0_g_HRF_NHR25      52%
        LPCC_D_A36               2.84
        LPCC_D_A+Fgs             2.26
        MFCC_D_A                 2.1
        MFCC_D_A+Fgs             1.9
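
One way to read the comparison is as a relative reduction in identification error when Fgs is appended to a classical feature set. The short calculation below uses only the error rates listed above; the helper name is hypothetical.

```python
def relative_error_reduction(baseline_err, combined_err):
    """Percentage of the baseline error removed by adding the extra features."""
    return 100.0 * (baseline_err - combined_err) / baseline_err

lpcc_gain = relative_error_reduction(2.84, 2.26)   # LPCC_D_A -> LPCC_D_A+Fgs
mfcc_gain = relative_error_reduction(2.1, 1.9)     # MFCC_D_A -> MFCC_D_A+Fgs
print(round(lpcc_gain, 1), round(mfcc_gain, 1))
```

This works out to a relative reduction of about 20% for the LPCC-based system and a little under 10% for the MFCC-based system.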
                           Summary

• Glottal source excitation is important for the perceived
  naturalness of voice quality and helps distinguish one
  speaker from another.

• Linear prediction is a powerful tool for speech analysis. The
  spectral properties of the supraglottal vocal tract system can be
  estimated by short-term prediction, while long-term
  prediction estimates the spectrum of the glottal excitation
  system.

• Recognition results show that the glottal-source-related acoustic
  features (F0, prediction gain, HRF, NHR, etc.) provide a certain
  degree of speaker-discriminative power.
                  Other Applications


• Speech coding

• Speech recognition?

• Speech emotion recognition!
Thank You!

				