Bioinformatic Voice Applications:
Speaker Recognition and Verification
Andrew Rosenberg
Biometric Seminar Day
August 23, 2010




Outline

• Biometrics and Voice


• What can the Voice tell us about a Speaker


• Representing Speech


• Modeling Speakers


  • Gaussian Mixture Model


  • Universal Background Model




Biometrics and Voice

• Applications of Voice Biometrics


   • Speaker Verification
     Are you who you say you are?


   • Speaker Recognition
     Who are you?


   • Diagnoses of Medical Pathologies and other Speaker States
     The voice can tell us other things about a speaker




Advantages of Voice Biometrics

• Minimally Intrusive


• Cheap Mechanisms to Collect Speech Data


• Established, low-risk, legal eavesdropping scenarios




Biometrics and Voice

• How does speech carry biometric information?


• How is speech produced?


   • Articulators


   • Vocal Tract


• First Language and Regional Influences


• Speech Pathologies


• Individual Differences

Production of Speech






• Video example (not reproduced): “It’s ten below outside”




From the Queen’s University Speech Production and Perception Laboratory http://psyc.queensu.ca/~munhalik/index.html


• Video example (not reproduced): “Why did Ken set the soggy net on top of his deck?”




From the Queen’s University Speech Production and Perception Laboratory http://psyc.queensu.ca/~munhalik/index.html
Influences of Native Tongue

• Negative Language Transfer


• When speaking in a non-native tongue, speakers will use some characteristics
  from their native tongue.


  • Very common in pronunciation
    /r/ vs. /l/ in Japanese and Chinese


  • Cognates and false cognates
    “elektrisch” = electric
    “embarazada” ≠ embarrassed


  • Limited evidence of language transfer regarding grammar and word choice.


Assessment and Monitoring of Medical Problems

• How well is a patient coping with cancer treatment?
  Zellerman (2002)


• Is a patient clinically depressed?
  Alpert (2001) Moore (2003) Mundt (2007)


• Diagnosis of Schizophrenia through word choice
  Elvevåg (2007 & 2009)


• Autism Spectrum Disorders demonstrated through lexical effects and “flat”
  prosody
  Rapin & Dunn (2003) Mesibov (1992) Le Normand (2008) Van Santen (2009)




Automatic Detection of Pathological Speech

• Apraxia
  Green (2004) Shriberg (2004)


• Spasmodic Dysphonia & Muscular Tension Dysphonia
  Schlotthauer (2006)


• Stuttering
  Howell (1997) Czyzewski (2003)


• Parkinson’s
  Little (2008) Hammen (1989)


• Dyslexia
  Schulte-Körne (1999)

Speaker Verification

• Are you who you say you are?


• Security Applications
   • Banking
   • Restricted Facility Entry
   • Forensics


• Compare stored speech against test speech


• Statistical modeling




Text Dependent vs. Text Independent

• Text Dependent

  • Everyone says the same short phrase

• Text Independent

  • Speakers say whatever they want.

  • Typically no impact of the words that are said



• Text Dependent approaches have higher performance

• Text Independent approaches are more widely applicable




Speaker Verification Schematic Pipeline

• Training

   • speech data + known speaker identity → Speech Parameterization → Statistical Modeling → speaker model

• Testing

   • speech data → Speech Parameterization

   • claimed speaker identity → Statistical Models

   • Both feed Score Normalization → Accept / Reject
Representation of Speech

• Mel-Frequency Cepstral Coefficients



   windowing → FFT → Filter Bank → Cepstral Transform (DCT)




• Typically taken every 10ms


• Often 20 coefficients


• Also include ∆ and ∆∆ in the feature vector, for a vector of 60 elements
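
The four stages above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the talk: the 25 ms Hamming window, 512-point FFT, 26 mel filters, the logarithm taken before the DCT, and the simple two-frame difference used for ∆ and ∆∆ are all choices made here for the example. Only the 10 ms step, the 20 coefficients, and the 60-element result come from the slide.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge of the triangle
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_fft=512, n_filters=26, n_coeffs=20):
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)               # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft    # FFT
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T  # filter bank
    log_energies = np.log(energies + 1e-10)
    # Cepstral transform: DCT-II basis applied to the log filter-bank energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n[None, :] + 1)
                   / (2 * n_filters))
    return log_energies @ basis.T

def delta(features):
    """Centered difference along time (a simple stand-in for regression deltas)."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def mfcc_with_deltas(signal, sample_rate=16000):
    c = mfcc(signal, sample_rate)
    d = delta(c)
    return np.hstack([c, d, delta(d)])     # 20 + 20 + 20 = 60 per frame
```

Calling mfcc_with_deltas on one second of 16 kHz audio returns one 60-element row per 10 ms step.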

Gaussian Model

• Gaussian Model or Normal Distribution

  • Common and Easy to Work With

  • Has 2 parameters: mean, variance (or standard deviation)
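
For reference, the density that these two parameters define can be written as follows. The symbols (μ for the mean, σ² for the variance) are notation assumed here, not taken from the slides.

```latex
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```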




Gaussian Models in Higher Dimensions

• Normal Distributions in higher dimensions require slightly more complicated
  math, but operate identically


• Two parameters: A mean vector with d elements, a d-by-d covariance matrix.
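
In d dimensions the density takes the form below, with μ the mean vector and Σ the d-by-d covariance matrix (again, the symbols are assumed notation). Setting d = 1 recovers the one-dimensional case.

```latex
p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
  \frac{1}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top}
  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```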




Training a Gaussian Model

• The Gaussian model that best fits a set of data (in the maximum-likelihood
  sense) uses the sample mean and sample standard deviation of that data.


   • Can be proven with calculus, but we’re not going to today.
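
As a small illustration of that result, the sketch below fits a Gaussian to data by computing the sample mean and the sample covariance. The function name fit_gaussian is hypothetical, and the covariance divides by N (the maximum-likelihood estimate) rather than N - 1.

```python
import numpy as np

def fit_gaussian(data):
    """Maximum-likelihood Gaussian fit for data of shape (N, d)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / len(data)   # ML estimate divides by N
    return mean, cov

# Four points on the corners of a square centered at (1, 1).
mean, cov = fit_gaussian([[0, 0], [2, 0], [0, 2], [2, 2]])
```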




Gaussian Mixture Model

• But a lot of data is not actually normally distributed.


• A Mixture of Gaussian Models (GMM) allows us to add contributions from a
  number of Gaussians to best fit the data.
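
Written out, a mixture of K Gaussians is a weighted sum of Gaussian densities. The weights w_k and the component count K are notation assumed here for illustration.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} w_k \,
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```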




Modeling with a Gaussian Mixture Model

• Fitting a GMM to data.
   • There is no closed-form solution for the best parameterization of a GMM.
• Expectation-Maximization
   • Powerful iterative optimization approach.
   • Can be slow
   • Can fall into local optima
   • Algorithm:
      • Initialize
      • Assign points to mixture components (expectation step)
      • Re-estimate mixture parameters (maximization step)
      • Repeat until convergence
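
The loop above can be sketched as follows for a GMM with diagonal covariances. This is a minimal sketch, not a production trainer: the diagonal-covariance restriction, the farthest-point initialization, the variance floor of 1e-6, and the names fit_gmm and log_gauss_diag are all assumptions made for the example. Points are assigned softly, as responsibilities, rather than to a single component.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of every row of X under every diagonal Gaussian: shape (N, K)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                   + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def fit_gmm(X, n_components, n_iter=100, tol=1e-6, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize: one random point, then repeatedly the point farthest from
    # the means chosen so far; shared global variance; uniform weights.
    chosen = [X[rng.integers(n)]]
    for _ in range(1, n_components):
        dist = np.min([np.sum((X - m) ** 2, axis=1) for m in chosen], axis=0)
        chosen.append(X[np.argmax(dist)])
    means = np.array(chosen)
    variances = np.tile(X.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: soft assignment of each point to each component.
        log_joint = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
        # Maximization step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
        # Repeat until the log-likelihood stops improving.
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances
```

Because EM can fall into local optima, the result depends on the initialization; running it from several seeds and keeping the fit with the highest likelihood is a common precaution.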

Score normalization

• What does a score of .0005 mean?


• At what score should a system accept a user’s claim that they are who they
  say they are?


• We want to compare the likelihood that a speaker is who they say they are to
  the likelihood that they are another speaker.


• Universal Background Model




Speaker Verification with UBM Score Normalization

• For each speaker we have a GMM
  representing their voice.


• Additionally, we have one UBM-GMM
  that represents “speech” generally.
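
One common way to combine the two models is a log-likelihood ratio: score the test speech under the claimed speaker's GMM, subtract its score under the UBM, and accept if the difference exceeds a threshold. The sketch below assumes diagonal-covariance GMMs stored as (weights, means, variances) tuples; the function names and the default threshold of 0 are hypothetical, and in practice a threshold would be tuned on held-out data.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                       + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_frame = np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)
    return float(log_frame.mean())

def llr_score(X, speaker_gmm, ubm):
    """Positive when the claimed speaker's model explains X better than the UBM."""
    return gmm_avg_loglik(X, *speaker_gmm) - gmm_avg_loglik(X, *ubm)

def verify(X, speaker_gmm, ubm, threshold=0.0):
    """Accept the identity claim if the normalized score clears the threshold."""
    return llr_score(X, speaker_gmm, ubm) > threshold
```

Unlike a raw likelihood such as .0005, the ratio has a natural reference point: zero means the speaker model and the background model explain the speech equally well.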




Speaker Recognition

• Given speech from an unknown speaker can you tell me who it is?


• Requires some known material from the person in question.


• Now no longer a binary (True vs. False) question. Now a 1-of-N problem.
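
A minimal sketch of the 1-of-N decision, assuming each enrolled speaker has a diagonal-covariance GMM stored as a (weights, means, variances) tuple: score the test speech under every model and pick the highest-scoring speaker. The function names and the speaker labels in the usage lines are hypothetical.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                       + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_frame = np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)
    return float(log_frame.mean())

def identify(X, speaker_models):
    """Return the best-scoring speaker name and the per-speaker scores."""
    scores = {name: gmm_avg_loglik(X, *gmm) for name, gmm in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Two enrolled speakers, each modeled here by a single 1-D Gaussian component.
models = {
    "speaker_a": (np.array([1.0]), np.array([[2.0]]), np.array([[1.0]])),
    "speaker_b": (np.array([1.0]), np.array([[-2.0]]), np.array([[1.0]])),
}
best, scores = identify(np.full((5, 1), 1.5), models)
```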




Speaker Recognition Overview

• Training

   • speech data + known speaker identity → Speech Parameterization → Statistical Modeling → speaker model

• Testing

   • speech data → Speech Parameterization

   • claimed speaker identity → Statistical Models

   • Both feed Score Normalization → Speaker Prediction
State-of-the-art Speaker Verification

• The GMM and UBM approach described so far works well.


• More recent work has significantly improved on it.


• Rather than model a speaker directly
  ... model how the speaker differs from the average speaker (UBM).


• How can we do this?


• Move the UBM to best fit the new speaker.




Maximum A Posteriori Adaptation

• Update the UBM model parameters to best fit the new speaker data.
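
One widely used form of this update adapts only the component means: each UBM mean is pulled toward the new speaker's data in proportion to how much data that component explains. The sketch below assumes that mean-only variant with a relevance factor of 16; the slides do not specify which parameters are adapted or what relevance factor is used, so both are assumptions, as are the function names.

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Posterior probability of each UBM component for each frame: shape (N, K)."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                       + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_joint = np.log(weights)[None, :] + log_comp
    log_norm = np.logaddexp.reduce(log_joint, axis=1)
    return np.exp(log_joint - log_norm[:, None])

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to speaker data X."""
    resp = responsibilities(X, weights, means, variances)
    n_k = resp.sum(axis=0)                               # soft frame counts
    data_mean = (resp.T @ X) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]           # data-dependent mixing
    # Components that saw little data stay close to the UBM mean.
    return alpha * data_mean + (1.0 - alpha) * means
```

With 16 frames assigned to a component and a relevance factor of 16, the adapted mean lands halfway between the UBM mean and the mean of the new data.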




Maximum A Posteriori Adaptation

• Store the transformation (or new value) of each parameter.

• Construct a new feature vector.

• Classifier using SVM (or another classifier)

• “Supervectors”
  Feature vectors of model parameters rather than speech features.

• Pipeline: speech representation (MFCC) + UBM → MAP → supervectors → Classifier (SVM)
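
As a small sketch of the idea, a supervector can be built by flattening the adapted component means into one long vector, so that each utterance becomes a single fixed-length row for a classifier such as an SVM. Using only the means, and the function names below, are assumptions made for the example.

```python
import numpy as np

def supervector(adapted_means):
    """Flatten a (K, d) matrix of adapted component means into one K*d vector."""
    return np.asarray(adapted_means, dtype=float).reshape(-1)

def supervector_matrix(per_utterance_means):
    """Stack one supervector per utterance into an (n_utterances, K*d) matrix."""
    return np.stack([supervector(m) for m in per_utterance_means])

# Two utterances, each adapted from a UBM with K = 4 components in d = 3 dimensions.
features = supervector_matrix([np.zeros((4, 3)), np.ones((4, 3))])
```

The length of the supervector depends only on the UBM size and the feature dimension, not on how long the utterance was.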


UBM-MAP Overview

• Training

   • speech data + known speaker identities → Speech Parameterization → UBM-MAP → training supervectors → SVM Training

• Testing

   • speech data → Speech Parameterization → UBM-MAP → SVM Testing → Speaker Prediction

Limitations of Current Speaker Verification and Adaptation

• Requires training material from the target speaker.


• Can be slow to train.


• Best performance with Text-Dependent approaches




Summary of Voice Biometrics

• Speech carries speaker-specific information
   • Physiology
   • Native Language Interference
   • Personality
   • Speaker State
   • Idiosyncrasies
• Speech is an attractive biometric option.
   • Inexpensive technology requirements
   • Minimally intrusive
   • Low-risk surveillance

• GMM modeling is a powerful way to statistically model a speaker’s voice for
  recognition and verification.

   • >85-95% classification accuracy

Questions?
Feel free to email: andrew@cs.qc.cuny.edu



