					Recommendations Based on
     Speech Classification
       (and examples of what recommender
  systems can learn from signal processing)

                                       Christian Müller
                 German Research Center for Artificial Intelligence
            International Computer Science Institute, Berkeley, CA
                            Overview
 Speech as a source of information for non-intrusive
  user modeling

 Speech/signal processing            Take-away messages

 Vocal aging -> features             Knowledge-driven
  for speaker age recognition         feature selection
 GMM/SVM supervector approach        Classification methods
  for acoustic speech features        for independent “bag of
                                      observations” features
 Detection task and                  Valid application-
  pseudo-NIST evaluation procedure    independent evaluation
 Rank and polynomial                 Feature space warping
  rank normalization                  normalization

 Conclusions
Speech as a Source for Non-Intrusive UM
 Example system prompt: “Now it’s time to get to gate 38.”

 Information about the user
    inference from sensors (not intrusive)
     e.g. speaker classification (speech = sensor)
    explicit statement (intrusive)

 Information about the user -> user model -> adaptive speech dialog system

 The adaptive speech dialog system
    A: adapts its dialog behavior
     (e.g. detailed map with shops vs. arrows)
    B: provides recommendations
     (e.g. a different route to the gate)
        Speaker Classification Systems
 Audio segment (telephone quality) -> System -> classification of:

   Cognitive Load    Best Research Paper Award, UM 2001
   Age and Gender    Voice Award 2007; Telekom live operation 2009
   Language          14 languages + dialects; NIST evaluation 2007
   Identity          Project with BKA 2009; NIST* Evaluation 2008
   Acoustic Events   Project with VW 2008; Interspeech 2008
      Recommendations Based on
         Speech Classification
 Columns (what is recommended): products, media, services, actions, strategies
 Rows (speaker characteristics): age, gender, emotions, language, dialect,
  accent, identity, acoustic events
Product Recommendations
Based on Age and Gender








                                                         AM
       Michael Feld and Christian Müller. Speaker Classification for Mobile Devices.
       In Proceedings of the 2nd IEEE International Interdisciplinary Conference on
       Portable Information Devices (Portable 2008). 2008
 How can you find features for
  building your models by explicitly
  studying the underlying phenomena?
 Proposing knowledge-driven
  feature selection, using the example of
  features for speaker age
  recognition
   Speaker Classification as an
Interdisciplinary Area of Research
   Which are the manifestations of age (and gender) in the
    speaker’s voice and speaking style?
   How can the age (and gender) of a speaker be automatically
    recognized?
   Which are the requirements of a classification system on
    the implementation layer, and how can they be solved?

 Speaker Classification draws on:
    Speech Technology / Artificial Intelligence
    Phonetics / Voice Pathology
    Software Technology
Impact of Aging on the Human Speech
             Production
Speech breathing
effects:

lower expiratory volume

more speech pauses

lower amplitude


      thorax
               stiffer


       lungs
               lighter
               less elastic
               lower position
Impact of Aging on the Human Speech
             Production
laryngeal area

effects:
rise of fundamental frequency (in men)
reduced voice quality



          larynx    calcification and ossification


      vocal folds   loss of tissue
                    stiffening
Impact of Aging on the Human Speech
             Production
supralaryngeal area
facial bones and
muscles
                   degeneration
                   reduced elasticity


effects:
imprecise articulation
for example vowel centralization
Impact of Aging on the Human Speech
             Production
neurological
effects
loss of tissue in the cortex
reduced performance of the neuronal transmitters


effects:
reduced articulation rate
defective coordination between the articulators
vowel centralization
Development of F0 in Men / Women

 [Figure: F0 (Hz, axis from 90 to 170) over age in years (20 to 90), with
  curves for men and women; one curve is labeled "only non-smokers", another
  "smokers and non-smokers". Source: Linville (2001)]
            Age Classes
            Female   Male   Age
 Children   CF       CM     <= 13 years
 Youth      YF       YM     14 - 19 years
 Adults     AF       AM     20 - 64 years
 Seniors    SF       SM     >= 65 years
                           Features

fundamental frequency (pitch)
mean                            pitch_mean

standard deviation              pitch_stddev

min, max and difference         pitch_min / pitch_max / pitch_diff

voice quality
shimmer                         shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp

jitter                          jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp

harmonics-to-noise-ratio        harm_mean / harm_stddev


articulation rate               ar_rate

speech pauses                   pause_num / pause_dur
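
As a rough illustration of how some of the listed features could be computed, the following minimal sketch derives the pitch statistics and a local jitter value from a sequence of glottal period durations. It assumes the periods have already been extracted by a pitch tracker. The function names are hypothetical, and the jitter definition used here (mean absolute difference of consecutive periods divided by the mean period) is a common one that is only assumed to correspond to jitt_l.

```python
from statistics import mean, pstdev

def pitch_features(periods):
    """Pitch statistics in Hz from glottal period durations given in seconds."""
    f0 = [1.0 / p for p in periods]
    return {
        "pitch_mean": mean(f0),
        "pitch_stddev": pstdev(f0),   # population standard deviation
        "pitch_min": min(f0),
        "pitch_max": max(f0),
        "pitch_diff": max(f0) - min(f0),
    }

def local_jitter(periods):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return mean(diffs) / mean(periods)

# Made-up example: a voice at roughly 100 Hz with slightly irregular periods.
periods = [0.0100, 0.0102, 0.0098, 0.0101, 0.0099]
features = pitch_features(periods)
features["jitt_l"] = local_jitter(periods)
```

A perfectly periodic signal yields a jitter of zero; larger cycle-to-cycle variation increases the value.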
                           Features (grouped)

voice
  fundamental frequency (pitch)
    mean
    standard deviation
    min, max and difference
  voice quality
    shimmer
    jitter
    harmonics-to-noise-ratio

speaking style
  articulation rate
  speech pauses
                               Example Results

 [Figures: per-class results for the eight classes CF, CM, YF, YM, AF, AM, SF, SM,
  with panels for jitter, fundamental frequency (F0), and speech pauses.
  Class groupings shown in the legend: C_YF, AF, SF, YM_AM_SM.]

 high jitter value = low voice quality

 Christian Müller. Zweistufige kontextsensitive
 Sprecherklassifikation am Beispiel von Alter und Geschlecht
 [Two-layered Context-Sensitive Speaker Classification on the
 Example of Age and Gender]. AKA, Berlin, 2006
          Hierarchical Feature Model
 High-level features
 (learned characteristics)
   semantics    ?
   dialog
   idiolect     e.g. <s> how shall I say this <c> <s> yeah I know...
   phonetics    e.g. /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
   prosody
   spectrum
 Low-level features
 (physical characteristics)
 How can your features be modeled
  assuming that they
    are multi-dimensional
    represent repeated observations of the
     same kind
    can be assumed to be independent
     (“bag” of observations)
 Proposing the GMM/SVM
  Supervector Approach, using the
  example of frame-by-frame
  acoustic features
      General Classification Scheme

 Preprocessing        e.g. channel compensation (not addressed in this talk)
 Feature Extraction
 Classification       e.g. multilayer perceptron networks, support-vector machines
 Fusion               Top-Down-Knowledge

 [Figure: small multilayer perceptron with inputs x1, x2, hidden units y1, y2,
  output zk, and weights wji, wkj]
Modeling Acoustics and
      Prosodics
   semantics    ?
   dialog
   idiolect     e.g. <s> how shall I say this <c> <s> yeah I know...
   phonetics    e.g. /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
   prosody
   spectrum

 no ASR
Generative Approach: Gaussian Mixture Model
                  (GMM)
 Training:
   frames of speech labeled “emergency vehicle” -> feature extraction
   -> probability density -> “emergency vehicle” model

 Test:
   unknown frames of speech (?) -> feature extraction -> “emergency vehicle” model
   -> avg. likelihood over all frames for class “emergency vehicle”
Generative Approach: Gaussian Mixture Model
                  (GMM)

 Test with background model:
   unknown frames of speech (?) -> feature extraction
   -> “emergency vehicle” model and background model
   -> avg. log likelihood ratio over all frames for class “emergency vehicle”
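
The scoring step with a background model can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance GMMs represented as lists of (weight, means, variances) tuples. The model parameters, the data, and the function names are made up for the example and are not taken from the system described in the talk.

```python
import math

def gmm_log_likelihood(frame, gmm):
    """Log density of one feature frame under a diagonal-covariance GMM."""
    log_terms = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for x, mu, var in zip(frame, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        log_terms.append(log_p)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def avg_llr(frames, class_gmm, background_gmm):
    """Average per-frame log likelihood ratio of class model vs. background model."""
    ratios = [gmm_log_likelihood(f, class_gmm) - gmm_log_likelihood(f, background_gmm)
              for f in frames]
    return sum(ratios) / len(ratios)

# Toy two-dimensional models with made-up parameters.
siren = [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 1.0], [1.0, 1.0])]
background = [(1.0, [0.0, 0.0], [4.0, 4.0])]
near_siren = [[2.1, 1.9], [2.9, 1.2], [2.5, 1.5]]
near_origin = [[0.1, -0.2], [-0.3, 0.0], [0.2, 0.4]]

score_match = avg_llr(near_siren, siren, background)
score_mismatch = avg_llr(near_origin, siren, background)
```

Because the score is an average over frames, segments of different lengths produce comparable values. Frames that fit the class model better than the background model give a positive score.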
      A Mixture of Gaussians




 Means, variances, and mixture weights are
  optimized in training
 Black line = mixture of 3 Gaussians
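
A one-dimensional mixture of three Gaussians can be written down directly. The sketch below uses made-up weights, means, and variances (no training is performed) and checks numerically that the mixture is still a probability density.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

weights = [0.3, 0.5, 0.2]      # mixture weights, must sum to 1
means = [-2.0, 0.0, 3.0]
variances = [0.5, 1.0, 2.0]

# Riemann sum over [-15, 15] approximates the area under the mixture curve.
step = 0.01
area = sum(mixture_pdf(-15 + i * step, weights, means, variances) * step
           for i in range(3001))
```

Since the weights sum to one, the area under the mixture comes out close to one as well.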
   Discriminative Method:
Support Vector Machine (SVM)
 Training:
   “em. vehic.” (1) and “not em. vehic.” (-1) examples -> feature extraction
   -> “em. vehic.” model

    Features are transformed into a higher-dimensional space where the problem
     is linear
    Discriminating hyperplane is learned by maximizing the margin
    Trade-off between training error and width of margin
    Model is stored in form of “support vectors” (data points on the margin)
       Discriminative Method:
    Support Vector Machine (SVM)
                               test

 ? -> feature extraction -> score (distance to hyperplane)

     Discriminative methods have been shown to be superior to generative
      methods for similar tasks
     Feature vectors have to be of the same length (sensitive to variable
      segment lengths)
     Solutions:
        feature statistics calculated over the entire utterance
        fixed portion of the segment
        sequential kernels
GMM/SVM Supervector Approach

               feature
              extraction


                                 Gaussian means
                                 (MAP adapted)

  Combines discriminative power of SVMs with length
   independence of GMMs
  Very successful on similar tasks such as speaker
   recognition
  GMM is trained using MAP adaptation
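
One way to picture the supervector is sketched below: the means of a background GMM are MAP-adapted to the frames of a segment and then stacked into one fixed-length vector. The relevance-factor form of the update, the value 16, and all names and data are assumptions made for illustration. The talk does not state which adaptation variant was used.

```python
import math

def posteriors(frame, ubm):
    """Responsibility of each background-model component for one frame."""
    logs = []
    for weight, means, variances in ubm:
        lp = math.log(weight)
        for x, mu, var in zip(frame, means, variances):
            lp += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        logs.append(lp)
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapted_supervector(frames, ubm, relevance=16.0):
    """MAP-adapt only the means to the frames and stack them into one vector."""
    dim = len(ubm[0][1])
    counts = [0.0] * len(ubm)
    sums = [[0.0] * dim for _ in ubm]
    for frame in frames:
        for k, gamma in enumerate(posteriors(frame, ubm)):
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * frame[d]
    supervector = []
    for k, (_, means, _) in enumerate(ubm):
        alpha = counts[k] / (counts[k] + relevance)  # more data, more adaptation
        for d in range(dim):
            data_mean = sums[k][d] / counts[k] if counts[k] > 0 else means[d]
            supervector.append(alpha * data_mean + (1 - alpha) * means[d])
    return supervector

# Toy background model with two components in two dimensions.
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
short_utt = [[0.5, 0.5]] * 10
long_utt = [[0.5, 0.5]] * 200 + [[5.5, 4.5]] * 50
sv_short = map_adapted_supervector(short_utt, ubm)
sv_long = map_adapted_supervector(long_utt, ubm)
```

Both segments yield a vector of length (number of components) x (feature dimension) regardless of how many frames they contain, which is what makes the representation usable as SVM input.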
                            Evaluation Results (DCF)

 Condition     GMM-UBM   GMM-SVM
 entire set    14.58     10.22
 matched        8.09      3.45
 unmatched     23.41     19.55

      Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection
      for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
 How can you evaluate your multi-
  class models independently from
  the given application?
 How can you establish an
  appropriate evaluation procedure
  in order to obtain valid results?
 Proposing the detection task and
  the “pseudo NIST” evaluation
  procedure, using the example of
  acoustic event detection and
  speaker age recognition.
                Background
 With multi-class recognition problems, many
  test and analysis methods are very application
  specific.
    E.g. confusion matrices.
    We want a method that allows results to be
     generalized across a large set of applications.
 With home-grown databases, parameter
  tuning on the evaluation set often
  compromises the validity of the
  results/inferences.
    We want a fair “one shot” evaluation.
          The Detection Task


 emergency vehicle ?  ->  system  ->  yes, 1.324326

 Given
    a speech segment (s)
    and an acoustic event to be detected (target event, E_T),
 the task is to decide whether E_T is
  present in s (yes or no).
 The system's output shall also contain a score
  indicating its confidence, with more positive
  scores indicating greater confidence.
            Terminology
 Segment class
   e.g. segment event, segment age-class.
   ground truth (not known).
 Target
   the hypothesized class.
 Trial
   a combination of segment and target.
                      Evaluation

 Trial (target)         System decision   Score
 emergency vehicle ?    yes                1.32432
 music ?                no                -0.3212
 talking ?              no                 1.8463
 laughing ?             no                -2.5773
 phone ?                yes                0.00132
 no event ?             no                 2.20122

 The system performance is evaluated by presenting it
  with a set of trials.
 Each test segment is used for multiple trials.
 The absence of all targets is explicitly included.
                   Type of Errors
 segment “em. vehic.”, target “em. vehic.” ?  ->  system: no   ->  “MISS”

 segment “em. vehic.”, target “phone” ?       ->  system: yes  ->  “FALSE ALARM”
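
The two error types can be counted over a list of trials as in the following minimal sketch. The trial tuples and the function name are hypothetical. A trial is a target trial when the segment class equals the hypothesized target.

```python
def count_errors(trials):
    """trials: (segment_class, target, decision) tuples, decision True means 'yes'.
    Returns (miss rate over target trials, false alarm rate over non-target trials)."""
    target_trials = non_target_trials = misses = false_alarms = 0
    for segment_class, target, decision in trials:
        if segment_class == target:
            target_trials += 1
            misses += not decision       # said 'no' although the target was present
        else:
            non_target_trials += 1
            false_alarms += decision     # said 'yes' although the target was absent
    return misses / target_trials, false_alarms / non_target_trials

trials = [
    ("em. vehic.", "em. vehic.", False),  # MISS
    ("em. vehic.", "phone", True),        # FALSE ALARM
    ("phone", "phone", True),             # correct detection
    ("phone", "music", False),            # correct rejection
]
p_miss, p_fa = count_errors(trials)
```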
      Decision-Error Tradeoff
 [Figure: DET curve, misses versus false alarms, with the “equal error rate”
  point marked]

 Selecting an operating point (decision threshold) along the
  dotted line trades off misses against false alarms.
 Optimal operating point is application dependent.
 Low false alarm rates are desirable for most applications.
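
The trade-off can be made concrete by sweeping the decision threshold over the scores. The sketch below uses made-up scores and hypothetical function names. It approximates the equal error rate by picking the observed score at which the miss and false alarm rates are closest, which is a simplification of how evaluation tools interpolate the curve.

```python
def error_rates(target_scores, non_target_scores, threshold):
    """Decide 'yes' when score >= threshold; return (P_miss, P_fa)."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in non_target_scores) / len(non_target_scores)
    return p_miss, p_fa

def equal_error_rate(target_scores, non_target_scores):
    """Try every observed score as threshold and return (threshold, P_miss, P_fa)
    for the operating point where both error rates are closest."""
    best = None
    for threshold in sorted(set(target_scores) | set(non_target_scores)):
        p_miss, p_fa = error_rates(target_scores, non_target_scores, threshold)
        if best is None or abs(p_miss - p_fa) < abs(best[1] - best[2]):
            best = (threshold, p_miss, p_fa)
    return best

target_scores = [2.1, 1.3, 0.9, 0.2, -0.4]        # scores of target trials
non_target_scores = [0.5, -0.1, -0.8, -1.5, -2.2]  # scores of non-target trials
threshold, p_miss, p_fa = equal_error_rate(target_scores, non_target_scores)
strict = error_rates(target_scores, non_target_scores, 1.0)
```

Raising the threshold (as in `strict`) removes false alarms at the price of more misses.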
       Decision Cost Function
     C(E_T, E_N) = C_{Miss} \cdot P_{Target} \cdot P_{Miss}(E_T)
                   + C_{FA} \cdot (1 - P_{Target}) \cdot P_{FA}(E_T, E_N)

     where E_T and E_N are the target and non-target events,
     and C_{Miss}, C_{FA} and P_{Target} are application model parameters.

     The application parameters for EER are:

     C_{Miss} = C_{FA} = 1   and   P_{Target} = 0.5

 Weighted sum of misses and false alarms using
  variable costs and priors.
 Application model parameters are selected
  according to the application.
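
The cost function is a direct weighted sum and can be sketched in a few lines. The error probabilities and the second parameter set below are made up for illustration. The default arguments are the parameter values given above for EER.

```python
def decision_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of miss and false alarm probabilities for one
    target/non-target pair."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Same error probabilities under two different application models.
cost_default = decision_cost(p_miss=0.10, p_fa=0.02)
cost_rare_target = decision_cost(p_miss=0.10, p_fa=0.02, c_miss=10.0, p_target=0.01)
```

With a rare target, the false alarm term dominates even though a miss is ten times as costly, which illustrates why the parameters have to be chosen per application.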
                 Example DET-Plot


 [Figure: DET plot, miss probability versus false alarm probability]

  Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection
  for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
           Example Cost Chart
 Costs C(At, An); columns: target class At, rows: non-target class An

 An \ At      C       YF      YM      AF      AM      SF      SM
 C            --      0.220   0.092   0.145   0.083   0.133   0.069
 YF           0.166   --      0.081   0.201   0.080   0.198   0.070
 YM           0.076   0.084   --      0.130   0.203   0.108   0.188
 AF           0.088   0.161   0.110   --      0.095   0.219   0.082
 AM           0.064   0.083   0.254   0.139   --      0.105   0.228
 SF           0.096   0.150   0.100   0.249   0.091   --      0.095
 SM           0.065   0.085   0.238   0.117   0.246   0.118   --
 Avg Cost     0.092   0.130   0.146   0.164   0.133   0.147   0.122
 (per At)
 Overall Avg Cost     0.133

 Acoustic GMM/SVM Supervector system on 7-class age task
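
The averages in the chart can be reproduced from the pairwise costs. The sketch below assumes that each column belongs to one target class and that the "Avg Cost" row is the mean over the non-target rows of that column. This reading is an interpretation of the chart and is not stated on the slide. The numbers are copied from the chart.

```python
classes = ["C", "YF", "YM", "AF", "AM", "SF", "SM"]
# costs[non_target][target]; None marks the diagonal (no such trial pair).
costs = [
    [None, 0.220, 0.092, 0.145, 0.083, 0.133, 0.069],
    [0.166, None, 0.081, 0.201, 0.080, 0.198, 0.070],
    [0.076, 0.084, None, 0.130, 0.203, 0.108, 0.188],
    [0.088, 0.161, 0.110, None, 0.095, 0.219, 0.082],
    [0.064, 0.083, 0.254, 0.139, None, 0.105, 0.228],
    [0.096, 0.150, 0.100, 0.249, 0.091, None, 0.095],
    [0.065, 0.085, 0.238, 0.117, 0.246, 0.118, None],
]

def average_cost_per_target(costs):
    """Mean of each column, skipping the diagonal entry."""
    averages = []
    for t in range(len(costs)):
        column = [row[t] for row in costs if row[t] is not None]
        averages.append(sum(column) / len(column))
    return averages

per_target = average_cost_per_target(costs)
overall = sum(per_target) / len(per_target)
reported = [0.092, 0.130, 0.146, 0.164, 0.133, 0.147, 0.122]
```

Under this reading the computed column means agree with the reported row to within rounding, and their mean agrees with the overall value of 0.133.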
Pseudo NIST Evaluation Procedure
   ERL provided development and evaluation data as
    representative as possible for the application.
   Three months before the evaluation, ICSI was provided with
    the development data.
   At a pre-determined date, the blind evaluation data was
    provided to ICSI for processing.
   The system's output was submitted to ERL in NIST format.
   ERL downloaded the scoring software from NIST’s website
    and made the modifications required by the changes in the
    labels.
   ERL ran the software on the submitted system output.
   The results were then disclosed to ICSI along with the keys
    (truth) for further analysis.
   --> Fair “one-shot” evaluation, no parameter tuning on the
    evaluation set.
 How can you normalize your
  features in order to obtain a
  uniform scale and a uniform
  distribution?
 Proposing rank normalization
  and polynomial rank
  normalization
              Background




 Fundamental frequency (pitch): 75-200 Hz
 Jitter: 0.001324 PPQ
 --> implicit feature weighting
 Mean/Variance Normalization
 a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}

 [Figure: distribution of the normalized feature values]

 uniform scale
 non-uniform distribution
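
The formula above maps every value linearly into the range between the observed minimum and maximum. A minimal sketch with made-up pitch values (the function name is hypothetical):

```python
def min_max_normalize(values):
    """Map values linearly onto [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values identical
    return [(v - lo) / (hi - lo) for v in values]

pitch_hz = [75.0, 100.0, 110.0, 120.0, 200.0]  # made-up values in the 75-200 Hz range
normalized = min_max_normalize(pitch_hz)
```

The result lies on a uniform scale, but the values stay clustered where they were clustered before: three of the five values fall between 0.2 and 0.36.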
                 Rank-Normalization
 feature values (example)     0.01, 0.06, 0.13, 0.29

 background model (ordered list of values -> rank)
   0       0
   0.01    0.25
   0.06    0.5
   0.13    0.75
   0.29    1

 normalized feature values (example)     0.75, 0.4, 0.2, ...

  create ordered list of values using background (bg) data
  rank = position in list / number of values
  no occurrence mapped to 0
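
A table-based rank normalization along these lines can be sketched as follows. To reproduce the ranks of the example (0, 0.25, 0.5, 0.75, 1 for five background values), the sketch divides the zero-based position by the number of values minus one, which is an assumption about how the example was computed. Interpolating between neighbouring ranks for unseen values and clamping outside the observed range are likewise assumptions, and the function names are hypothetical.

```python
from bisect import bisect_left

def build_rank_table(background_values):
    """Sorted unique background values with ranks spread evenly over [0, 1]."""
    ordered = sorted(set(background_values))
    if len(ordered) == 1:
        return ordered, [0.0]
    ranks = [i / (len(ordered) - 1) for i in range(len(ordered))]
    return ordered, ranks

def rank_normalize(value, table):
    """Look up the rank of a value; interpolate linearly for unseen values."""
    ordered, ranks = table
    if value <= ordered[0]:
        return ranks[0]
    if value >= ordered[-1]:
        return ranks[-1]
    i = bisect_left(ordered, value)
    if ordered[i] == value:
        return ranks[i]
    left, right = ordered[i - 1], ordered[i]
    fraction = (value - left) / (right - left)
    return ranks[i - 1] + fraction * (ranks[i] - ranks[i - 1])

table = build_rank_table([0.13, 0.0, 0.29, 0.01, 0.06])
seen = rank_normalize(0.13, table)
unseen = rank_normalize(0.035, table)  # halfway between 0.01 and 0.06
```

Because ranks depend only on the ordering of the background values, the normalized values are spread evenly over the range even when the raw values are heavily skewed.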
             Rank Normalization
 [Figure: feature distribution before and after rank normalization]

 (+) uniform distribution
 (-) large three-dimensional lookup tables
 (-) linear interpolation for unseen values
      (larger values? smaller values?)
   Polynomial Rank Normalization
    use ranks to train a polynomial
    apply polynomial instead of look-up tables




 better interpolation
 no need to store look-up
  tables
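
The idea of replacing the look-up table by a polynomial can be sketched with an ordinary least-squares fit of rank against value. The degree, the fitting method (normal equations solved by Gaussian elimination), the clamping to [0, 1], and the function names are assumptions for illustration. The talk does not specify how the polynomial was trained.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns [c0, c1, ...] for c0 + c1*x + c2*x^2 + ..."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Horner evaluation of the polynomial, clamped to the rank range [0, 1]."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return min(1.0, max(0.0, result))

background = [0.0, 0.01, 0.06, 0.13, 0.29]  # sorted background values
ranks = [i / (len(background) - 1) for i in range(len(background))]
coeffs = fit_polynomial(background, ranks, degree=2)
low = evaluate(coeffs, 0.03)
high = evaluate(coeffs, 0.2)
```

Only the coefficients need to be stored, and any value, seen or unseen, is mapped by evaluating the polynomial.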




                             Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language
                             Recognition System. In Proceedings of the Odyssey 2008 Workshop
                             on Speaker and Language Recognition. Stellenbosch, South Africa, 2008
                         Conclusions
    Speech as a source of information for non-intrusive user
     modeling

    Speech/signal processing               Take-away messages

 Vocal aging -> features             Knowledge-driven
  for speaker age                      feature selection
  recognition
 GMM/SVM supervector                 Classification methods
  approach for acoustic                for independent “bag of
  speech features                      observations” features
 Detection task and                  Valid application-
  pseudo-NIST evaluation               independent evaluation
  procedure
 Rank and polynomial                 Feature space warping
  rank normalization                   normalization
Thank you!

				