Power Linear Discriminant Analysis (PLDA)


   M. Sakai, N. Kitaoka and S. Nakagawa, “Generalization of Linear Discriminant
   Analysis Used in Segmental Unit Input HMM for Speech Recognition,” Proc.
   ICASSP, 2007.

   M. Sakai, N. Kitaoka and S. Nakagawa, “Selection of Optimal Dimensionality
   Reduction Method Using Chernoff Bound for Segmental Unit Input HMM,”
   Proc. INTERSPEECH, 2007.


References:
S. Nakagawa and K. Yamamoto, “Evaluation of Segmental Unit Input HMM,” Proc. ICASSP, 1996.
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed.


                                                             Presented by Winston Lee
• M. Sakai, N. Kitaoka and S. Nakagawa, “Generalization
  of Linear Discriminant Analysis Used in Segmental
  Unit Input HMM for Speech Recognition,” Proc.
  ICASSP, 2007




                                                          2
                              Abstract
• Precisely modeling the time dependency of features is one of the important issues for
  speech recognition. Segmental unit input HMM with a dimensionality reduction method is
  widely used to address this issue. Linear discriminant analysis (LDA) and heteroscedastic
  discriminant analysis (HDA) are classical and popular approaches to reduce dimensionality.
  However, it is difficult to find one particular criterion suitable for any kind of data set
  when carrying out dimensionality reduction while preserving discriminative information.
• In this paper, we propose a new framework, which we call power linear discriminant
  analysis (PLDA). PLDA can describe various criteria, including LDA and HDA, with one
  control parameter. Experimental results show that PLDA is more effective than PCA, LDA,
  and HDA for various data sets.




                                                                           3
                              Introduction
• Hidden Markov Models (HMMs) have been widely used to model
  speech signals for speech recognition. However, HMMs cannot
  precisely model the time dependency of feature parameters.
   – Output independence assumption of HMMs: each observation depends only on the state
     that generated it, not on neighboring observations.
• Segmental unit input HMM is widely (?) used to overcome this
  limitation.
• In segmental unit input HMM, a feature vector is derived from
  several successive frames. The immediate use of several
  successive frames inevitably increases the dimensionality of
  parameters.
• Therefore, a dimensionality reduction method is applied to the spliced frames (a minimal
  splicing sketch follows below).
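
A minimal sketch of the frame-splicing step described above, assuming a NumPy feature
matrix of shape (T, d); the window length and all names are illustrative, not from the paper:

```python
import numpy as np

def splice_frames(features, width=4):
    """Stack `width` successive frames into one vector (segmental unit input).

    features: (T, d) array of frame-level features.
    Returns an array of shape (T - width + 1, width * d).
    """
    T, d = features.shape
    return np.stack([features[t:t + width].reshape(-1)
                     for t in range(T - width + 1)])

# Example: 13-dim MFCC frames, 4-frame segments -> 52-dim spliced vectors.
mfcc = np.random.randn(100, 13)
spliced = splice_frames(mfcc, width=4)   # shape (97, 52)
```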




                                                                                    4
                              Segmental Unit Input HMM
    • Let the observation sequence be y = y_1 y_2 ... y_T and the state sequence be
      x = x_1 x_2 ... x_T. Marginalizing over the state sequence, the output probability
      of an HMM is

        P(y_1 ... y_T) = \sum_x P(x_1 ... x_T) P(y_1 ... y_T | x_1 ... x_T)
                       = \sum_x \prod_i P(x_i | x_1 ... x_{i-1}) P(y_i | y_1 ... y_{i-1}, x_1 ... x_i).

      A chain of approximations then leads to the standard HMM:

        P(y_1 ... y_T)
          \approx \sum_x \prod_i P(y_i | y_{i-3} y_{i-2} y_{i-1}, x_{i-1} x_i) P(x_i | x_{i-1})
          = \sum_x \prod_i \frac{P(y_{i-3} y_{i-2} y_{i-1} y_i | x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} | x_{i-1} x_i)} P(x_i | x_{i-1})      (Bayes' rule)
          \approx \sum_x \prod_i \frac{P(y_{i-1} y_i | x_{i-1} x_i)}{P(y_{i-1} | x_{i-1} x_i)} P(x_i | x_{i-1})
          = \sum_x \prod_i P(y_i | y_{i-1}, x_{i-1} x_i) P(x_i | x_{i-1})      (Bayes' rule)
          \approx \sum_x \prod_i P(y_{i-1} y_i | x_{i-1} x_i) P(x_i | x_{i-1})
          \approx \sum_x \prod_i P(y_i | x_{i-1} x_i) P(x_i | x_{i-1}).
                                                                                                                    5
                 Segmental Unit Input HMM (cont.)

    P(y_1 ... y_T)
      \approx \sum_x \prod_i P(y_i | y_{i-3} y_{i-2} y_{i-1}, x_{i-1} x_i) P(x_i | x_{i-1})
      = \sum_x \prod_i \frac{P(y_{i-3} y_{i-2} y_{i-1} y_i | x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} | x_{i-1} x_i)} P(x_i | x_{i-1})      conditional-density HMM of 4-frame segments
      \approx \sum_x \prod_i \frac{P(y_{i-1} y_i | x_{i-1} x_i)}{P(y_{i-1} | x_{i-1} x_i)} P(x_i | x_{i-1})      conditional-density HMM of 2-frame segments
      = \sum_x \prod_i P(y_i | y_{i-1}, x_{i-1} x_i) P(x_i | x_{i-1})
      \approx \sum_x \prod_i P(y_{i-1} y_i | x_{i-1} x_i) P(x_i | x_{i-1})      segmental unit input HMM of 2-frame segments
      \approx \sum_x \prod_i P(y_i | x_{i-1} x_i) P(x_i | x_{i-1})      the standard HMM
                                                                                                       6
                  Segmental Unit Input HMM (cont.)
• The segmental unit input HMM in (Nakagawa, 1996) is an approximation of the
  conditional-density form:

    \sum_x \prod_i \frac{P(y_{i-3} y_{i-2} y_{i-1} y_i | x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} | x_{i-1} x_i)} P(x_i | x_{i-1})
      \approx \sum_x \prod_i P(y_{i-3} y_{i-2} y_{i-1} y_i | x_{i-1} x_i) P(x_i | x_{i-1})      segmental unit input HMM of 4-frame segments

• In the segmental unit input HMM, several successive frames are input as one vector.
  Because the dimensionality of this vector increases, the covariance matrix is estimated
  less precisely.
• In (Nakagawa, 1996), the Karhunen-Loeve (K-L) expansion and the Modified Quadratic
  Discriminant Function (MQDF) are used to deal with this problem.



                                                                                                   7
                                 K-L Expansion
• Estimate the covariance matrix A = (a_{lm}) from the samples {y_i}:

    a_{lm} = \frac{1}{I} \sum_{i=1}^{I} (y_{il} - \bar{y}_l)(y_{im} - \bar{y}_m).

• Compute the eigenvalues {\lambda_j} and eigenvectors {\varphi_j}:

    A \varphi_j = \lambda_j \varphi_j.

• Sort the eigenvalues (and their corresponding eigenvectors):

    \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge ... \ge \lambda_p.

• Compute the dimension-compressed parameters by

    y'_i = B y_i,

  where the transformation matrix is

    B = (\varphi_1 \varphi_2 ... \varphi_p)^T.
                                                                    8
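
A compact NumPy sketch of the four steps above (covariance estimate, eigendecomposition,
sorting, projection); function and variable names are illustrative, and np.cov's default
1/(I-1) normalization is used instead of the 1/I on the slide:

```python
import numpy as np

def kl_expansion(Y, p):
    """K-L expansion: project samples Y (I x n) onto the top-p eigenvectors of their covariance."""
    A = np.cov(Y, rowvar=False)                 # covariance estimate a_lm
    eigvals, eigvecs = np.linalg.eigh(A)        # A phi_j = lambda_j phi_j
    order = np.argsort(eigvals)[::-1]           # lambda_1 >= lambda_2 >= ...
    B = eigvecs[:, order[:p]].T                 # B = (phi_1 ... phi_p)^T, shape (p, n)
    return Y @ B.T, B                           # rows are y'_i = B y_i

Y = np.random.randn(500, 52)                    # e.g., 500 spliced 52-dim vectors
Y_reduced, B = kl_expansion(Y, p=20)            # Y_reduced has shape (500, 20)
```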
                         K-L Expansion (cont.)
• In the statistical literature, K-L expansion is generally called
  principal components analysis (PCA).
• Some criteria of K-L expansion:
    – minimum mean-square error (MMSE)
    – maximum scatter measure
    – minimum entropy
• Remarks:
    – Why orthonormal linear transformations?
      Ans: To maintain the structure of the distribution.




                                                                     9
                                Review on LDA
• Given n-dimensional features x_j \in R^n (j = 1, 2, ..., N), e.g.,

    x_j = (o_{j-(d-1)}^T, ..., o_j^T)^T,

  let us find a transformation matrix B \in R^{n \times p} that maps these features to
  p-dimensional features z_j \in R^p (j = 1, 2, ..., N; p < n), where z_j = B^T x_j and
  N denotes the number of features.
• Within-class covariance matrix:

    \Sigma_w = \frac{1}{N} \sum_{k=1}^{c} \sum_{x_j \in D_k} (x_j - \mu_k)(x_j - \mu_k)^T = \sum_{k=1}^{c} P_k \Sigma_k.

• Between-class covariance matrix:

    \Sigma_b = \sum_{k=1}^{c} P_k (\mu_k - \mu)(\mu_k - \mu)^T.
                                                                                   10
                           Review on LDA (cont.)
• In LDA, the objective function is defined as follows:

    J_{LDA}(B) = \frac{|B^T \Sigma_b B|}{|B^T \Sigma_w B|} = \frac{|\tilde{\Sigma}_b|}{\left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k \right|},
    \quad \tilde{\Sigma}_b = B^T \Sigma_b B, \; \tilde{\Sigma}_k = B^T \Sigma_k B.

• LDA finds a transformation matrix B that maximizes this function.
• The eigenvectors corresponding to the largest eigenvalues of \Sigma_w^{-1} \Sigma_b are the
  solution, as sketched below.
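
A minimal NumPy sketch of that eigenvector solution, assuming the class statistics
\Sigma_w and \Sigma_b have already been estimated (names are illustrative):

```python
import numpy as np

def lda_transform(Sigma_w, Sigma_b, p):
    """Columns of the returned B (n x p) are the leading eigenvectors of Sigma_w^{-1} Sigma_b,
    so that z_j = B^T x_j is the p-dimensional LDA projection."""
    # Sigma_w^{-1} Sigma_b is not symmetric in general, so use eig rather than eigh.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma_w, Sigma_b))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:p]].real
```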




                                                                        11
                                Review on HDA
• LDA is not the optimal transform when the class distributions are
  heteroscedastic.
• HLDA: Kumar incorporated the maximum likelihood estimation of
  parameters for differently distributed Gaussians.
• HDA: Saon proposed another objective function similar to Kumar’s
  and showed its relationship with a constrained maximum likelihood
  estimation.
• Saon’s HDA objective function (a sketch of evaluating it in log form follows below):

    J_{HDA}(B) = \prod_{k=1}^{c} \left( \frac{|B^T \Sigma_b B|}{|B^T \Sigma_k B|} \right)^{N_k}
               = \left( \frac{|\tilde{\Sigma}_b|}{\prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}} \right)^{N},
    \quad \tilde{\Sigma}_b = B^T \Sigma_b B, \; \tilde{\Sigma}_k = B^T \Sigma_k B, \; N_k = P_k N.




                                                                          12
                 Dependency on Data Set
• Figure 1(a) shows that HDA has higher separability than LDA for the
  data set.
• Figure 1(b) shows that LDA has higher separability than HDA for
  another data set.
• Figure 1(c) shows the case with another data set where both LDA
  and HDA have low separabilities.
• All results show that the separabilities of LDA and HDA depend
  significantly on data sets.




                                                                        13
Dependency on Data Set (cont.)

(Figure 1: example data sets (a)-(c) and their LDA and HDA projections.)

                                 14
            Relationship between LDA and HDA

    J_{LDA}(B) = \frac{|B^T \Sigma_b B|}{|B^T \Sigma_w B|} = \frac{|\tilde{\Sigma}_b|}{\left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k \right|}    ..........(1)

    J_{HDA}(B) = \prod_{k=1}^{c} \left( \frac{|B^T \Sigma_b B|}{|B^T \Sigma_k B|} \right)^{N_k} = \left( \frac{|\tilde{\Sigma}_b|}{\left| \prod_{k=1}^{c} \tilde{\Sigma}_k^{P_k} \right|} \right)^{N}    ..........(2)

    where \tilde{\Sigma}_b = B^T \Sigma_b B and \tilde{\Sigma}_k = B^T \Sigma_k B.

• The denominator in Eq. (1) can be viewed as the determinant of the weighted arithmetic
  mean of the class covariance matrices.
• The denominator in Eq. (2) can be viewed as the determinant of the weighted geometric
  mean of the class covariance matrices.




                                                                                                    15
                             PLDA
• The difference between LDA and HDA lies in the definition of the mean of the class
  covariance matrices.
• Extending this interpretation, their denominators can be replaced by the determinant of
  a weighted harmonic mean, of a root mean square, and so on.
• In this paper, a more general definition of the mean is used: the weighted mean of
  order m, also called the weighted power mean.
• The new approach, which uses the weighted power mean as the denominator of the
  objective function, is called Power Linear Discriminant Analysis (PLDA).




                                                                      16
                                     PLDA (cont.)
• The new objective function is as follows:

    J_{PLDA}(B, m) = \frac{|\tilde{\Sigma}_b|}{\left| \left( \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right)^{1/m} \right|}

• Both LDA and HDA are special cases of PLDA (a sketch of evaluating this objective
  follows below).
• m = 1 (arithmetic mean):

    J_{PLDA}(B, 1) = \frac{|\tilde{\Sigma}_b|}{\left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k \right|} = J_{LDA}(B)

• m = 0 (geometric mean):

    J_{PLDA}(B, 0) = \frac{|\tilde{\Sigma}_b|}{\left| \prod_{k=1}^{c} \tilde{\Sigma}_k^{P_k} \right|} = J_{HDA}(B)^{1/N},

  which is maximized by the same B as J_{HDA}(B).

                                                                                   17
                                      Appendix A
• Weighted power mean:
   – If w_1, w_2, ..., w_n are positive real numbers such that w_1 + w_2 + ... + w_n = 1, the
     r-th weighted power mean of x_1, x_2, ..., x_n is defined as

       M_w^r(x_1, x_2, ..., x_n) = (w_1 x_1^r + w_2 x_2^r + ... + w_n x_n^r)^{1/r}.

       M_w^r           symbol   weighted mean
       M_w^{-\infty}   min      minimum
       M_w^{-1}        H        harmonic mean
       M_w^{0}         G        geometric mean
       M_w^{1}         A        arithmetic mean
       M_w^{2}         RMS      root mean square
       M_w^{+\infty}   max      maximum
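
A scalar sketch of this definition and the named special cases from the table; the small
tolerance around r = 0 stands in for the limit that gives the geometric mean (values are
illustrative):

```python
import numpy as np

def weighted_power_mean(x, w, r):
    """M_w^r(x) = (sum_i w_i x_i^r)^(1/r); r -> 0 gives the weighted geometric mean."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    if abs(r) < 1e-12:
        return float(np.exp(np.sum(w * np.log(x))))
    return float(np.sum(w * x**r) ** (1.0 / r))

x, w = [1.0, 4.0, 16.0], [0.5, 0.25, 0.25]
print(weighted_power_mean(x, w, -1))  # harmonic mean
print(weighted_power_mean(x, w,  0))  # geometric mean
print(weighted_power_mean(x, w,  1))  # arithmetic mean
print(weighted_power_mean(x, w,  2))  # root mean square
```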


                                                                                       18
                                           Appendix B

• Let \psi_P^m(\tilde{\Sigma}_1, ..., \tilde{\Sigma}_c) = \left| \left( \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right)^{1/m} \right|.
  We want to find \lim_{m \to 0} \psi_P^m.

• First take the logarithm:

    \log \psi_P^m = \frac{1}{m} \log \left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right|.

• Then, since both the numerator and the denominator vanish as m \to 0, apply
  l'Hôpital's rule:

    \lim_{m \to 0} \log \psi_P^m
      = \lim_{m \to 0} \frac{\log \left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right|}{m}
      = \frac{\sum_{k=1}^{c} P_k \log |\tilde{\Sigma}_k|}{\sum_{k=1}^{c} P_k}
      = \sum_{k=1}^{c} \log |\tilde{\Sigma}_k|^{P_k}
      = \log \prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}.

• So

    \lim_{m \to 0} \psi_P^m(\tilde{\Sigma}_1, ..., \tilde{\Sigma}_c) = \prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}.




                                                                                                      19
                                       PLDA (cont.)
• Assuming that the control parameter m is constrained to be an integer, the derivatives of
  the PLDA objective function are formulated as follows:

    \frac{\partial}{\partial B} \log J_{PLDA}(B, m) = 2 \Sigma_b B \tilde{\Sigma}_b^{-1} - 2 D_m

    D_m =
      \frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{m} X_{m,j,k},          m > 0,
      \sum_{k=1}^{c} P_k \Sigma_k B \tilde{\Sigma}_k^{-1},                          m = 0,
      -\frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{-m} Y_{m,j,k},        otherwise,

    X_{m,j,k} = \tilde{\Sigma}_k^{m-j} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1} \tilde{\Sigma}_k^{j-1},
    \quad
    Y_{m,j,k} = \tilde{\Sigma}_k^{m+j-1} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1} \tilde{\Sigma}_k^{-j}.


                                                                                         20
                                         Appendix C
    \frac{\partial}{\partial B} \log J_{PLDA}(B, m)
      = \frac{\partial}{\partial B} \log \frac{|\tilde{\Sigma}_b|}{\left| \left( \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right)^{1/m} \right|}
      = \frac{\partial}{\partial B} \left( \log |\tilde{\Sigma}_b| - \frac{1}{m} \log \left| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \right| \right)
      = 2 \Sigma_b B \tilde{\Sigma}_b^{-1} - 2 \sum_{k=1}^{c} P_k \Sigma_k B \, \tilde{\Sigma}_k^{m-1} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1}
        (the last term written as if the projected covariances commuted)

• m > 0: since the matrices do not commute in general, the last term is expanded using the
  identity in the note below:

    \sum_{k=1}^{c} P_k \Sigma_k B \, \tilde{\Sigma}_k^{m-1} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1}
      = \frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{m} \tilde{\Sigma}_k^{m-j} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1} \tilde{\Sigma}_k^{j-1}
      = \frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{m} X_{m,j,k}.

    Note:  A^{m-1} B = \frac{1}{m} \sum_{j=1}^{m} A^{m-j} B A^{j-1},  m > 0.

                                                                                          21
                           Appendix C (cont.)
• m = 0: trivial (omitted).
• m < 0: using the identity in the note below, the last term becomes

    \sum_{k=1}^{c} P_k \Sigma_k B \, \tilde{\Sigma}_k^{m-1} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1}
      = -\frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{-m} \tilde{\Sigma}_k^{m+j-1} \left( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \right)^{-1} \tilde{\Sigma}_k^{-j}
      = -\frac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{-m} Y_{m,j,k}.

    Note:  A^{m-1} B = -\frac{1}{m} \sum_{j=1}^{-m} A^{m+j-1} B A^{-j},  m < 0.

                                                                                    22
                         The Diagonal Case
• For computational simplicity, the covariance matrix of class k is often assumed to be
  diagonal.
• Since diagonal matrices commute under multiplication, the derivative of the PLDA
  objective function simplifies to:

    \frac{\partial}{\partial B} \log J_{PLDA}(B, m)
      = 2 \Sigma_b B \tilde{\Sigma}_b^{-1}
        - 2 \sum_{k=1}^{c} P_k \Sigma_k B \, \mathrm{diag}(\tilde{\Sigma}_k)^{m-1} \left( \sum_{l=1}^{c} P_l \, \mathrm{diag}(\tilde{\Sigma}_l)^m \right)^{-1}.




                                                                                         23
                               Experiments
• Corpus: CENSREC-3
   – The CENSREC-3 is designed as an evaluation framework of Japanese isolated
     word recognition in real driving car environments.
   – Speech data was collected using 2 microphones, a close-talking (CT)
     microphone and a hands-free (HF) microphone.
   – For training, a total of 14,050 utterances spoken by 293 drivers (202 males and
     91 females) were recorded with both microphones.
   – For evaluation, a total of 2,646 utterances spoken by 18 speakers (8 males and
     10 females) were evaluated for each microphone.




                                                                                       24
Experiments (cont.)




                      25
                               P.S.
• Apparently, the derivation of PLDA is merely an induction from LDA and HDA.
• The authors do not seem to give any expressive statistical or physical meaning to PLDA.
• The experimental results show that PLDA (with some parameter m) outperforms the other
  two approaches, but the paper does not explain why.
• A revised version of Fisher's criterion!
• The concept of the mean!




                                                                      26
• M. Sakai, N. Kitaoka and S. Nakagawa, “Selection of
  Optimal Dimensionality Reduction Method Using
  Chernoff Bound for Segmental Unit Input HMM,” Proc.
  INTERSPEECH, 2007




                                                        27
                            Abstract
• To precisely model the time dependency of features, segmental unit input HMM with a
  dimensionality reduction method has been widely used for speech recognition. Linear
  discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are popular
  approaches to reduce the dimensionality. We have proposed another dimensionality
  reduction method called power linear discriminant analysis (PLDA). Selecting the
  dimensionality reduction method that yields the highest recognition performance by
  trial and error requires much time to train HMMs and to test the recognition
  performance for each candidate method.
• In this paper we propose a performance comparison method without training or testing.
  We show that the proposed method, based on the Chernoff bound, can rapidly and
  accurately evaluate the relative recognition performance.


                                                                        28
          Performance Comparison Method

• Instead of a recognition error, the class separability error of the features in the
  projected space is used as the criterion to estimate the parameter m of PLDA.




                                                              29
      Performance Comparison Method (cont.)

• Two-class problem:
   – Bayes error of the projected features on evaluation data:

       \varepsilon = \int \min[ P_1 p_1(x), P_2 p_2(x) ] \, dx

       P_i : prior probability of class i
       p_i(x) : conditional density function of class i

   – The Bayes error \varepsilon can represent the classification error, assuming that the
     training data and the evaluation data come from the same distribution.
   – However, the Bayes error is hard to measure directly.




                                                                        30
     Performance Comparison Method (cont.)

• Two-class problem (cont.):
   – Instead, we use the Chernoff bound between class 1 and class 2 as the class
     separability error:

       \varepsilon_u^{1,2} = P_1^s P_2^{1-s} \int p_1^s(x) \, p_2^{1-s}(x) \, dx, \quad 0 \le s \le 1
       (s = 0.5 gives the Bhattacharyya bound)

       \varepsilon_u : an upper bound of \varepsilon

   – For normal class densities, the bound can be rewritten as

       \varepsilon_u^{1,2} = P_1^s P_2^{1-s} \exp(-\mu_{1,2}(s)),

     where

       \mu_{1,2}(s) = \frac{s(1-s)}{2} (\mu_2 - \mu_1)^T \left( s\Sigma_1 + (1-s)\Sigma_2 \right)^{-1} (\mu_2 - \mu_1)
                      + \frac{1}{2} \ln \frac{| s\Sigma_1 + (1-s)\Sigma_2 |}{|\Sigma_1|^s |\Sigma_2|^{1-s}}.

     (The covariance matrices are treated as diagonal ones here.)
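
A sketch of this bound for two Gaussian classes (full or diagonal covariance matrices both
work); function and argument names are illustrative:

```python
import numpy as np

def chernoff_bound(P1, P2, mu1, mu2, Sigma1, Sigma2, s=0.5):
    """Chernoff upper bound on the two-class Bayes error; s = 0.5 gives the Bhattacharyya bound."""
    Sigma_s = s * Sigma1 + (1 - s) * Sigma2
    diff = mu2 - mu1
    term1 = 0.5 * s * (1 - s) * diff @ np.linalg.solve(Sigma_s, diff)
    term2 = 0.5 * (np.linalg.slogdet(Sigma_s)[1]
                   - s * np.linalg.slogdet(Sigma1)[1]
                   - (1 - s) * np.linalg.slogdet(Sigma2)[1])
    mu_12 = term1 + term2
    return P1**s * P2**(1 - s) * np.exp(-mu_12)
```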


                                                                                                  31
Performance Comparison Method (cont.)




                                        32
      Performance Comparison Method (cont.)

• Multi-class problem:
   – Several error functions can be defined for multi-class data:

       \tilde{\varepsilon}_u = \sum_{i=1}^{c} \sum_{j=1}^{c} I(i, j) \, \varepsilon_u^{i,j}

       I(\cdot) : an indicator function

   – Sum of pairwise approximated errors:

       I(i, j) = 1 if j > i, and 0 otherwise.

   – Maximum pairwise approximated error:

       I(i, j) = 1 if j > i and (i, j) = (\hat{i}, \hat{j}), and 0 otherwise,
       where (\hat{i}, \hat{j}) = \arg\max_{(i,j)} \varepsilon_u^{i,j}.


                                                                              33
     Performance Comparison Method (cont.)

• Multi-class problem (cont.):
   – Sum of maximum approximated errors in each class:

       I(i, j) = 1 if j = \hat{j}_i, and 0 otherwise, where \hat{j}_i = \arg\max_{j} \varepsilon_u^{i,j}.

   – A sketch combining the three indicator choices is given below.
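
A minimal sketch, assuming a precomputed matrix errors[i, j] of pairwise Chernoff bounds
(the function name and mode strings are illustrative):

```python
import numpy as np

def class_separability_error(errors, mode="sum_pairwise"):
    """Combine pairwise Chernoff bounds errors[i, j] into one separability error.

    mode: "sum_pairwise"      - sum over all pairs with j > i
          "max_pairwise"      - largest single pairwise bound
          "sum_max_per_class" - for each class i, add its worst confusion max_j errors[i, j]
    """
    e = errors.copy()
    np.fill_diagonal(e, 0.0)
    if mode == "sum_pairwise":
        return float(np.sum(np.triu(e, k=1)))
    if mode == "max_pairwise":
        return float(np.max(np.triu(e, k=1)))
    if mode == "sum_max_per_class":
        return float(np.sum(np.max(e, axis=1)))
    raise ValueError(mode)
```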




                                                        34
Experimental Results




                       35
Experimental Results (cont.)




                               36
              Experimental Results (cont.)

• No comparison method could predict the best dimensionality reduction method for both
  evaluation sets simultaneously.
   – This presumably results from neglecting the time information of the speech feature
     sequences when measuring the class separability error, and from modeling each class
     distribution as a unimodal normal distribution.
• Computational costs




                                                                         37
                           P.S.

• The experimental results do not explicitly explain the relationship between WER and the
  class separability error for a given m. That is, a lower class separability error does
  not explicitly guarantee a lower WER. (The authors only state that they “agree well.”)
• In the experiments, the authors do not explain the differences among the three criteria
  used to calculate the approximated errors.
• Still, this is a good attempt to take something out of the black box (WER).




                                                              38