
Power Linear Discriminant Analysis (PLDA)

Papers:
• M. Sakai, N. Kitaoka and S. Nakagawa, "Generalization of Linear Discriminant Analysis Used in Segmental Unit Input HMM for Speech Recognition," Proc. ICASSP, 2007
• M. Sakai, N. Kitaoka and S. Nakagawa, "Selection of Optimal Dimensionality Reduction Method Using Chernoff Bound for Segmental Unit Input HMM," Proc. INTERSPEECH, 2007

References:
• S. Nakagawa and K. Yamamoto, "Evaluation of Segmental Input Unit HMM," Proc. ICASSP, 1996
• K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed.

Presented by Winston Lee

Paper 1: M. Sakai, N. Kitaoka and S. Nakagawa, "Generalization of Linear Discriminant Analysis Used in Segmental Unit Input HMM for Speech Recognition," Proc. ICASSP, 2007

Slide 2: Abstract
• Precisely modeling the time dependency of features is one of the important issues in speech recognition. Segmental unit input HMM with a dimensionality reduction method is widely used to address this issue. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are classical and popular approaches to dimensionality reduction. However, it is difficult to find one particular criterion suitable for every kind of data set when carrying out dimensionality reduction while preserving discriminative information.
• The paper proposes a new framework called power linear discriminant analysis (PLDA). PLDA can describe various criteria, including LDA and HDA, with a single control parameter. Experimental results show that PLDA is more effective than PCA, LDA, and HDA for various data sets.

Slide 3: Introduction
• Hidden Markov models (HMMs) have been widely used to model speech signals for speech recognition. However, HMMs cannot precisely model the time dependency of feature parameters.
  – Output-independence assumption of HMMs: each observation depends only on the state that generated it, not on neighboring observations.
• Segmental unit input HMM is widely (?) used to overcome this limitation.
• In segmental unit input HMM, a feature vector is derived from several successive frames. The immediate use of several successive frames inevitably increases the dimensionality of the parameters.
• Therefore, a dimensionality reduction method is applied to the spliced frames.

Slide 4: Segmental Unit Input HMM
• Let the observation sequence be $y = y_1 y_2 \cdots y_T$ and the state sequence be $x = x_1 x_2 \cdots x_T$. The output probability of the HMM is (marginalizing over all state sequences)

$$ P(y_1 \cdots y_T) = \sum_x \prod_i P(y_i \mid y_1 y_2 \cdots y_{i-1},\, x_1 x_2 \cdots x_i)\, P(x_i \mid x_1 x_2 \cdots x_{i-1}). $$

Approximating the dependency on the history by the three preceding observations and the preceding state gives

$$ \simeq \sum_x \prod_i P(y_i \mid y_{i-3} y_{i-2} y_{i-1},\, x_{i-1} x_i)\, P(x_i \mid x_{i-1}), $$

and, by Bayes' rule,

$$ = \sum_x \prod_i \frac{P(y_{i-3} y_{i-2} y_{i-1} y_i \mid x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} \mid x_{i-1} x_i)}\, P(x_i \mid x_{i-1}). $$

Similarly, for two-frame segments,

$$ \sum_x \prod_i P(y_i \mid y_{i-1},\, x_{i-1} x_i)\, P(x_i \mid x_{i-1}) = \sum_x \prod_i \frac{P(y_{i-1} y_i \mid x_{i-1} x_i)}{P(y_{i-1} \mid x_{i-1} x_i)}\, P(x_i \mid x_{i-1}). $$

Slide 5: Segmental Unit Input HMM (cont.)
• The family of models, from least to most approximate:
  – conditional-density HMM of 4-frame segments: $\sum_x \prod_i \dfrac{P(y_{i-3} y_{i-2} y_{i-1} y_i \mid x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} \mid x_{i-1} x_i)}\, P(x_i \mid x_{i-1})$
  – conditional-density HMM of 2-frame segments: $\sum_x \prod_i \dfrac{P(y_{i-1} y_i \mid x_{i-1} x_i)}{P(y_{i-1} \mid x_{i-1} x_i)}\, P(x_i \mid x_{i-1})$
  – segmental unit input HMM of 2-frame segments: $\sum_x \prod_i P(y_{i-1} y_i \mid x_{i-1} x_i)\, P(x_i \mid x_{i-1})$
  – the standard HMM: $\sum_x \prod_i P(y_i \mid x_{i-1} x_i)\, P(x_i \mid x_{i-1})$

Slide 6: Segmental Unit Input HMM (cont.)
• The segmental unit input HMM in (Nakagawa, 1996) is the approximation

$$ \sum_x \prod_i \frac{P(y_{i-3} y_{i-2} y_{i-1} y_i \mid x_{i-1} x_i)}{P(y_{i-3} y_{i-2} y_{i-1} \mid x_{i-1} x_i)}\, P(x_i \mid x_{i-1}) \simeq \sum_x \prod_i P(y_{i-3} y_{i-2} y_{i-1} y_i \mid x_{i-1} x_i)\, P(x_i \mid x_{i-1}), $$

i.e. the segmental unit input HMM of 4-frame segments.
• In the segmental unit input HMM, several successive frames are input as one vector; since the dimensionality of the vector increases, the covariance matrix is estimated with less precision.
• In (Nakagawa, 1996), the Karhunen-Loeve (K-L) expansion and the Modified Quadratic Discriminant Function (MQDF) are used to deal with this problem.
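The frame-splicing step that motivates the dimensionality reduction can be sketched as follows (a minimal numpy sketch; the function name and the 4-frame window are illustrative choices, not code from the paper):

```python
import numpy as np

def splice_frames(feats, width=4):
    """Stack `width` successive frames into one segment-level vector.

    feats: (T, d) array of frame-level features.
    Returns a (T - width + 1, width * d) array of spliced vectors; the
    enlarged dimensionality width * d is what the reduction step shrinks.
    """
    T, d = feats.shape
    return np.stack([feats[t:t + width].reshape(-1)
                     for t in range(T - width + 1)])

# 10 frames of 12-dim features become 7 spliced vectors of 48 dims each.
x = np.random.randn(10, 12)
print(splice_frames(x).shape)  # (7, 48)
```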
Slide 7: K-L Expansion
• Estimate the covariance matrix $A = (a_{lm})$ from the samples $y_i$:
$$ a_{lm} = \frac{1}{I} \sum_{i=1}^{I} (y_{il} - \bar{y}_l)(y_{im} - \bar{y}_m). $$
• Compute the eigenvalues $\lambda_j$ and eigenvectors $\phi_j$: $A \phi_j = \lambda_j \phi_j$.
• Sort the eigenvalues (and the eigenvectors correspondingly): $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$.
• Compute the dimension-compressed parameters $y_i' = B y_i$, where the transformation matrix is $B = (\phi_1\ \phi_2\ \cdots\ \phi_p)^T$.

Slide 8: K-L Expansion (cont.)
• In the statistical literature, the K-L expansion is generally called principal component analysis (PCA).
• Some criteria of the K-L expansion:
  – minimum mean-square error (MMSE)
  – maximum scatter measure
  – minimum entropy
• Remark: why orthonormal linear transformations? Answer: to maintain the structure of the distribution.

Slide 9: Review on LDA
• Given n-dimensional features $x_j \in \mathbb{R}^n$ ($j = 1, 2, \ldots, N$), e.g. $x_j = (o_{j-(d-1)}^T, \ldots, o_j^T)^T$, find a transformation matrix $B \in \mathbb{R}^{n \times p}$ that maps these features to p-dimensional features $z_j = B^T x_j \in \mathbb{R}^p$ ($j = 1, 2, \ldots, N$; $p < n$), where $N$ denotes the number of features.
• Within-class covariance matrix:
$$ \Sigma_w = \frac{1}{N} \sum_{k=1}^{c} \sum_{x_j \in D_k} (x_j - \mu_k)(x_j - \mu_k)^T = \sum_{k=1}^{c} P_k \Sigma_k. $$
• Between-class covariance matrix:
$$ \Sigma_b = \sum_{k=1}^{c} P_k (\mu_k - \mu)(\mu_k - \mu)^T. $$

Slide 10: Review on LDA (cont.)
• In LDA, the objective function is defined as
$$ J_{\mathrm{LDA}}(B) = \frac{|B^T \Sigma_b B|}{|B^T \Sigma_w B|} = \frac{|\tilde{\Sigma}_b|}{\bigl|\sum_{k=1}^{c} P_k \tilde{\Sigma}_k\bigr|}, \qquad \tilde{\Sigma}_b = B^T \Sigma_b B, \quad \tilde{\Sigma}_k = B^T \Sigma_k B. $$
• LDA finds a transformation matrix B that maximizes this function.
• The eigenvectors corresponding to the p largest eigenvalues of $\Sigma_w^{-1} \Sigma_b$ are the solution.

Slide 11: Review on HDA
• LDA is not the optimal transform when the class distributions are heteroscedastic.
• HLDA: Kumar incorporated maximum likelihood estimation of the parameters for differently distributed Gaussians.
• HDA: Saon proposed another objective function similar to Kumar's and showed its relationship to a constrained maximum likelihood estimation.
• Saon's HDA objective function:
$$ J_{\mathrm{HDA}}(B) = \prod_{k=1}^{c} \left( \frac{|B^T \Sigma_b B|}{|B^T \Sigma_k B|} \right)^{N_k} = \left( \frac{|\tilde{\Sigma}_b|}{\prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}} \right)^{N}, $$
so maximizing it is equivalent to maximizing $|\tilde{\Sigma}_b| \big/ \prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}$ (with $P_k = N_k / N$).
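The LDA recipe on slides 9–10 can be sketched as follows (an illustrative numpy sketch, not the authors' code; `lda` and its argument names are chosen here for clarity):

```python
import numpy as np

def lda(X, y, p):
    """Return B (n x p) maximizing |B^T Sb B| / |B^T Sw B|.

    Solution: eigenvectors of Sw^{-1} Sb with the p largest eigenvalues.
    X: (N, n) features; y: (N,) integer class labels.
    """
    N, n = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for k in np.unique(y):
        Xk = X[y == k]
        Pk = len(Xk) / N                                   # class prior P_k
        mk = Xk.mean(axis=0)
        Sw += Pk * np.cov(Xk, rowvar=False, bias=True)     # P_k * Sigma_k
        Sb += Pk * np.outer(mk - mu, mk - mu)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))  # Sw^{-1} Sb
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:p]].real
```

Projecting, say, two well-separated Gaussian classes in 3-D onto the single LDA direction keeps the class means clearly apart.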
Slide 12: Dependency on Data Set
• Figure 1(a) shows a data set for which HDA has higher separability than LDA.
• Figure 1(b) shows another data set for which LDA has higher separability than HDA.
• Figure 1(c) shows yet another data set for which both LDA and HDA have low separability.
• Together, these results show that the separabilities of LDA and HDA depend significantly on the data set.

Slide 13: Dependency on Data Set (cont.)

Slide 14: Relationship between LDA and HDA
$$ J_{\mathrm{LDA}}(B) = \frac{|\tilde{\Sigma}_b|}{\bigl|\sum_{k=1}^{c} P_k \tilde{\Sigma}_k\bigr|} \qquad (1) $$
$$ J_{\mathrm{HDA}}(B) = \frac{|\tilde{\Sigma}_b|}{\prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}} \qquad (2) $$
• The denominator in Eq. (1) can be viewed as the determinant of the weighted arithmetic mean of the class covariance matrices.
• The denominator in Eq. (2) can be viewed as the determinant of the weighted geometric mean of the class covariance matrices.

Slide 15: PLDA
• The difference between LDA and HDA lies in the definition of the mean of the class covariance matrices.
• Extending this interpretation, the denominator can instead be a determinant of the weighted harmonic mean, of the root mean square, etc.
• The paper uses a more general definition of a mean, called the weighted mean of order m, or the weighted power mean.
• The new approach, which uses the weighted power mean as the denominator of the objective function, is called power linear discriminant analysis (PLDA).

Slide 16: PLDA (cont.)
• The new objective function is
$$ J_{\mathrm{PLDA}}(B, m) = \frac{|\tilde{\Sigma}_b|}{\Bigl| \bigl( \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \bigr)^{1/m} \Bigr|}. $$
• Both LDA and HDA are special cases of PLDA.
• m = 1 (arithmetic mean):
$$ J_{\mathrm{PLDA}}(B, 1) = \frac{|\tilde{\Sigma}_b|}{\bigl|\sum_{k=1}^{c} P_k \tilde{\Sigma}_k\bigr|} = J_{\mathrm{LDA}}(B). $$
• m = 0 (geometric mean, defined as the limit $m \to 0$):
$$ J_{\mathrm{PLDA}}(B, 0) = \frac{|\tilde{\Sigma}_b|}{\prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}} = J_{\mathrm{HDA}}(B). $$
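The PLDA objective above can be sketched numerically (an illustrative numpy sketch; `mat_power` and `j_plda` are names invented here, and matrix powers are taken via eigendecomposition of the symmetric projected covariances):

```python
import numpy as np

def mat_power(S, m):
    """S^m for a symmetric positive-definite S, via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**m) @ V.T

def j_plda(B, Sb, covs, priors, m):
    """J_PLDA(B, m) = |B^T Sb B| / |(sum_k P_k (B^T S_k B)^m)^{1/m}|, m != 0.

    m = 1 gives the LDA denominator (weighted arithmetic mean of the
    projected class covariances); m -> 0 approaches the HDA denominator
    (weighted geometric mean).
    """
    Sb_p = B.T @ Sb @ B
    power_mean = sum(P * mat_power(B.T @ S @ B, m)
                     for P, S in zip(priors, covs))
    return np.linalg.det(Sb_p) / np.linalg.det(mat_power(power_mean, 1.0 / m))
```

For m = 1 this reduces exactly to det(B^T Σ_b B) / det(B^T Σ_w B), i.e. J_LDA, and for small m it approaches J_HDA, mirroring the two special cases on slide 16.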
Slide 17: Appendix A
• Weighted power mean: if $w_1, w_2, \ldots, w_n$ are positive real numbers such that $w_1 + w_2 + \cdots + w_n = 1$, the r-th weighted power mean of $x_1, x_2, \ldots, x_n$ is defined as
$$ M_w^r(x_1, x_2, \ldots, x_n) = \bigl( w_1 x_1^r + w_2 x_2^r + \cdots + w_n x_n^r \bigr)^{1/r}. $$
• Special cases:
  – $r \to -\infty$: minimum
  – $r = -1$ (H): harmonic mean
  – $r \to 0$ (G): geometric mean
  – $r = 1$ (A): arithmetic mean
  – $r = 2$ (RMS): root mean square
  – $r \to +\infty$: maximum

Slide 18: Appendix B
• Let $P^{(m)} = \bigl( \sum_{k=1}^{c} P_k |\tilde{\Sigma}_k|^m \bigr)^{1/m}$; we want to find $\lim_{m \to 0} P^{(m)}$.
• First take the logarithm: $\log P^{(m)} = \frac{1}{m} \log \sum_{k=1}^{c} P_k |\tilde{\Sigma}_k|^m$.
• Since both numerator and denominator vanish as $m \to 0$, l'Hôpital's rule gives
$$ \lim_{m \to 0} \log P^{(m)} = \lim_{m \to 0} \frac{\sum_{k=1}^{c} P_k |\tilde{\Sigma}_k|^m \log |\tilde{\Sigma}_k|}{\sum_{k=1}^{c} P_k |\tilde{\Sigma}_k|^m} = \sum_{k=1}^{c} P_k \log |\tilde{\Sigma}_k|, $$
using $\sum_{k=1}^{c} P_k = 1$.
• So $\lim_{m \to 0} P^{(m)} = \prod_{k=1}^{c} |\tilde{\Sigma}_k|^{P_k}$.

Slide 19: PLDA (cont.)
• Assuming the control parameter m is constrained to be an integer, the derivative of the PLDA objective function is
$$ \frac{\partial \log J_{\mathrm{PLDA}}(B, m)}{\partial B} = 2 \Sigma_b B \tilde{\Sigma}_b^{-1} - 2 D_m, $$
where
$$ D_m = \begin{cases} \dfrac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{m} X_{m,j,k}, & m > 0, \\[2mm] \sum_{k=1}^{c} P_k \Sigma_k B\, \tilde{\Sigma}_k^{-1}, & m = 0, \\[2mm] -\dfrac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{-m} Y_{m,j,k}, & \text{otherwise}, \end{cases} $$
$$ X_{m,j,k} = \tilde{\Sigma}_k^{m-j} \Bigl( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \Bigr)^{-1} \tilde{\Sigma}_k^{j-1}, \qquad Y_{m,j,k} = \tilde{\Sigma}_k^{-j} \Bigl( \sum_{l=1}^{c} P_l \tilde{\Sigma}_l^m \Bigr)^{-1} \tilde{\Sigma}_k^{m+j-1}. $$

Slide 20: Appendix C
• For m > 0:
$$ \frac{\partial \log J_{\mathrm{PLDA}}(B, m)}{\partial B} = \frac{\partial}{\partial B} \Bigl[ \log |\tilde{\Sigma}_b| - \frac{1}{m} \log \Bigl| \sum_{k=1}^{c} P_k \tilde{\Sigma}_k^m \Bigr| \Bigr] = 2 \Sigma_b B \tilde{\Sigma}_b^{-1} - \frac{2}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{m} X_{m,j,k}, $$
using the identity $\dfrac{\partial A^m}{\partial t} = \sum_{j=1}^{m} A^{m-j} \dfrac{\partial A}{\partial t} A^{j-1}$ for $m > 0$.

Slide 21: Appendix C (cont.)
• m = 0: follows from Appendix B (the HDA case).
• m < 0: analogously, using the identity $\dfrac{\partial A^m}{\partial t} = -\sum_{j=1}^{-m} A^{-j} \dfrac{\partial A}{\partial t} A^{m+j-1}$ for $m < 0$, one obtains $D_m = -\dfrac{1}{m} \sum_{k=1}^{c} P_k \Sigma_k B \sum_{j=1}^{-m} Y_{m,j,k}$.
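The special cases of the weighted power mean, and the m → 0 limit argued in Appendix B, can be checked numerically (a small sketch; `weighted_power_mean` is a name chosen here):

```python
import numpy as np

def weighted_power_mean(x, w, r):
    """r-th weighted power mean M_w^r(x) = (sum_i w_i x_i^r)^(1/r).

    r = 0 is defined as the limit, which is the weighted geometric mean.
    """
    x, w = np.asarray(x, float), np.asarray(w, float)
    if r == 0:
        return float(np.exp(np.sum(w * np.log(x))))
    return float(np.sum(w * x**r) ** (1.0 / r))

x, w = [2.0, 8.0], [0.5, 0.5]
print(weighted_power_mean(x, w, -1))  # harmonic mean   (= 3.2)
print(weighted_power_mean(x, w, 0))   # geometric mean  (= 4.0)
print(weighted_power_mean(x, w, 1))   # arithmetic mean (= 5.0)
```

Evaluating at a small r, e.g. r = 1e-9, gives a value numerically indistinguishable from the geometric mean, mirroring the l'Hôpital argument above.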
Slide 22: The Diagonal Case
• For computational simplicity, the covariance matrix of class k is often assumed to be diagonal.
• Since diagonal matrix multiplication is commutative, the derivative of the PLDA objective function simplifies to
$$ \frac{\partial \log J_{\mathrm{PLDA}}(B, m)}{\partial B} = 2 \Sigma_b B \tilde{\Sigma}_b^{-1} - 2 \sum_{k=1}^{c} P_k \Sigma_k B\, \mathrm{diag}\bigl(\tilde{\Sigma}_k\bigr)^{m-1} \Bigl[ \sum_{l=1}^{c} P_l\, \mathrm{diag}\bigl(\tilde{\Sigma}_l\bigr)^m \Bigr]^{-1}. $$

Slide 23: Experiments
• Corpus: CENSREC-3
  – CENSREC-3 is designed as an evaluation framework for Japanese isolated word recognition in real driving-car environments.
  – Speech data was collected using two microphones, a close-talking (CT) microphone and a hands-free (HF) microphone.
  – For training, a total of 14,050 utterances spoken by 293 drivers (202 males and 91 females) were recorded with both microphones.
  – For evaluation, a total of 2,646 utterances spoken by 18 speakers (8 males and 10 females) were evaluated for each microphone.

Slide 24: Experiments (cont.)

Slide 25: P.S.
• Apparently, the derivation of PLDA is merely an induction from LDA and HDA.
• The authors do not seem to give any expressive statistical or physical meaning to PLDA.
• The experimental results show that PLDA (with some parameter m) outperforms the other two approaches, but the paper does not explain why.
• A revised version of Fisher's criterion!
• The concept of a MEAN!

Paper 2: M. Sakai, N. Kitaoka and S. Nakagawa, "Selection of Optimal Dimensionality Reduction Method Using Chernoff Bound for Segmental Unit Input HMM," Proc. INTERSPEECH, 2007

Slide 27: Abstract
• To precisely model the time dependency of features, segmental unit input HMM with a dimensionality reduction method has been widely used for speech recognition. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are popular approaches to reduce the dimensionality. We have proposed another dimensionality reduction method called power linear discriminant analysis (PLDA) to select the best dimensionality reduction method that yields the highest recognition performance.
This selection process, based on trial and error, requires much time to train HMMs and to test the recognition performance of each dimensionality reduction method.
• In this paper we propose a performance comparison method that requires neither training nor testing. We show that the proposed method, using the Chernoff bound, can rapidly and accurately evaluate relative recognition performance.

Slide 28: Performance Comparison Method
• Instead of the recognition error, the class separability error of the features in the projected space is used as the criterion to estimate the parameter m of PLDA.

Slide 29: Performance Comparison Method (cont.)
• Two-class problem:
  – Bayes error of the projected features on evaluation data:
$$ \varepsilon = \int \min \bigl[ P_1 p_1(x),\, P_2 p_2(x) \bigr]\, dx, $$
where $P_i$ is the prior probability of class i and $p_i(x)$ is the conditional density function of class i.
  – The Bayes error ε can represent the classification error, assuming that the training data and the evaluation data come from the same distributions.
  – But the Bayes error is hard to measure directly.

Slide 30: Performance Comparison Method (cont.)
• Two-class problem (cont.):
  – Instead, the Chernoff bound between class 1 and class 2 is used as the class separability error:
$$ \varepsilon_u^{1,2} = P_1^s P_2^{1-s} \int p_1^s(x)\, p_2^{1-s}(x)\, dx \qquad \text{for } 0 \le s \le 1, $$
where $\varepsilon_u$ is an upper bound of ε; s = 0.5 gives the Bhattacharyya bound.
  – For normal densities the bound can be rewritten as $\varepsilon_u^{1,2} = P_1^s P_2^{1-s} \exp(-\mu_{1,2}(s))$, where
$$ \mu_{1,2}(s) = \frac{s(1-s)}{2} (\mu_2 - \mu_1)^T \bigl[ s \Sigma_1 + (1-s) \Sigma_2 \bigr]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\bigl| s \Sigma_1 + (1-s) \Sigma_2 \bigr|}{|\Sigma_1|^s\, |\Sigma_2|^{1-s}}. $$
  – The covariance matrices are treated as diagonal here.

Slide 31: Performance Comparison Method (cont.)
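The two-class bound above can be sketched directly from the formula (an illustrative numpy sketch; `chernoff_bound` is a name chosen here, and it accepts full covariance matrices, though the paper restricts them to diagonal ones):

```python
import numpy as np

def chernoff_bound(P1, mu1, S1, P2, mu2, S2, s=0.5):
    """Upper bound P1^s P2^(1-s) exp(-mu_12(s)) on the two-class Bayes error.

    mu_12(s) combines a Mahalanobis-style mean-separation term and a
    covariance log-determinant term; s = 0.5 gives the Bhattacharyya bound.
    """
    Ss = s * S1 + (1 - s) * S2
    d = mu2 - mu1
    quad = 0.5 * s * (1 - s) * d @ np.linalg.solve(Ss, d)
    logdet = 0.5 * (np.linalg.slogdet(Ss)[1]
                    - s * np.linalg.slogdet(S1)[1]
                    - (1 - s) * np.linalg.slogdet(S2)[1])
    return P1 ** s * P2 ** (1 - s) * np.exp(-(quad + logdet))
```

With identical classes and equal priors the bound is 0.5 at s = 0.5 (no separability), and it shrinks as the class means move apart.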
Slide 32: Performance Comparison Method (cont.)
• Multi-class problem:
  – Several error functions can be defined for multi-class data:
$$ \tilde{\varepsilon}_u = \sum_{i=1}^{c} \sum_{j=1}^{c} I(i, j)\, \varepsilon_u^{i,j}, \qquad I(\cdot, \cdot) \text{ an indicator function.} $$
  – Sum of pairwise approximated errors:
$$ I(i, j) = \begin{cases} 1, & j > i, \\ 0, & \text{otherwise.} \end{cases} $$
  – Maximum pairwise approximated error:
$$ I(i, j) = \begin{cases} 1, & j > i \text{ and } (i, j) = (\hat{i}, \hat{j}), \\ 0, & \text{otherwise,} \end{cases} $$
where $(\hat{i}, \hat{j})$ is the pair with the largest $\varepsilon_u^{i,j}$.

Slide 33: Performance Comparison Method (cont.)
• Multi-class problem (cont.):
  – Sum of the maximum approximated errors in each class:
$$ I(i, j) = \begin{cases} 1, & j = \hat{j}_i, \\ 0, & \text{otherwise,} \end{cases} \qquad \hat{j}_i = \arg\max_{j \ne i} \varepsilon_u^{i,j}. $$

Slide 34: Experimental Results

Slide 35: Experimental Results (cont.)

Slide 36: Experimental Results (cont.)
• No comparison method could predict the best dimensionality reduction method simultaneously for both evaluation sets.
  – This presumably results from neglecting the time information of the speech feature sequences when measuring the class separability error, and from modeling each class distribution as a unimodal normal distribution.
• Computational costs

Slide 37: P.S.
• The experimental results do not explicitly explain the relationship between WER and class separability error for a given m; that is, a better class separability error cannot explicitly guarantee a better WER. (The authors say they "agree well.")
• In the experiments, the authors do not explain the differences among the three criteria for calculating the approximated errors.
• Still, this is a good attempt to take something out of the black box (WERs).
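The three multi-class criteria on slides 32–33, whose differences the P.S. notes are never compared in the paper, can be sketched over a precomputed symmetric matrix of pairwise bounds ε_u^{i,j} (illustrative numpy helpers; the function names are invented here):

```python
import numpy as np

def sum_pairwise(eps):
    """Sum of pairwise approximated errors: sum over all pairs i < j."""
    return float(np.triu(eps, k=1).sum())

def max_pairwise(eps):
    """Maximum pairwise approximated error: the single worst pair."""
    return float(np.triu(eps, k=1).max())

def sum_class_maxima(eps):
    """Sum over classes of the worst pairwise error involving each class."""
    e = eps.astype(float).copy()
    np.fill_diagonal(e, -np.inf)     # a class is never paired with itself
    return float(e.max(axis=1).sum())

# 3-class example: eps[i, j] is the approximated error between classes i, j.
eps = np.array([[0.0, 0.1, 0.3],
                [0.1, 0.0, 0.2],
                [0.3, 0.2, 0.0]])
print(sum_pairwise(eps), max_pairwise(eps), sum_class_maxima(eps))
```

Note how the three criteria rank data sets differently: the first aggregates all confusions, the second only the worst pair, the third the worst confusion per class.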
