(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Comparative Analysis of Speaker Identification using row mean of DFT, DCT, DST and Walsh Transforms

Dr. H B Kekre
Senior Professor, Computer Department, MPSTME, NMIMS University, Mumbai, India
hbkekre@yahoo.com

Vaishali Kulkarni
Associate Professor, Electronics & Telecommunication, MPSTME, NMIMS University, Mumbai, India
Vaishalikulkarni6@yahoo.com

Abstract - In this paper we propose speaker identification using four different transform techniques. The feature vectors are the row means of the transforms for different groupings. Experiments were performed on the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). Each transform reaches a maximum accuracy of more than 80% over the groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards, but for groupings of more than 1024 samples the accuracy starts decreasing again. The results show that DST performs best. The maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.

Keywords - Euclidean distance, Row mean, Speaker Identification, Speaker Recognition

I. INTRODUCTION

Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3 - 5]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. In a text-independent system, however, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.

Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, for giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7 - 10].

Although many new techniques have been developed, widespread deployment of applications and services is still not possible. None of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde Buzo Gray), KFCG (Kekre’s Fast Codebook Generation) and KMCG (Kekre’s Median Codebook Generation) algorithms [11], [12], [13] and in the transform domain using DFT, DCT and DST [14].

The concept of row mean of the transform techniques has been used for content based image retrieval (CBIR) [15 – 18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19]. For the purposes of this paper, we consider a speaker identification system that is text-dependent. For identification, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is shown in Figure 1. Here a speech signal of 15 samples is divided into 3 blocks of 5 each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken, and this forms the column vector of means.

The rest of the paper is organized as follows: Section II describes the transform techniques, Section III explains feature extraction, Section IV presents feature matching and the results, and Section V gives the conclusion.

II. TRANSFORM TECHNIQUES

A. Discrete Fourier Transform

Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by y(t), then the DFT of the time series or samples y0, y1, y2, ..., yN-1 is defined as:

$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-2j\pi kn/N} \qquad (1)$$

Where y_n = y_s(nΔt); k = 0, 1, 2, ..., N-1. Δt is the sampling interval.

Figure 1. Row Mean Generation Technique. A speech signal of 15 samples (1 × 15) is divided into blocks of 5 and converted into a 5 × 3 matrix whose columns are samples 1 to 5, 6 to 10 and 11 to 15. The transform T gives a 5 × 3 transform matrix, and the mean of each row gives the row mean vector C1 to C5 (1 × 5).

B. Discrete Cosine Transform

A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. For a sequence x(n), n = 1, ..., N:

$$y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi (2n-1)(k-1)}{2N}\right) \qquad (2)$$

$$w(k) = \begin{cases} \dfrac{1}{\sqrt{N}}, & k = 1 \\[2mm] \sqrt{\dfrac{2}{N}}, & 2 \le k \le N \end{cases}$$

Where y(k) is the cosine transform, k = 1, ..., N.

The DCT is closely related to the discrete Fourier transform. You can often reconstruct a sequence very accurately from only a few DCT coefficients, a useful property for applications requiring data reduction [20 – 22].

C. Discrete Sine Transform

A discrete sine transform (DST) expresses a sequence of finitely many data points in terms of a sum of sine functions. For a sequence x(n), n = 1, ..., N:

$$y(k) = \sum_{n=1}^{N} x(n) \sin\left(\frac{\pi k n}{N+1}\right) \qquad (3)$$

Where y(k) is the sine transform, k = 1, ..., N.

D. Walsh Transform

The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or –1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and for spread-spectrum analysis. Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT.

The Hadamard-ordered transform, FWHTh, is a divide and conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix H_N:

$$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix}, \qquad H_1 = 1 \qquad (4)$$

The normalization factors for each stage may be grouped together or even omitted. The sequency-ordered, also known as Walsh-ordered, fast Walsh–Hadamard transform, FWHTw, is obtained by computing the FWHTh as above and then rearranging the outputs [23].

III. FEATURE EXTRACTION

The procedure for feature vector extraction is given below:

1. The speech signal is divided into groups of n samples, where n can take the values 64, 128, 256, 512, 1024, 2048 and 4096.
2. These blocks are then arranged as columns of a matrix, and the different transforms given in Section II are taken.
3. The mean of the absolute values of each row of the transform matrix is then calculated.
4. These row means form a column vector (n × 1, where n is the number of rows in the transform matrix).
5. This column vector forms the feature vector for the speech sample.
6. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.

Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for the other speech signals were also calculated. This process was repeated for all values of n. As can be seen from Figure 2, the 64 mean values form a 64-element feature vector.

IV. RESULTS

A. Basics of speech signal

The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that a training model and testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database, and the samples recorded in the second session are used for testing.
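The row-mean feature extraction of Section III, and the minimum-Euclidean-distance comparison between stored and test feature vectors, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`row_mean_features`, `identify`) are hypothetical, the transform is assumed to be applied to each column (block) independently, a trailing partial block is assumed to be dropped, and the DCT and DST use one common definition each (orthonormal DCT-II and unnormalized DST-I). Only the standard library is used, so the transforms are direct O(n²) sums rather than fast algorithms.

```python
import cmath
import math


def dft(x):
    """DFT: Y_k = sum_n x_n * exp(-2j*pi*k*n/N), k = 0..N-1."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dct(x):
    """Orthonormal DCT-II (an assumed form of the cosine transform)."""
    n = len(x)
    out = []
    for k in range(n):
        w = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(w * sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                           for t in range(n)))
    return out


def dst(x):
    """Unnormalized DST-I (an assumed form of the sine transform)."""
    n = len(x)
    return [sum(x[t] * math.sin(math.pi * (k + 1) * (t + 1) / (n + 1))
                for t in range(n))
            for k in range(n)]


def fwht(x):
    """Fast Walsh-Hadamard transform, Hadamard order, normalization omitted.

    len(x) must be a power of two.
    """
    a = list(x)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def row_mean_features(signal, n, transform):
    """Split the signal into blocks of n samples (the matrix columns),
    transform each block, and average |coefficient| across blocks per row."""
    # Assumption: samples left over after the last full block are ignored.
    blocks = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    spectra = [transform(b) for b in blocks]
    return [sum(abs(s[r]) for s in spectra) / len(spectra) for r in range(n)]


def identify(test_vector, database):
    """Return the key of the stored feature vector closest to test_vector
    in Euclidean distance. `database` maps speaker label -> feature vector."""
    return min(database, key=lambda spk: math.dist(test_vector, database[spk]))


if __name__ == "__main__":
    # Toy case of Figure 1: 15 samples, 3 blocks of 5, DFT of each block.
    signal = [float(v) for v in range(1, 16)]
    print(row_mean_features(signal, 5, dft))
```

Passing `dct`, `dst` or `fwht` instead of `dft` gives the other three feature vectors; for `fwht` the grouping size n must be a power of two, which holds for all the groupings used in the paper.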
Figure 2. Row Mean Generation for a grouping of 64 samples for one of the speech signals: (A) DFT, (B) DCT, (C) DST, (D) Walsh. Each panel plots amplitude against the mean of the absolute value for each row of the transform matrix.

TABLE I. DATABASE DESCRIPTION

Parameter              Sample characteristics
Language               English
No. of Speakers        105
Speech type            Read speech
Recording conditions   Normal (a silent room)
Sampling frequency     8000 Hz
Resolution             8 bps

B. Experimental Results

The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is taken and processed in the same way as in the training phase to form the feature vector. The stored feature vector which gives the minimum Euclidean distance to the input sample feature vector is declared as the speaker identified.

Table II gives the number of matches for the four different transforms. The matching has been calculated by considering the minimum Euclidean distance between the feature vector of the test speech signal and the feature vectors of the speech signals stored in the database. The rows of Table II show the number of samples of each speech signal grouped together to form the columns of a matrix whose transform is then taken. For each grouping, the transform which gives the maximum number of matches is marked with an asterisk (*). We can see that for groupings of 64, 128 and 256, DST gives the best matching, i.e. 86, 98 and 99 (out of 105) respectively. For a grouping of 512, DCT gives the best matching, i.e. 99. For a grouping of 1024 samples, DST gives the maximum matches, i.e. 101. It can also be seen that as the number of samples grouped is further increased beyond 1024, the number of matches generally decreases for all the transforms.

TABLE II. NO. OF MATCHES FOR DIFFERENT GROUPINGS

No. of samples    Number of matches (out of 105)
grouped           FFT     DCT     DST     WALSH
64                78      85      86*     76
128               87      92      98*     79
256               96      98      99*     82
512               97      99*     98      85
1024              100     97      101*    89
2048              100*    96      97      85
4096              98      96      99*     83
8192              96*     90      90      67

C. Accuracy of Identification

The accuracy of the identification system is calculated as given by equation 5.

$$\text{Accuracy (\%)} = \frac{\text{Number of matches}}{\text{Total number of test samples}} \times 100 \qquad (5)$$

With 105 test samples, for example, 101 matches correspond to an accuracy of 96.1905%. The accuracy for the different groupings of the four transforms was calculated and is shown in Figure 3.

Figure 3. Accuracy for the four transforms by varying the groupings of samples

The results show that the accuracy increases as we increase the feature vector size from 64 to 512 for the transforms. Only for DST does the accuracy decrease, from 94.28% to 93.33%, as we increase the feature vector size from 256 to 512. The feature vector size of 1024 gives the best result for all the transforms except DCT. For DCT, the best result is obtained for a feature vector size of 512. For DFT, the maximum accuracy obtained is 95.2381% for a feature vector size of 1024. The Walsh transform gives a maximum accuracy of 84.7619%. DST performs best, giving a maximum accuracy of 96.1905% for a feature vector size of 1024.

V. CONCLUSION

In this paper we have compared the performance of four different transforms for speaker identification. Each transform reaches a maximum accuracy of more than 80% over the feature vector sizes considered. Accuracy increases as the feature vector size is increased from 64 onwards, but for feature vector sizes of more than 1024 the accuracy starts decreasing again. The results show that DST performs best. The maximum accuracy obtained for DST is around 96% for a feature vector size of 1024. The present study is ongoing and we are analyzing the performance on other transforms.

REFERENCES

[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, “Fundamental of Speech Recognition”, Prentice-Hall, Englewood Cliffs, 2009.
[2] S Furui, “50 years of progress in speech and speaker recognition research”, ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, “An overview of automatic speaker recognition technology,” Proc. IEEE Int. Conf. Acoust., Speech,S
[4] Joseph P. Campbell, Jr., Senior Member, IEEE, “Speaker Recognition: A Tutorial”, Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[5] S. Furui, “Recent advances in speaker recognition”, AVBPA97, pp. 237-251, 1997.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430–451, 2004.
[7] D. A. Reynolds, “Experimental evaluation of features for robust speaker identification,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639–643, Oct. 1994.
[8] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, “Realtime Speaker Identification”, ICSLP2004.
[9] Marco Grimaldi and Fred Cummins, “Speaker Identification using Instantaneous Frequencies”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan, Yuan & Bo-Ling, Xu & Chong-Zhi, Yu. (1999). “Binary Quantization of Feature vectors for robust text-independent Speaker Identification” in IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[11] H B Kekre, Vaishali Kulkarni, “Speaker Identification by using Vector Quantization”, International Journal of Engineering Science and Technology, May 2010.
[12] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG”, International Journal of Computer Applications, vol. 3, July 2010.
[13] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG KFCG and KMCG”, International Journal of Computer Science and Security, Vol: 4 Issue: 4, 2010.
[14] H B Kekre, Vaishali Kulkarni, “Comparative Analysis of Automatic Speaker Recognition using Kekre’s Fast Codebook Generation Algorithm in Time Domain and Transform Domain”, International Journal of Computer Applications, Volume 7 No. 1, September 2010.
[15] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Performance Comparision of Image Retrieval using Row Mean of Transformed Column Image”, International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, 1908-1912.
[16] Dr. H. B. Kekre, Sudeep Thepade, “Edge Texture Based CBIR using Row Mean of Transformed Column Gradient Image”, International Journal of Computer Applications (0975 – 8887), Volume 7, No. 10, October 2010.
[17] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Eigenvectors of Covariance Matrix using Row Mean and Column Mean Sequences for Face Recognition”, International Journal of Biometrics and Bioinformatics (IJBB), Volume (4): Issue (2).
[18] Dr. H. B. Kekre, Sudeep Thepade, Archana Athawale, “Grayscale Image Retrieval using DCT on Row mean, Column mean and Combination”, Journal of Sci., Engg. & Tech. Mgt., Vol 2 (1), January 2010.
[19] Dr. H. B. Kekre, Dr. T. K. Sarode, Shachi J. Natu, Prachi J. Natu, “Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram”, International Journal of Computer Applications (0975 – 8887), Volume 5, No. 6, August 2010.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform”, IEEE Trans. Computers, 90-93, Jan 1974.
[21] N. Ahmed, “How I came up with the Discrete Cosine Transform”, Digital Signal Processing, Vol. 1, 1991.
[22] G. Strang, “The Discrete Cosine Transform,” SIAM Review, Volume 41, Number 1, 1999.
[23] Fino, B.J., and Algazi, V.R., 1976, “Unified Matrix Treatment of the Fast Walsh–Hadamard Transform,” IEEE Transactions on Computers 25: 1142–1146.

AUTHORS PROFILE

Dr. H. B. Kekre received his B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He worked for over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in National / International Conferences / Journals to his credit. Recently twelve students working under his guidance have received best paper awards, and two research scholars have received Ph.D. degrees from NMIMS University. Currently he is guiding ten Ph.D. students.
He is a member of ISTE and IETE.

Vaishali Kulkarni received her B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. at NMIMS University. She has more than 8 years of teaching experience. She is Associate Professor in the Telecom Department at MPSTME, NMIMS University. Her areas of interest include speech processing: speech and speaker recognition. She has 8 papers in National / International Conferences / Journals to her credit.