(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 3, 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Performance Comparison of Speaker Identification using circular DFT and WHT Sectors

Dr. H B Kekre (1), Vaishali Kulkarni (2), Indraneal Balasubramanian (3), Abhimanyu Gehlot (4), Rasik Srinath (5)
(1) Senior Professor, Computer Dept., MPSTME, NMIMS University. hbkekre@yahoo.com
(2) Associate Professor, EXTC Dept., MPSTME, NMIMS University. Vaishalikulkarni6@yahoo.com
(3), (4), (5) Students, B-Tech EXTC, MPSTME, NMIMS University. indraneal89@gmail.com, abhimanyu13090@gmail.com, rasik90@gmail.com

Abstract - In this paper we aim to provide a unique approach to text dependent speaker identification using transform techniques such as DFT (Discrete Fourier Transform) and WHT (Walsh Hadamard Transform). In the first method, the feature vectors are extracted by dividing the complex DFT spectrum into circular sectors and then taking the weighted density count of the number of points in each of these sectors. In the second method, the feature vectors are extracted by dividing the WHT spectrum into circular sectors and then again taking the weighted density count of the number of points in each of these sectors. Comparison of the two transforms shows that the accuracy obtained for DFT (80%) is higher than that obtained for WHT (66%).

Keywords - Speaker identification; Circular Sectors; weighted density; Euclidean distance

I. INTRODUCTION

Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1 - 4]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification [2 - 5]. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [7]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [8]. Speaker identification can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. In a text-independent system, however, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system [4, 5, 8].

Speaker recognition systems find use in a multitude of applications today, including automated call processing in telephone networks as well as query systems such as stock information, weather reports etc. However, difficulties in wide deployment of such systems are a practical limitation that is yet to be overcome [2, 6, 7, 9, 10]. We have proposed speaker identification using power distribution in the frequency domain [11], [12]. We have also proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde Buzo Gray), KFCG (Kekre's Fast Codebook Generation) and KMCG (Kekre's Median Codebook Generation) algorithms [13 - 15], and in the transform domain using DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) and DST (Discrete Sine Transform) [16].

The concept of sectorization has been used for content based image retrieval (CBIR) [17] - [21]. We have proposed speaker identification using circular DFT sectors [22]. In this paper, we propose speaker identification using WHT (Walsh Hadamard Transform), and also compare the results with DFT sectors. Fig. 1 shows how a basic speaker identification system operates. A number of speech samples are collected from a variety of speakers, and their features are extracted and stored as reference models in a database. When a speaker is to be identified, the features of his speech are extracted and compared with all of the reference speaker models. The reference model which gives the minimum Euclidean distance to the feature vector of the person to be identified is the maximum likelihood model and is declared as the person identified.

[Figure 1. Speaker Identification System]

A. Discrete Fourier Transform (DFT)

The DFT transforms time or space based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate [23, 24]. If the speech signal is represented by y(t), then the DFT of the time series or samples $y_0, y_1, y_2, \ldots, y_{N-1}$ is defined as given by (1):

$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-2j\pi kn/N} \qquad (1)$$

where $y_n = y_s(n\Delta t)$; $k = 0, 1, 2, \ldots, N-1$; and $\Delta t$ is the sampling interval.

B. Walsh Hadamard Transform

The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and for spread-spectrum analysis [25]. Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. FWHT_h is a divide and conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix $H_N$ given by (2):

$$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix} \qquad (2)$$

The normalization factors for each stage may be grouped together or even omitted. The sequency ordered, also known as Walsh ordered, fast Walsh–Hadamard transform, FWHT_w, is obtained by computing the FWHT_h as above, and then rearranging the outputs.

The rest of the paper is organized as follows: Section II explains the sectorization process, Section III explains the feature extraction using the density of the samples in each of the sectors, Section IV deals with feature matching and the results, and Section V gives the conclusion.

II. SECTORIZATION OF THE COMPLEX TRANSFORM PLANES

The speech signal has an amplitude range from -1 to +1. It is first converted into positive values by adding +1 to all the sample values, so that the amplitude range of the speech signal becomes 0 to 2. For sectorization two methods are used, which are described below.

A. DFT Sectorization

The algorithm for DFT sectorization is given below:

1. The DFT of the speech signal is computed. Since the DFT is symmetrical, only half of the number of points in the DFT is considered while drawing the complex DFT plane (i.e. Yreal vs. Yimag).

2. The first point in the DFT is a real number, so it is considered separately while taking feature vectors. The complex plane therefore covers only points 2 to N/2, where N is the number of points in the DFT. Fig. 2 shows the original speech signal and its complex DFT plane for one of the samples in the database.

3. For dividing the complex plane into sectors, the magnitude of the DFT is considered as the radius of the circular sector as in (3):

$$R = \left| \sqrt{Y_{real}^2 + Y_{imag}^2} \right| \qquad (3)$$

4. Table I shows the range of the radius taken for dividing the DFT plane into circular sectors.

5. The maximum range of the radius for forming the sectors was found by experimenting on the different samples in the database. Various combinations of the range were tried and the values given in Table I were found to be satisfactory. Fig. 3 shows the seven sectors formed for the complex plane shown in Fig. 2. Different colours have been used to show the different sectors.

6. The seven circular sectors were further divided into four quadrants each as given by Table II. Thus we get 28 sectors for each of the samples. Fig. 4 shows the 28 sectors formed for the sample shown in Fig. 2.

[Figure 2. Speech signal and its complex DFT plane. Top panel: Amplitude vs. No. of samples (x 10^4); bottom panel: Ximag vs. Xreal]

TABLE I. RADIUS RANGE OF THE CIRCULAR SECTORS

Sr. No.   Radius range     Sector    Weighing factor
1         0 ≤ R ≤ 4        Sector1   2/256
2         4 ≤ R ≤ 8        Sector2   6/256
3         8 ≤ R ≤ 16       Sector3   12/256
4         16 ≤ R ≤ 32      Sector4   24/256
5         32 ≤ R ≤ 64      Sector5   48/256
6         64 ≤ R ≤ 128     Sector6   96/256
7         128 ≤ R ≤ 256    Sector7   192/256

[Figure 4. Sectorization of DFT plane into 28 sectors for the speech sample shown in Fig. 2]
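The radius-based part of DFT sectorization (steps 1 to 5 and Table I) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a naive DFT stands in for an FFT routine, a radius that falls on a boundary shared by two rows of Table I is assigned to the lower sector, and points with a radius above 256 are ignored.

```python
import cmath

# Upper radius bounds of the seven circular sectors, taken from Table I.
SECTOR_EDGES = [4, 8, 16, 32, 64, 128, 256]


def dft(samples):
    """Naive O(N^2) DFT following (1); adequate for short illustrative inputs."""
    n = len(samples)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * i / n) for i, y in enumerate(samples))
        for k in range(n)
    ]


def sector_index(radius):
    """Return the 0-based sector for a radius, or None if it exceeds 256.

    Assumption: a radius on a shared boundary (e.g. R = 4) goes to the
    lower sector, because Table I lists both neighbouring ranges as inclusive.
    """
    for idx, upper in enumerate(SECTOR_EDGES):
        if radius <= upper:
            return idx
    return None


def dft_sector_counts(signal):
    """Return (dc value, per-sector point counts) for a signal in [-1, 1]."""
    shifted = [s + 1.0 for s in signal]       # map the range [-1, 1] to [0, 2]
    spectrum = dft(shifted)
    half = spectrum[1 : len(spectrum) // 2]   # points 2 .. N/2; the first point is kept apart
    counts = [0] * len(SECTOR_EDGES)
    for point in half:
        idx = sector_index(abs(point))        # abs() is sqrt(real^2 + imag^2), as in (3)
        if idx is not None:
            counts[idx] += 1
    return spectrum[0].real, counts
```

Calling `dft_sector_counts` on a list of samples returns the real first DFT point together with seven counts, one per circular sector. Splitting each count by the sign of the real and imaginary parts would give the 28-sector variant described in step 6.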
[Figure 3. Circular Sectors of the complex DFT plane of the speech sample shown in Fig. 2]

TABLE II. DIVISION INTO FOUR QUADRANTS

Sr. No.   Value                    Quadrant
1         Xreal ≥ 0 & Ximag ≥ 0    1 (0° - 90°)
2         Xreal ≤ 0 & Ximag ≥ 0    2 (90° - 180°)
3         Xreal ≤ 0 & Ximag ≤ 0    3 (180° - 270°)
4         Xreal ≥ 0 & Ximag ≤ 0    4 (270° - 360°)

B. WHT Sectorization

The algorithm for Walsh sectorization is given below:

1. The WHT of the speech signal is taken using FWHT (Fast Walsh Hadamard Transform).

2. The WHT can be represented as (C0, S0, C1, S1, C2, S2, ..., CN-1, SN-1), where C represents a Cal term and S represents a Sal term.

3. The Walsh transform matrix is real, but by multiplying all Sal components by j it can be made complex. The first term, C0, represents the dc value. So the complex plane is formed by combining S0 with C1, S1 with C2 and so on. In this case SN-1 is left out. Thus C0 and SN-1 are considered separately.

4. The complex Walsh transform is then divided into circular sectors using the radius given by (4). Again the radial sectors are formed using the radius ranges shown in Table I.

$$R = \left| \sqrt{Y_{cal}^2 + Y_{sal}^2} \right| \qquad (4)$$

5. The seven circular sectors were further divided into four quadrants as explained in (A) by using Table II. Thus we get 28 sectors for each of the samples.

III. FEATURE VECTOR EXTRACTION

For feature vector generation, the count of the number of points in each of the sectors is first taken. Then the feature vector component is calculated for each of the sectors according to (5).

Feature vector = ((count/n1) * weighing factor) * 10000    (5)

For DFT, the first value, i.e. the dc component, is accounted for as in (6). For WHT, C0 is accounted for as given by (6) and SN-1 is considered as given by (7).

First term = sqrt(abs(first value of DFT/WHT))    (6)

Last term = sqrt(abs(last value of FWHT))    (7)

Overall, when the seven circular sectors are considered, there are eight components in the feature vector for DFT (one per sector and the first term) and nine components in the feature vector for WHT (one per sector, the first term and the last term). When 28 sectors are considered, there are 29 components in the feature vector for DFT (one per sector and the first term) and 30 components for WHT (one per sector, the first term and the last term).

IV. RESULTS

A. Database description

The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table III shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that the training model and the testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.

TABLE III. DATABASE DESCRIPTION

Parameter              Sample characteristics
Language               English
No. of Speakers        30
Speech type            Read speech
Recording conditions   Normal (a silent room)
Sampling frequency     8000 Hz
Resolution             8 bps

B. Experimentation

This algorithm was tested for text dependent speaker identification. Feature vectors for both the methods described in Section II were calculated as shown in Section III. For testing, the test sample is similarly processed and its feature vector is calculated. For recognition, the Euclidean distance between the features of the test sample and the features of all the samples stored in the database is computed. The sample in the database for which the Euclidean distance is minimum is declared as the speaker recognized.

C. Accuracy of Identification

The accuracy of the identification system is calculated as given by equation 5.

[Equation (5): accuracy of identification; equation body missing in source]

Fig. 5 shows the results obtained for DFT sectorization. As seen from the results, when the complex DFT plane is divided into seven sectors, the maximum accuracy is around 80%, and it decreases as the number of samples in the database is increased (64% for 30 samples). Accuracy increases when the number of sectors into which the complex DFT plane is divided is increased from 7 to 28. With 28 sectors, the maximum accuracy is 80% up to 20 samples, after which it decreases. When the complex plane is further divided into 56 sectors, there is an improvement in accuracy for a smaller number of samples, but as the number of samples is increased the performance is similar to that with 28 sectors. Fig. 6 shows the results obtained for WHT sectorization. Here also we see that accuracy improves as the number of sectors is increased from 7 to 28, but further division into 56 sectors does not give any advantage. Overall the results obtained for DFT are better than those obtained for WHT.

[Figure 5. Accuracy for DFT Sectorization]

[Figure 6. Accuracy for WHT Sectorization]

V. CONCLUSION

Speaker identification using the concept of sectorization has been proposed in this paper. The complex DFT and WHT planes have been divided into circular sectors and feature vectors have been calculated using weighted density. Accuracy increases when the 7 circular sectors are divided into 28 sectors for both the transform techniques, but there is no significant improvement when the complex plane is further divided. The results also show that the performance of DFT is better than that of WHT.

REFERENCES

[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, "Fundamental of Speech Recognition", Prentice-Hall, Englewood Cliffs, 2009.
[2] S. Furui, "50 years of progress in speech and speaker recognition research", ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, "An overview of automatic speaker recognition technology," Proc. IEEE Int. Conf. Acoust., Speech,S on Speech and Audio Processing, Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[4] S. Furui, "Recent advances in speaker recognition", AVBPA97, pp. 237-251, 1997.
[5] J. P. Campbell, "Speaker recognition: A tutorial," Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.
[6] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[7] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, "Realtime Speaker Identification", ICSLP2004.
[8] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430-451, 2004.
[9] Marco Grimaldi and Fred Cummins, "Speaker Identification using Instantaneous Frequencies", IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan, Yuan & Bo-Ling, Xu & Chong-Zhi, Yu. (1999). "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification" in IEEE Transactions.
[11] Dr. H B Kekre, Vaishali Kulkarni, "Speaker Identification using Power Distribution in Frequency Spectrum", Technopath, Journal of Science, Engineering & Technology Management, Vol. 02, No. 1, January 2010.
[12] Dr. H B Kekre, Vaishali Kulkarni, "Speaker Identification by using Power Distribution in Frequency Spectrum", ThinkQuest - 2010 International Conference on Contours of Computing Technology, BGIT, Mumbai, 13th-14th March 2010.
[13] H B Kekre, Vaishali Kulkarni, "Speaker Identification by using Vector Quantization", International Journal of Engineering Science and Technology, May 2010.
[14] H B Kekre, Vaishali Kulkarni, "Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG", International Journal of Computer Applications, vol. 3, July 2010.
[15] H B Kekre, Vaishali Kulkarni, "Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG KFCG and KMCG", International Journal of Computer Science and Security, Vol. 4, Issue 5, 2010.
[16] H B Kekre, Vaishali Kulkarni, "Comparative Analysis of Automatic Speaker Recognition using Kekre's Fast Codebook Generation Algorithm in Time Domain and Transform Domain", International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[17] H B Kekre, Dhirendra Mishra, "Performance Comparison of Density Distribution and Sector mean of sal and cal functions in Walsh Transform Sectors as Feature Vectors for Image Retrieval", International Journal of Image Processing, Volume 4, Issue 3, 2010.
[18] H B Kekre, Dhirendra Mishra, "CBIR using Upper Six FFT Sectors of Color Images for Feature Vector Generation", International Journal of Engineering and Technology, Volume 2(2), 2010.
[19] H B Kekre, Dhirendra Mishra, "Performance Comparison of Four, Eight & Twelve Walsh Transform Sectors Feature Vectors for Image Retrieval from Image Databases", International Journal of Engineering Science and Technology, Volume 2(5), 2010.
[20] H B Kekre, Dhirendra Mishra, "Four Walsh Transform Sectors Feature Vectors for Image Retrieval from Image Databases", International Journal of Computer Science and Information Technologies, Volume 1(2), 2010.
[21] H B Kekre, Dhirendra Mishra, "Digital Image Search & Retrieval using FFT Sectors of Color Images", International Journal of Computer Science and Engineering, Volume 2, No. 2, 2010.
[22] H B Kekre, Vaishali Kulkarni, "Automatic Speaker Recognition using circular DFT Sector", International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), 25-26 February, 2011.
[23] G. D. Bergland, "A Guided Tour of the Fast Fourier Transform", IEEE Spectrum 6, pp. 41-52, July 1969.
[24] J. S. Walker, Fast Fourier Transform, 2nd ed. Boca Raton, FL: CRC Press, 1996.
[25] Terry Ritter, Walsh-Hadamard Transforms: A Literature Survey, Aug. 1996.

AUTHORS PROFILE

Dr. H. B. Kekre received his B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He worked for over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM's NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in National / International Conferences / Journals to his credit. Recently twelve students working under his guidance have received best paper awards, and two research scholars have received the Ph.D. degree from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.

Vaishali Kulkarni received her B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. from NMIMS University. She has teaching experience of more than 8 years. She is Associate Professor in the Telecom Department in MPSTME, NMIMS University. Her areas of interest include speech processing: speech and speaker recognition. She has 10 papers in National / International Conferences / Journals to her credit.
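Returning to the feature extraction of Section III and the matching step of Section IV.B, the computation for the seven-sector case can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, n1 in (5) is assumed to be the total number of counted points (the paper does not define it), and the database is simplified to one stored feature vector per label.

```python
import math

# Weighing factors of the seven circular sectors, from Table I.
WEIGHTS = [2 / 256, 6 / 256, 12 / 256, 24 / 256, 48 / 256, 96 / 256, 192 / 256]


def feature_vector(counts, first_value, last_value=None):
    """Build a feature vector from sector counts using (5), (6) and (7).

    Assumption: n1 in (5) is the total number of counted points.
    """
    n1 = sum(counts)
    if n1 == 0:
        raise ValueError("no points were counted in any sector")
    features = [(c / n1) * w * 10000 for c, w in zip(counts, WEIGHTS)]
    features.append(math.sqrt(abs(first_value)))      # first term, (6)
    if last_value is not None:                        # last term, (7); WHT only
        features.append(math.sqrt(abs(last_value)))
    return features


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(test_features, database):
    """Return the label whose stored feature vector is nearest to the test vector."""
    return min(database, key=lambda label: euclidean(test_features, database[label]))
```

With seven sector counts and the first transform value, `feature_vector` returns eight components (the DFT case); passing `last_value` as well gives nine (the WHT case). `identify` then implements the minimum-distance rule over a dictionary that maps labels to stored vectors.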