International Journal of Electronics and Communication Engineering & Technology (IJECET), ISSN 0976-6464 (Print), ISSN 0976-6472 (Online), Volume 3, Issue 2, July-September (2012), pp. 87-93, © IAEME: www.iaeme.com/ijecet.html
Journal Impact Factor (2011): 0.8500 (Calculated by GISI), www.jifactor.com

NOVEL APPROACH TO TEXT INDEPENDENT SPEAKER IDENTIFICATION

Pallavi P. Ingale, Department of Electronics and Telecommunications, Dr. Babasaheb Ambedkar Technological University, Lonere, INDIA. pallavi.ingale@rediffmail.com
Dr. S.L. Nalbalwar, Department of Electronics and Telecommunications, Dr. Babasaheb Ambedkar Technological University, Lonere, INDIA. nalbalwar_sanjayan@yahoo.com

ABSTRACT

In this paper, we propose speaker identification using two transforms, namely Kekre’s Transform and Kekre’s Wavelet Transform. The speech signal spoken by a particular speaker is converted into a spectrogram by first dividing it into frames with 25% or 50% overlap between consecutive frames and arranging these frames in the form of a matrix. To improve performance, the log of this matrix is taken first, and then one of the two transforms is applied to the spectrogram. The resulting transformed matrix forms the feature vector, which is used in both the training and the matching phases of identification. The two transform techniques are compared using the feature vectors obtained. The comparison shows that Kekre’s Transform performs better than Kekre’s Wavelet Transform, with the best results obtained for 50% overlap.

Keywords - Speaker Identification, Kekre’s Transform, Kekre’s Wavelet Transform, Short Time Fourier Transform, Spectrogram, Feature extraction

I. INTRODUCTION

For over six decades, scientists have studied the ability of human listeners to recognize and discriminate voices.
By establishing the factors that convey speaker-dependent information, researchers have been able to improve the naturalness of synthetic and vocoded speech and to assess the reliability of speaker recognition. Soon after the development of digital computers, research on speaker recognition turned to developing objective techniques for automatic speaker recognition, which quickly led to the discovery that simple automatic systems could outperform human listeners on a similar task [1]. Over the last three decades, researchers have developed increasingly sophisticated automatic speaker recognition algorithms, and the performance of these algorithms on more realistic evaluation speech corpora has improved. Today, task-specific speaker-recognition systems are being deployed in large telecommunications applications [2].

Speaker recognition involves two tasks: identification and verification. In identification, the goal is to determine which voice in a known group of voices best matches the speaker. In verification, the goal is to determine whether the speaker is who he or she claims to be. In speaker identification, the unknown voice is assumed to come from a predefined set of known speakers. For this type of classification problem, an N-alternative forced-choice task, errors are defined as misrecognitions, and the difficulty of identification generally increases as the speaker set grows. Speaker verification requires distinguishing a speaker's voice known to the system from a potentially large group of voices unknown to the system [1], [2].

Speaker-recognition tasks are further distinguished by the constraints placed on the text of the speech used in the system. In a text-dependent system, the spoken text used to train and test the system is constrained to be the same word or phrase.
In a text-independent system, training and testing speech is completely unconstrained. This type of system is required for applications that have no control over what a person says [3].

In this paper, we propose a different approach to speaker identification using spectrograms, Kekre’s Transform and Kekre’s Wavelet Transform [5], [6]. Kekre’s Transform and Kekre’s Wavelet Transform are applied to the spectrogram of the speech signal, computed with a specific frame size. The generalized block diagram of the feature extraction method is shown in Fig. 1. As shown in Fig. 1, the reference signals in the database are first converted into their spectrograms. Then the log is taken and one of the two transforms is applied to the spectrogram. The feature vectors are extracted, and models are created using the Gaussian mixture modeling technique of [3], [4]. The test signal to be identified is processed in the same way: its feature vector is extracted and compared with the previously created and stored models. The speaker whose model has the maximum posterior probability for the input feature-vector sequence (the test signal) is declared the identified speaker.

Section 2 describes the process of converting the speech signal into a spectrogram [5]. Kekre’s Transform and Kekre’s Wavelet Transform are explained in Section 3 [6]. Section 4 explains the feature vector extraction. Results are discussed in Section 5 and conclusions are given in Section 6.

II. GENERATION OF SPECTROGRAM

The first step in the speaker identification system is to convert the speech signal into a spectrogram [5]. A spectrogram is a time-varying spectral representation that shows how the spectral density of a signal varies with time. Spectrograms have long been used for speaker identification. They are usually calculated from the time signal using the short-time Fourier transform (STFT), and creating a spectrogram with the STFT is usually a digital process.
Digitally sampled data in the time domain is broken up into chunks, which usually overlap, and each chunk is Fourier transformed to calculate the magnitude of its frequency spectrum. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency at a specific moment in time. These spectra are then laid side by side to form the image or a three-dimensional surface. This is done using the following steps:

1. The speech signal is first divided into frames (here, frames of 256 samples) with an overlap of 50% or 25%.
2. These frames are arranged column-wise to form a matrix. For example, if the speech signal is a one-dimensional signal of 44096×1 samples, we divide it into frames of 256 samples each with 50% overlap between consecutive frames, i.e. an overlap of 128 samples. These frames are then arranged column-wise to form a matrix of dimension 256×344.
3. The Discrete Fourier Transform (DFT) is applied to this matrix column-wise.
4. The spectrogram is then taken as the squared magnitude of this transformed matrix.

Figure 1. Feature extraction method (block diagram: speech signal → framing with overlap and forming the matrix → converting the framed speech signal into a spectrogram (squared magnitude) → log → Kekre’s Transform / Kekre’s Wavelet Transform → cepstral coefficients generated using Kekre’s Transform and Kekre’s Wavelet Transform)

III. KEKRE’S TRANSFORM AND KEKRE’S WAVELET TRANSFORM

Kekre’s Transform matrix (K) can be of any size N×N; unlike most other transforms, including the Haar Transform, its size need not be a power of 2. All diagonal and upper-diagonal values of Kekre’s transform matrix are one, while the lower-diagonal part is zero except for the values just below the diagonal [5], [6].
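As an illustration only, the front end of Fig. 1 can be sketched in NumPy: the four steps of Section II, the log stage, and a column transform with the matrix K described above (its sub-diagonal entries are those given by (2)). The function names, the zero-padding of the last partial frame, the small constant eps that keeps the log finite, and sizing K to the number of spectrogram rows are all our assumptions, not details stated in the paper.

```python
import numpy as np


def frame_matrix(x, frame_len=256, overlap=0.5):
    """Steps 1-2: cut x into overlapping frames and stack them column-wise.

    Assumption (not stated in the paper): the tail of x is zero-padded so the
    last partial frame is kept, which gives 344 columns for 44096 samples.
    """
    x = np.asarray(x, dtype=float)
    hop = int(round(frame_len * (1.0 - overlap)))  # 128 for 50%, 192 for 25%
    n_frames = int(np.ceil(max(len(x) - frame_len, 0) / hop)) + 1
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[: len(x)] = x
    # Column j holds samples j*hop ... j*hop + frame_len - 1.
    idx = np.arange(frame_len)[:, None] + hop * np.arange(n_frames)[None, :]
    return padded[idx]


def spectrogram(x, frame_len=256, overlap=0.5):
    """Steps 3-4: column-wise DFT followed by the squared magnitude."""
    return np.abs(np.fft.fft(frame_matrix(x, frame_len, overlap), axis=0)) ** 2


def kekre_matrix(n):
    """N x N Kekre's transform matrix with the entries of (2)."""
    i, j = np.indices((n, n)) + 1            # 1-based row (x) and column (y)
    k = np.where(i <= j, 1.0, 0.0)           # ones on and above the diagonal
    sub = i == j + 1                         # entries just below the diagonal
    k[sub] = -n + (i[sub] - 1)
    return k


def kekre_features(x, frame_len=256, overlap=0.5, eps=1e-10):
    """Log spectrogram followed by the column transform F = K f of (3).

    Assumptions: K is sized to the number of spectrogram rows, and eps is a
    small hypothetical constant that keeps log() finite on silent bins.
    """
    log_spec = np.log(spectrogram(x, frame_len, overlap) + eps)
    return kekre_matrix(log_spec.shape[0]) @ log_spec
```

For the 44096-sample example of Section II, frame_matrix returns a 256×344 matrix at 50% overlap (230 columns at 25% overlap, with a hop of 192 samples). Section V reports 8×8 and 32×32 transform matrices without spelling out how they are applied to 256-sample columns, so the choice of a 256×256 K here is only one possible reading.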
The generalized N×N Kekre’s Transform matrix is given in (1).

$$
K_{N\times N} = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 & 1 \\
-N+1 & 1 & 1 & \cdots & 1 & 1 \\
0 & -N+2 & 1 & \cdots & 1 & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 1 \\
0 & 0 & 0 & \cdots & -N+(N-1) & 1
\end{bmatrix} \tag{1}
$$

The term $K_{xy}$ of Kekre’s transform matrix is generated by the formula in (2).

$$
K_{xy} = \begin{cases}
1, & x \le y \\
-N + (x-1), & x = y+1 \\
0, & x > y+1
\end{cases} \tag{2}
$$

Kekre’s Wavelet transform matrix (KW) is derived from the N×N Kekre’s Transform matrix [5]. From an N×N Kekre’s transform matrix, we can generate Kekre’s Wavelet transform matrices of size (2N)×(2N), (3N)×(3N), ..., (N²)×(N²). For example, from a 5×5 Kekre’s transform matrix we can generate Kekre’s Wavelet transform matrices of size 10×10, 15×15, 20×20 and 25×25. In general, an M×M Kekre’s Wavelet transform matrix can be generated from an N×N Kekre’s transform matrix such that M = N·P, where P is any integer with 2 ≤ P ≤ N. Consider the Kekre’s transform matrix of size N×N shown in Figure 2.

$$
K = \begin{bmatrix}
K_{11} & K_{12} & K_{13} & \cdots & K_{1(N-1)} & K_{1N} \\
K_{21} & K_{22} & K_{23} & \cdots & K_{2(N-1)} & K_{2N} \\
K_{31} & K_{32} & K_{33} & \cdots & K_{3(N-1)} & K_{3N} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
K_{N1} & K_{N2} & K_{N3} & \cdots & K_{N(N-1)} & K_{NN}
\end{bmatrix}
$$

Figure 2. Kekre’s Transform Matrix (K) of size N×N

Kekre’s transform on a column vector f is given by

$$
F = [K]\, f \tag{3}
$$

An M×M Kekre’s Wavelet transform matrix is generated from the N×N Kekre’s transform matrix as follows. The first N rows of Kekre’s Wavelet transform matrix are generated by repeating every column of Kekre’s transform matrix P times. To generate the remaining (M-N) rows, extract the last (P-1) rows and the last P columns of Kekre’s transform matrix and store the extracted elements in a temporary matrix, say T, of size (P-1)×P.

$$
T = \begin{bmatrix}
K_{(N-P+2)(N-P+1)} & K_{(N-P+2)(N-P+2)} & \cdots & K_{(N-P+2)N} \\
K_{(N-P+3)(N-P+1)} & K_{(N-P+3)(N-P+2)} & \cdots & K_{(N-P+3)N} \\
\vdots & \vdots & \ddots & \vdots \\
K_{N(N-P+1)} & K_{N(N-P+2)} & \cdots & K_{NN}
\end{bmatrix}
$$

Figure 3. Temporary matrix T of size (P-1)×P
Figure 3 shows the extracted elements of Kekre’s transform matrix stored in T. The values of matrix T can be computed as

$$
T(x,y) = K\big(N-P+(x+1),\; N-P+y\big), \qquad 1 \le x \le P-1,\; 1 \le y \le P \tag{4}
$$

The first row of T is used to generate rows (N+1) to 2N of Kekre’s Wavelet transform matrix, the second row of T is used to generate rows (2N+1) to 3N, and so on; the last row of T is used to generate rows ((P-1)N+1) to PN. Within each such group of N rows, the row of T is placed in successive, non-overlapping blocks of P columns, and all other entries are zero.

Example: generating Kekre’s Wavelet Transform. Figure 4 shows the Kekre’s Wavelet transform matrix of size 15×15 generated from the 5×5 Kekre’s transform matrix. Here M = 15, N = 5 and P = M/N = 3.

$$
K_{5\times 5} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
-4 & 1 & 1 & 1 & 1 \\
0 & -3 & 1 & 1 & 1 \\
0 & 0 & -2 & 1 & 1 \\
0 & 0 & 0 & -1 & 1
\end{bmatrix} \tag{5}
$$

The matrix above is used to generate the Kekre’s Wavelet transform matrix of size 15×15. As shown in Figure 4, all the columns of Kekre’s transform matrix are repeated P = 3 times to generate the first N = 5 rows of Kekre’s Wavelet transform matrix. To generate the remaining (M-N) = 10 rows, extract the last (P-1) = 2 rows and the last P = 3 columns of Kekre’s transform matrix and store these elements in the temporary matrix T:

$$
T = \begin{bmatrix}
-2 & 1 & 1 \\
0 & -1 & 1
\end{bmatrix} \tag{6}
$$

The first row of T, [-2 1 1], is used to generate rows 6-10 of the KW transform matrix, as shown in Figure 4; in each of these rows it occupies a different block of three columns (columns 1-3, 4-6, ..., 13-15). The second row of T, [0 -1 1], is used in the same way to generate the last rows, 11-15, of the KW transform matrix [6].
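The row-by-row construction walked through above can be sketched in NumPy as follows. This is a minimal illustration under our own naming (kekre_matrix, kekre_wavelet_matrix and kekre_wavelet_transform are hypothetical helpers, not functions from [5] or [6]); it builds K from (2), takes T as in (4), and places each row of T in N non-overlapping blocks of P columns with np.kron.

```python
import numpy as np


def kekre_matrix(n):
    """N x N Kekre's transform matrix with the entries of (2)."""
    i, j = np.indices((n, n)) + 1            # 1-based row (x) and column (y)
    k = np.where(i <= j, 1.0, 0.0)           # ones on and above the diagonal
    sub = i == j + 1                         # entries just below the diagonal
    k[sub] = -n + (i[sub] - 1)
    return k


def kekre_wavelet_matrix(n, p):
    """M x M Kekre's Wavelet transform matrix with M = N * P and 2 <= P <= N."""
    if not 2 <= p <= n:
        raise ValueError("P must satisfy 2 <= P <= N")
    k = kekre_matrix(n)
    # First N rows: every column of K repeated P times.
    rows = [np.repeat(k, p, axis=1)]
    # T = last (P-1) rows and last P columns of K, as in (4).
    t = k[n - p + 1:, n - p:]
    # Row r of T yields N further rows: T[r] sits in N successive,
    # non-overlapping blocks of P columns, with zeros elsewhere.
    for r in range(p - 1):
        rows.append(np.kron(np.eye(n), t[r:r + 1]))
    return np.vstack(rows)


def kekre_wavelet_transform(f, n, p):
    """Apply F = [KW] f to a column vector (or to every column of a matrix)."""
    return kekre_wavelet_matrix(n, p) @ np.asarray(f, dtype=float)
```

For N = 5 and P = 3 the function returns a 15×15 matrix whose rows 6-10 are built from [-2 1 1] and rows 11-15 from [0 -1 1], as in the example above; (4, 2) and (8, 4) give the 8×8 and 32×32 sizes used in Section V. Checking that KW @ KW.T is diagonal is a quick sanity test, since the rows produced by this construction are mutually orthogonal.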
Kekre’s Wavelet Transform on a column vector f is given by

$$
F = [KW]\, f \tag{7}
$$

$$
KW = \left[\begin{array}{rrrrrrrrrrrrrrr}
1&1&1&1&1&1&1&1&1&1&1&1&1&1&1\\
-4&-4&-4&1&1&1&1&1&1&1&1&1&1&1&1\\
0&0&0&-3&-3&-3&1&1&1&1&1&1&1&1&1\\
0&0&0&0&0&0&-2&-2&-2&1&1&1&1&1&1\\
0&0&0&0&0&0&0&0&0&-1&-1&-1&1&1&1\\
-2&1&1&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&-2&1&1&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&-2&1&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&-2&1&1&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&-2&1&1\\
0&-1&1&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&-1&1&0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&-1&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&-1&1&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&0&-1&1
\end{array}\right]
$$

Figure 4. 15×15 Kekre’s Wavelet transform matrix generated from the 5×5 Kekre’s transform matrix

IV. FEATURE VECTOR EXTRACTION

The procedure for feature vector extraction is as follows. The spectrogram is calculated by the procedure given in Section 2. It carries three-dimensional information: time, frequency and amplitude. The spectrogram contains a very wide range of amplitude levels, with very small values alongside very large ones. If these values are used linearly (as they are), the very large amplitudes dominate the output while the small ones contribute little, so the results are generally not faithful. If the log is first applied to the spectrogram values, the range of values is compressed and becomes more manageable. Therefore the log is applied to the spectrogram of the speech signal. A column transform (Kekre’s Transform or Kekre’s Wavelet Transform) is then applied to the log spectrogram, and the transformed matrix forms the features of the speech signal. The features of the different speakers are calculated and used to create a model for every speaker in the database.

V. RESULTS

A. Experimental Results

In identification, the goal is to determine which voice in a known group of voices best matches the speaker. For this, the Gaussian Mixture Modeling technique is used. The identification system is a straightforward maximum likelihood classifier: for a reference group of speaker models, the objective is to find the speaker identity whose model has the maximum posterior probability for the input feature-vector sequence [3], [4].

B. Accuracy of Identification

The accuracy of the identification system [3] is calculated as given by (8).

$$
\text{Identification accuracy (\%)} = \frac{\text{No. of segments correctly identified}}{\text{Total no. of segments under test}} \times 100 \tag{8}
$$

The test segment length is 5 seconds, and accuracy is measured over 24 speakers. Tables 1 and 2 show the results obtained with the two transforms for an overlap of 50% and 25%, respectively, between adjacent frames when creating the spectrograms of the speech signals. The Kekre’s Wavelet Transform matrix of size 8×8 is obtained from the Kekre’s transform matrix of size 4×4, and the Kekre’s Wavelet Transform matrix of size 32×32 is obtained from the Kekre’s transform matrix of size 8×8. In Table 1, the accuracy is obtained for a feature vector size of 256 and an overlap of 50%.

TABLE 1. IDENTIFICATION ACCURACY (%) FOR 50% OVERLAP

Size of transform matrix | Kekre’s Transform | Kekre’s Wavelet Transform
8×8                      | 94.44             | 89.40
32×32                    | 94.96             | 92.53

For Kekre’s transform, the accuracy is higher than for Kekre’s Wavelet transform at both matrix sizes (94.44% and 94.96% versus 89.40% and 92.53%). For both transforms, the accuracy is higher with the transform matrix of size 32×32.
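The maximum likelihood decision rule of Section V-A and the accuracy measure (8) can be sketched as follows. This is a hypothetical illustration, not the authors’ code: each speaker model is assumed to be an already trained diagonal-covariance GMM given as (weights, means, variances), EM training is not shown, and equal speaker priors are assumed so that the maximum posterior rule reduces to maximum likelihood.

```python
import numpy as np


def gmm_log_likelihood(X, weights, means, variances):
    """Sum over frames of log p(x_t | model) for a diagonal-covariance GMM.

    X: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    """
    diff = X[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                        + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_weighted = log_gauss + np.log(weights)[None, :]           # (T, M)
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(per_frame.sum())


def identify(X, models):
    """Return the speaker label whose model maximizes the log-likelihood of X."""
    return max(models, key=lambda spk: gmm_log_likelihood(X, *models[spk]))


def identification_accuracy(predicted, actual):
    """Equation (8): percentage of test segments correctly identified."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

In the setting of Section V, each X would hold the feature vectors of one 5-second test segment, identify would be called once per segment, and identification_accuracy would then be evaluated over all test segments of the 24 speakers.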
In Table 2, the accuracy is obtained for a feature vector size of 256 and an overlap of 25%.

TABLE 2. IDENTIFICATION ACCURACY (%) FOR 25% OVERLAP

Size of transform matrix | Kekre’s Transform | Kekre’s Wavelet Transform
8×8                      | 91.49             | 88.54
32×32                    | 93.57             | 91.49

As Table 2 shows, with 25% overlapping frames the accuracy is slightly lower than with 50% overlapping frames for both transforms.

CONCLUSION

In this paper, we have compared the performance of Kekre’s transform and Kekre’s Wavelet transform for speaker identification. Kekre’s transform gives an average accuracy of around 94%, and Kekre’s Wavelet transform gives an average accuracy of about 91%. Both figures are obtained for a feature vector size of 256 and 50% overlap between consecutive frames of the speech signal. For 25% overlapping frames, the average accuracy is slightly lower than for 50% overlapping frames for both transforms. A slight difference in accuracy is observed when the size of the Kekre’s Wavelet Transform matrix is changed. Overall, Kekre’s transform gives better results than Kekre’s Wavelet transform for 50% overlap of consecutive frames of the speech signal.

REFERENCES

[1] B. S. Atal, "Automatic recognition of speakers from their voices," Proc. IEEE, vol. 64, no. 4, pp. 460-475, Apr. 1976.
[2] J. P. Campbell, Jr., "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, Sep. 1997.
[3] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173-192, 1995.
[4] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
[5] H. B. Kekre, A. Athawale and D. Sadavarti, "Algorithm to generate Kekre’s Wavelet Transform from Kekre’s Transform," International Journal of Engineering Science and Technology, vol. 2, no. 5, pp. 756-767, 2010.
[6] H. B. Kekre and V. Kulkarni, "Speaker identification using row mean of Haar and Kekre’s Transform on spectrograms of different frame sizes," (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Artificial Intelligence.
