Comparative Analysis of Speaker Identification using row mean of DFT, DCT, DST and Walsh Transforms
The International Journal of Computer Science and Information Security (IJCSIS) is a reputable venue for publishing novel ideas, state-of-the-art research results and fundamental advances in all aspects of computer science and information & communication security. IJCSIS is a peer reviewed international journal with a key objective to provide the academic and industrial community a medium for presenting original research and applications related to Computer Science and Information Security. The core vision of IJCSIS is to disseminate new knowledge and technology for the benefit of everyone ranging from the academic and professional research communities to industry practitioners in a range of topics in computer science & engineering in general and information & communication security, mobile & wireless networking, and wireless communication systems. It also provides a venue for high-calibre researchers, PhD students and professionals to submit on-going research and developments in these areas. IJCSIS invites authors to submit their original and unpublished work that communicates current research on information assurance and security regarding both the theoretical and methodological aspects, as well as various applications in solving real world information security problems. Frequency of Publication: MONTHLY. ISSN: 1947-5500 [Copyright © 2011, IJCSIS, USA]

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 9, No. 1, 2011
Comparative Analysis of Speaker Identification using
row mean of DFT, DCT, DST and Walsh Transforms
Dr. H B Kekre
Senior Professor, Computer Department,
MPSTME, NMIMS University,
Mumbai, India
hbkekre@yahoo.com

Vaishali Kulkarni
Associate Professor, Electronics & Telecommunication,
MPSTME, NMIMS University,
Mumbai, India
Vaishalikulkarni6@yahoo.com
Abstract - In this paper we propose Speaker Identification using four different Transform Techniques. The feature vectors are the row mean of the transforms for different groupings. Experiments were performed on Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). All the transforms achieve a maximum accuracy of more than 80% over the groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards. But for groupings more than 1024 the accuracy again starts decreasing. The results show that DST performs best. The maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.

Keywords - Euclidean distance, Row mean, Speaker Identification, Speaker Recognition

I. INTRODUCTION
Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3 - 5]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. However, in a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.

Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, for giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7 - 10].

Although many new techniques have been developed, widespread deployment of applications and services is still not possible. None of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in time domain by using LBG (Linde Buzo Gray), KFCG (Kekre’s Fast Codebook Generation) and KMCG (Kekre’s Median Codebook Generation) algorithms [11], [12], [13] and in transform domain using DFT, DCT and DST [14].

The concept of row mean of the transform techniques has been used for content based image retrieval (CBIR) [15 – 18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19].

For the purposes of this paper, we consider a speaker identification system that is text-dependent. For identification, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is shown in Figure 1. Here a speech signal of 15 samples is divided into 3 blocks of 5 each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken and this forms the column vector of means.

The rest of the paper is organized as follows: Section II explains the transform techniques, Section III deals with feature extraction, the results (including feature matching) are explained in Section IV and the conclusion in Section V.

II. TRANSFORM TECHNIQUES
A. Discrete Fourier Transform
Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by y(t), then the DFT of the time series or samples y_0, y_1, y_2, …, y_{N-1} is defined as:
102 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-2j\pi kn/N} \qquad (1)$$

where y_n = y_s(nΔt), k = 0, 1, 2, …, N-1, and Δt is the sampling interval.

Speech signal (1 × 15):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Dividing into blocks of 5, the speech signal is converted into a matrix (5 × 3):
1  6 11
2  7 12
3  8 13
4  9 14
5 10 15

Transform T gives the transform matrix (5 × 3); the mean of each row gives the row mean (1 × 5):
C1 C2 C3 C4 C5

Figure 1. Row Mean Generation Technique

B. Discrete Cosine Transform
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies.

$$y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\!\left(\frac{\pi (2n-1)(k-1)}{2N}\right) \qquad (2)$$

where x(n) are the input samples, y(k) is the cosine transform, k = 1, …, N, and

$$w(k) = \begin{cases} \dfrac{1}{\sqrt{N}}, & k = 1 \\ \sqrt{\dfrac{2}{N}}, & 2 \le k \le N \end{cases}$$

The DCT is closely related to the discrete Fourier transform. You can often reconstruct a sequence very accurately from only a few DCT coefficients, a useful property for applications requiring data reduction [20 – 22].

C. Discrete Sine Transform
A discrete sine transform (DST) expresses a sequence of finitely many data points in terms of a sum of sine functions.

$$y(k) = \sum_{n=1}^{N} x(n) \sin\!\left(\frac{\pi k n}{N+1}\right) \qquad (3)$$

where x(n) are the input samples and y(k) is the sine transform, k = 1, …, N.

D. Walsh Transform
The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or –1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and spread-spectrum analysis. Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. FWHT_h is a divide and conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix H_N:

$$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix} \qquad (4)$$

The 1/√2 normalization factors for each stage may be grouped together or even omitted. The sequency ordered, also known as Walsh ordered, fast Walsh–Hadamard transform, FWHT_w, is obtained by computing the FWHT_h as above, and then rearranging the outputs [23].

III. FEATURE EXTRACTION
The procedure for feature vector extraction is given below:
1. The speech signal is divided into groups of n samples, where n can take the values 64, 128, 256, 512, 1024, 2048 and 4096.
2. These blocks are then arranged as columns of a matrix and then the different transforms given in Section II are taken.
3. The mean of the absolute values of the rows of the transform matrix is then calculated.
4. These row means form a column vector (1 × n where n is the number of rows in the transform matrix).
5. This column vector forms the feature vector for the speech sample.
6. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.

Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for other speech signals were also calculated. This process was repeated for all values of n. As can be seen from Figure 2, the 64 mean values form a 1×64 feature vector.

[Figure 2 has four panels, each plotting Amplitude against the mean of the absolute value for each row of the transform matrix: (A) Row mean for DFT for a grouping of 64 samples; (B) Row mean for DCT for a grouping of 64 samples; (C) Row mean for DST for a grouping of 64 samples; (D) Row mean for Walsh for a grouping of 64 samples.]
Figure 2. Row Mean Generation for a grouping of 64 samples for one of the speech signals

IV. RESULTS
A. Basics of speech signal
The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that training model and testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.
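The row-mean feature extraction of Section III can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dft`, `fwht`, `row_mean_features`) are hypothetical, the transform is assumed to be applied to each block (matrix column) independently, trailing samples that do not fill a complete block are assumed to be dropped, and the Walsh–Hadamard routine is the unnormalized, Hadamard-ordered variant rather than the sequency-ordered one.

```python
import cmath


def dft(block):
    """Direct DFT of one block: Y_k = sum_n y_n * exp(-2j*pi*k*n/N)."""
    n = len(block)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * i / n) for i, y in enumerate(block))
        for k in range(n)
    ]


def fwht(block):
    """Unnormalized, Hadamard-ordered fast Walsh-Hadamard transform.

    Assumption: the block length is a power of two.
    """
    a = list(block)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine pairs that are h positions apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def row_mean_features(signal, n, transform):
    """Split the signal into blocks of n samples (the matrix columns),
    transform each block, and average the absolute values along each row."""
    blocks = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    if not blocks:
        raise ValueError("signal is shorter than one block")
    columns = [transform(block) for block in blocks]
    return [sum(abs(col[k]) for col in columns) / len(columns) for k in range(n)]


# Toy example of Figure 1: 15 samples, 3 blocks of 5, DFT as the transform.
toy_signal = list(range(1, 16))
features = row_mean_features(toy_signal, 5, dft)
print(features)
```

For the groupings used in the paper, `n` would be 64, 128, and so on, and `signal` would hold the recorded speech samples.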
TABLE I. DATABASE DESCRIPTION
Parameter               Sample characteristics
Language                English
No. of Speakers         105
Speech type             Read speech
Recording conditions    Normal (a silent room)
Sampling frequency      8000 Hz
Resolution              8 bits per sample

B. Experimental Results
The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is taken and processed as in the training phase to form the feature vector. The stored feature vector which gives the minimum Euclidean distance with the input sample feature vector is declared as the speaker identified.

Table II gives the number of matches for the four different transforms. The matching has been calculated by considering the minimum Euclidean distance between the feature vector of the test speech signal and the feature vector of the speech signals stored in the database. The rows of Table II show the number of samples of each speech signal grouped together to form the columns of a matrix whose transform is then taken. For each grouping, the transform which gives maximum matches is marked with an asterisk. We can see that for groupings of 64, 128 and 256 DST gives the best matching, i.e. 86, 98 and 99 (out of 105) respectively. For a grouping of 512, DCT gives the best matching, i.e. 99. For a grouping of 1024 samples, DST gives maximum matches, i.e. 101. It can also be seen that as the number of samples grouped is further increased beyond 1024, no transform exceeds its number of matches at 1024, and the number of matches generally falls.

TABLE II. NO. OF MATCHES FOR DIFFERENT GROUPINGS
No. of samples    Number of matches (out of 105)
grouped           FFT     DCT     DST     WALSH
64                78      85      86*     76
128               87      92      98*     79
256               96      98      99*     82
512               97      99*     98      85
1024              100     97      101*    89
2048              100*    96      97      85
4096              98      96      99*     83
8192              96*     90      90      67

C. Accuracy of Identification
The accuracy of the identification system is calculated as given by equation 5.

$$\text{Accuracy (\%)} = \frac{\text{Number of matches}}{\text{Total number of test samples}} \times 100 \qquad (5)$$

The accuracy for the different groupings of the four transforms was calculated and is shown in Figure 3.
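The matching phase described in this section amounts to a nearest-neighbour search under the Euclidean distance. The following is a minimal sketch, not the authors' code: the names `euclidean_distance` and `identify`, the speaker labels and the three-element feature vectors are hypothetical, and one stored reference vector per speaker is assumed.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(test_vector, database):
    """Return the label whose stored feature vector is closest to the test vector."""
    return min(database, key=lambda label: euclidean_distance(test_vector, database[label]))


# Hypothetical reference database: speaker label -> row-mean feature vector.
database = {
    "speaker_A": [0.9, 0.2, 0.1],
    "speaker_B": [0.1, 0.8, 0.3],
    "speaker_C": [0.4, 0.4, 0.9],
}

print(identify([0.2, 0.7, 0.35], database))  # prints "speaker_B"
```

In the experiments the vectors would have n elements (64 up to 8192), one per row of the transform matrix.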
Figure 3. Accuracy for the four transforms by varying the groupings of samples
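As a worked check of the accuracy figures, the short sketch below applies the accuracy calculation (matches divided by the 105 test samples, times 100) to the match counts of Table II. The helper names `accuracy` and `best_grouping` are hypothetical, and ties are assumed to be resolved in favour of the smaller grouping.

```python
TOTAL_TEST_SAMPLES = 105

# Number of matches from Table II, keyed by the number of samples grouped.
MATCHES = {
    64:   {"FFT": 78,  "DCT": 85, "DST": 86,  "WALSH": 76},
    128:  {"FFT": 87,  "DCT": 92, "DST": 98,  "WALSH": 79},
    256:  {"FFT": 96,  "DCT": 98, "DST": 99,  "WALSH": 82},
    512:  {"FFT": 97,  "DCT": 99, "DST": 98,  "WALSH": 85},
    1024: {"FFT": 100, "DCT": 97, "DST": 101, "WALSH": 89},
    2048: {"FFT": 100, "DCT": 96, "DST": 97,  "WALSH": 85},
    4096: {"FFT": 98,  "DCT": 96, "DST": 99,  "WALSH": 83},
    8192: {"FFT": 96,  "DCT": 90, "DST": 90,  "WALSH": 67},
}


def accuracy(matches, total=TOTAL_TEST_SAMPLES):
    """Identification accuracy in percent."""
    return 100.0 * matches / total


def best_grouping(transform):
    """Grouping with the most matches for a transform (smallest grouping on ties)."""
    grouping = max(MATCHES, key=lambda g: (MATCHES[g][transform], -g))
    return grouping, accuracy(MATCHES[grouping][transform])


for name in ("FFT", "DCT", "DST", "WALSH"):
    grouping, acc = best_grouping(name)
    print(f"{name}: best grouping {grouping}, accuracy {acc:.4f}%")
```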
The results show that the accuracy increases as the feature vector size is increased from 64 to 512 for all the transforms except DST, for which the accuracy decreases from 94.28% to 93.33% as the feature vector size is increased from 256 to 512. The feature vector size of 1024 gives the best result for all the transforms except DCT. For DCT, the best result is obtained for a feature vector size of 512. For DFT, the maximum accuracy obtained is 95.2381% for a feature vector size of 1024. Walsh transform gives a maximum accuracy of 84.7619%, also for a feature vector size of 1024. DST performs best, giving a maximum accuracy of 96.1905% for a feature vector size of 1024.

V. CONCLUSION
In this paper we have compared the performance of four different transforms for speaker identification. All the transforms achieve a maximum accuracy of more than 80% over the feature vector sizes considered. Accuracy increases as the feature vector size is increased from 64 onwards. But for feature vector sizes of more than 1024 the accuracy again starts decreasing. The results show that DST performs best. The maximum accuracy obtained for DST is around 96% for a feature vector size of 1024. The present study is ongoing and we are analyzing the performance on other transforms.

REFERENCES
[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, “Fundamental of Speech Recognition”, Prentice-Hall, Englewood Cliffs, 2009.
[2] S. Furui, “50 years of progress in speech and speaker recognition research”, ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, “An overview of automatic speaker recognition technology,” Proc. IEEE Int. Conf. Acoust., Speech,S
[4] Joseph P. Campbell, Jr., Senior Member, IEEE, “Speaker Recognition: A Tutorial”, Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[5] S. Furui, “Recent advances in speaker recognition”, AVBPA97, pp. 237-251, 1997.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430–451, 2004.
[7] D. A. Reynolds, “Experimental evaluation of features for robust speaker identification,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639–643, Oct. 1994.
[8] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, “Realtime Speaker Identification”, ICSLP2004.
[9] Marco Grimaldi and Fred Cummins, “Speaker Identification using Instantaneous Frequencies”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan, Yuan & Bo-Ling, Xu & Chong-Zhi, Yu. (1999). “Binary Quantization of Feature vectors for robust text-independent Speaker Identification” in IEEE Transactions on Speech and Audio Processing Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[11] H B Kekre, Vaishali Kulkarni, “Speaker Identification by using Vector Quantization”, International Journal of Engineering Science and Technology, May 2010.
[12] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG”, International Journal of Computer Applications, vol. 3, July 2010.
[13] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG KFCG and KMCG”, International Journal of Computer Science and Security, Vol. 4, Issue 4, 2010.
[14] H B Kekre, Vaishali Kulkarni, “Comparative Analysis of Automatic Speaker Recognition using Kekre’s Fast Codebook Generation Algorithm in Time Domain and Transform Domain”, International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[15] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Performance Comparision of Image Retrieval using Row Mean of Transformed Column Image”, International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, 1908-1912.
[16] Dr. H. B. Kekre, Sudeep Thepade, “Edge Texture Based CBIR using Row Mean of Transformed Column Gradient Image”, International Journal of Computer Applications (0975 – 8887), Volume 7, No. 10, October 2010.
[17] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Eigenvectors of Covariance Matrix using Row Mean and Column Mean Sequences for Face Recognition”, International Journal of Biometrics and Bioinformatics (IJBB), Volume (4): Issue (2).
[18] Dr. H. B. Kekre, Sudeep Thepade, Archana Athawale, “Grayscale Image Retrieval using DCT on Row mean, Column mean and Combination”, Journal of Sci., Engg. & Tech. Mgt., Vol. 2 (1), January 2010.
[19] Dr. H. B. Kekre, Dr. T. K. Sarode, Shachi J. Natu, Prachi J. Natu, “Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram”, International Journal of Computer Applications (0975 – 8887), Volume 5, No. 6, August 2010.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform”, IEEE Trans. Computers, 90-93, Jan 1974.
[21] N. Ahmed, “How I came up with the Discrete Cosine Transform”, Digital Signal Processing, Vol. 1, 1991.
[22] G. Strang, “The Discrete Cosine Transform,” SIAM Review, Volume 41, Number 1, 1999.
[23] Fino, B.J., and Algazi, V.R., 1976, “Unified Matrix Treatment of the Fast Walsh–Hadamard Transform,” IEEE Transactions on Computers 25: 1142–1146.

AUTHORS PROFILE
Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked for over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS University, Vile Parle(w), Mumbai, INDIA. He has guided 17 Ph.D.s, 150 M.E./M.Tech Projects and several B.E./B.Tech Projects. His areas of interest are Digital Signal processing, Image Processing and Computer Networks. He has more than 300 papers in National / International Conferences / Journals to his credit. Recently twelve students working under his guidance have received best paper awards. Recently two research scholars have received Ph.D. degrees from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.
Vaishali Kulkarni has received B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing Ph.D. from NMIMS University. She has a teaching experience of more than 8 years. She is Associate Professor in the Telecom Department in MPSTME, NMIMS University. Her areas of interest include Speech processing: Speech and Speaker Recognition. She has 8 papers in National / International Conferences / Journals to her credit.