									                                                             (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                       Vol. 9, No. 1, 2011

Comparative Analysis of Speaker Identification using
row mean of DFT, DCT, DST and Walsh Transforms

                      Dr. H B Kekre                                                             Vaishali Kulkarni
          Senior Professor, Computer Department,                              Associate Professor, Electronics & Telecommunication,
              MPSTME, NMIMS University,                                                  MPSTME, NMIMS University,
                      Mumbai, India                                                               Mumbai, India

Abstract - In this paper we propose Speaker Identification using four different Transform Techniques. The feature vectors are the row mean of the transforms for different groupings. Experiments were performed on Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). All the transforms give an accuracy of more than 80% for the different groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards, but for groupings of more than 1024 samples the accuracy starts decreasing again. The results show that DST performs best. The maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.

Keywords - Euclidean distance, Row mean, Speaker Identification, Speaker Recognition

I. INTRODUCTION

Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3 - 5]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. However, in a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.

Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, for giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7 - 10].

Although many new techniques have been developed, widespread deployment of applications and services is still not possible. None of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in time domain by using LBG (Linde Buzo Gray), KFCG (Kekre’s Fast Codebook Generation) and KMCG (Kekre’s Median Codebook Generation) algorithms [11], [12], [13] and in transform domain using DFT, DCT and DST [14].

The concept of row mean of the transform techniques has been used for content based image retrieval (CBIR) [15 - 18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19].

For the purposes of this paper, we consider a speaker identification system that is text-dependent. For identification, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is shown in Figure 1. Here a speech signal of 15 samples is divided into 3 blocks of 5 each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken, and this forms the column vector of means.

The rest of the paper is organized as follows: Section 2 explains the transform techniques, Section 3 describes feature extraction, Section 4 presents the results and Section 5 gives the conclusion.

II. TRANSFORM TECHNIQUES

A. Discrete Fourier Transform

Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by y(t), then the DFT of the time series or samples y0, y1, y2, …, yN-1 is defined as:

                                                                                                     ISSN 1947-5500

$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-2j\pi kn/N} \tag{1}$$

where y_n = y_s(nΔt), k = 0, 1, 2, …, N-1, and Δt is the sampling interval.

     Speech signal (1 × 15):  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

     Speech signal divided
     into blocks of 5 and
     converted into           Transform       Transform matrix     Mean of       Row mean
     matrix (5 × 3)                           (5 × 3)              each row      (1 × 5)
      1    6   11                                                                   C1
      2    7   12                                                                   C2
      3    8   13                 T                                                 C3
      4    9   14                                                                   C4
      5   10   15                                                                   C5

                       Figure 1. Row Mean Generation Technique

B. Discrete Cosine Transform

A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. With x(n), n = 1, …, N, denoting the input samples, the commonly used form is:

$$y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi (2n-1)(k-1)}{2N}\right) \tag{2}$$

where y(k) is the cosine transform, k = 1, …, N, and

$$w(k) = \begin{cases} 1/\sqrt{N}, & k = 1 \\ \sqrt{2/N}, & 2 \le k \le N \end{cases}$$

The DCT is closely related to the discrete Fourier transform. You can often reconstruct a sequence very accurately from only a few DCT coefficients, a useful property for applications requiring data reduction [20 - 22].

C. Discrete Sine Transform

A discrete sine transform (DST) expresses a sequence of finitely many data points in terms of a sum of sine functions. With x(n), n = 1, …, N, denoting the input samples, the commonly used form is:

$$y(k) = \sum_{n=1}^{N} x(n) \sin\left(\frac{\pi k n}{N+1}\right) \tag{3}$$

where y(k) is the sine transform, k = 1, …, N.

D. Walsh Transform

The Walsh transform or Walsh–Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh–Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh–Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and spread-spectrum analysis.

Like the FFT, the Walsh–Hadamard transform has a fast version, the fast Walsh–Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. The Hadamard ordered transform, FWHT_h, is a divide and conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the N × N Hadamard matrix H_N:

$$H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix} \tag{4}$$

The 1/√2 normalization factors for each stage may be grouped together or even omitted. The sequency ordered, also known as Walsh ordered, fast Walsh–Hadamard transform, FWHT_w, is obtained by computing the FWHT_h as above, and then rearranging the outputs [23].

III. FEATURE EXTRACTION

The procedure for feature vector extraction is given below:

1.   The speech signal is divided into groups of n samples, where n can take the values 64, 128, 256, 512, 1024, 2048 and 4096.
2.   These blocks are then arranged as columns of a matrix and then the different transforms given in section II are applied.


3.   The mean of the absolute values of the rows of the transform matrix is then calculated.
4.   These row means form a column vector (1 × n where n is the number of rows in the transform matrix).
5.   This column vector forms the feature vector for the speech sample.
6.   The feature vectors for all the speech samples are calculated for different values of n and stored in the database.

Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for the other speech signals were also calculated. This process was repeated for all values of n. As can be seen from Figure 2, the 64 mean values form a 1×64 feature vector.

IV. RESULTS

A. Basics of speech signal

The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that a training model and testing data can be created. Twelve samples per speaker are taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.

[Figure: four panels for a grouping of 64 samples: (A) row mean for DFT, (B) row mean for DCT, followed by the row mean for DST and the row mean for Walsh. Each panel plots the mean of the absolute value for each row of the transform matrix against the row index.]

Figure 2. Row Mean Generation for a grouping of 64 samples for one of the speech signals
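
The feature extraction procedure of Section III, including the 15-sample example of Figure 1, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the transform is assumed to be applied to each column separately (1-D), trailing samples that do not fill a block are dropped, the DFT is computed naively from equation (1), and the Walsh–Hadamard transform is taken in Hadamard order without normalization. All function names are hypothetical.

```python
import cmath


def dft(column):
    """Naive DFT of one column, following equation (1)."""
    n_samples = len(column)
    return [
        sum(column[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def fwht(column):
    """Fast Walsh-Hadamard transform (Hadamard order, unnormalised).

    Assumes len(column) is a power of two.
    """
    values = list(column)
    h = 1
    while h < len(values):
        for start in range(0, len(values), 2 * h):
            for i in range(start, start + h):
                a, b = values[i], values[i + h]
                values[i], values[i + h] = a + b, a - b
        h *= 2
    return values


def row_mean_features(signal, block_size, transform=dft):
    """Row mean of the absolute transform values over blocks of block_size samples."""
    n_blocks = len(signal) // block_size  # assumption: incomplete last block is dropped
    if n_blocks == 0:
        raise ValueError("signal is shorter than one block")
    # Each block is one column of the matrix; transform every column.
    columns = [
        transform(signal[b * block_size:(b + 1) * block_size])
        for b in range(n_blocks)
    ]
    # Mean of the absolute values along each row (across the columns).
    return [
        sum(abs(col[row]) for col in columns) / n_blocks
        for row in range(block_size)
    ]


# Figure 1 example: 15 samples arranged as 3 columns of 5 samples each.
signal = list(range(1, 16))
features = row_mean_features(signal, 5)

# The same routine with the Walsh-Hadamard transform on blocks of 4 samples.
walsh_features = row_mean_features(list(range(1, 17)), 4, transform=fwht)
```

For the Figure 1 example with the DFT, the first feature is the mean of the three column sums (15, 40 and 65). Passing a different `transform` function changes the transform without touching the grouping or averaging steps.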

TABLE I. DATABASE DESCRIPTION

      Parameter                  Sample characteristics
      Language                   English
      No. of Speakers            105
      Speech type                Read speech
      Recording conditions       Normal (a silent room)
      Sampling frequency         8000 Hz
      Resolution                 8 bits per sample

B. Experimental Results

The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is taken and processed in the same way as in the training phase to form the feature vector. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the input sample is declared as the identified speaker.

Table II gives the number of matches for the four different transforms. The matching has been calculated by considering the minimum Euclidean distance between the feature vector of the test speech signal and the feature vectors of the speech signals stored in the database. The rows of Table II show the number of samples of each speech signal grouped together to form the columns of a matrix whose transform is then taken. For each grouping, the transform which gives the maximum matches is marked with an asterisk (shaded in yellow in the original table). We can see that for groupings of 64, 128 and 256, DST gives the best matching, i.e. 86, 98 and 99 (out of 105) respectively. For a grouping of 512, DCT gives the best matching, i.e. 99. For a grouping of 1024 samples, DST gives the maximum matches, i.e. 101. It can also be seen that as the number of samples grouped is further increased beyond 1024, the number of matches is reduced for all the transforms.

TABLE II. NO. OF MATCHES FOR DIFFERENT GROUPINGS

      No. of samples       Number of matches (out of 105)
      grouped              FFT        DCT        DST        WALSH
        64                  78         85         86*        76
       128                  87         92         98*        79
       256                  96         98         99*        82
       512                  97         99*        98         85
      1024                 100         97        101*        89
      2048                 100*        96         97         85
      4096                  98         96         99*        83
      8192                  96*        90         90         67

C. Accuracy of Identification

The accuracy of the identification system is calculated as given by equation 5.

$$\text{Accuracy (\%)} = \frac{\text{Number of matches}}{\text{Total number of test samples}} \times 100 \tag{5}$$

The accuracy for the different groupings of the four transforms was calculated and is shown in Figure 3.

Figure 3. Accuracy for the four transforms by varying the groupings of samples
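
The matching rule (minimum Euclidean distance) and the accuracy calculation can be illustrated with a short sketch. The toy database below holds hypothetical feature values and one stored vector per speaker, which is a simplifying assumption; the match counts are those reported in Table II for a grouping of 1024 samples, and the function names are illustrative only.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(test_vector, database):
    """Return the speaker whose stored feature vector is closest to test_vector."""
    return min(
        database,
        key=lambda speaker: euclidean_distance(test_vector, database[speaker]),
    )


def accuracy_percent(matches, total):
    """Accuracy as the percentage of correctly identified test samples."""
    return 100.0 * matches / total


# Hypothetical stored feature vectors (training phase).
database = {
    "speaker_a": [0.9, 0.1, 0.3],
    "speaker_b": [0.2, 0.8, 0.5],
}
identified = identify([0.85, 0.15, 0.25], database)

# Matches out of 105 for a grouping of 1024 samples (Table II).
matches_1024 = {"FFT": 100, "DCT": 97, "DST": 101, "WALSH": 89}
accuracy_1024 = {
    name: round(accuracy_percent(count, 105), 4)
    for name, count in matches_1024.items()
}
```

With these counts the sketch gives 96.1905% for DST, 95.2381% for FFT and 84.7619% for Walsh, which agrees with the figures quoted in the discussion of the results.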

The results show that the accuracy increases as we increase the feature vector size from 64 to 512 for the transforms. Only for DST does the accuracy decrease, from 94.28% to 93.33%, as we increase the feature vector size from 256 to 512. The feature vector size of 1024 gives the best result for all the transforms except DCT. For DCT, the best result is obtained for a feature vector size of 512. For DFT, the maximum accuracy obtained is 95.2381% for a feature vector size of 1024. The Walsh transform gives a maximum accuracy of around 84.7619%. DST performs best, giving a maximum accuracy of 96.1905% for a feature vector size of 1024.

V. CONCLUSION

In this paper we have compared the performance of four different transforms for speaker identification. All the transforms give an accuracy of more than 80% for the feature vector sizes considered. Accuracy increases as the feature vector size is increased from 64 onwards, but for feature vector sizes of more than 1024 the accuracy starts decreasing again. The results show that DST performs best. The maximum accuracy obtained for DST is around 96% for a feature vector size of 1024. The present study is ongoing and we are analyzing the performance of other transforms.

REFERENCES

[1] Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, “Fundamental of Speech Recognition”, Prentice-Hall, Englewood Cliffs, 2009.
[2] S. Furui, “50 years of progress in speech and speaker recognition research”, ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3] D. A. Reynolds, “An overview of automatic speaker recognition technology,” Proc. IEEE Int. Conf. Acoust., Speech,S
[4] Joseph P. Campbell, Jr., Senior Member, IEEE, “Speaker Recognition: A Tutorial”, Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997.
[5] S. Furui, “Recent advances in speaker recognition”, AVBPA97, pp. 237-251, 1997.
[6] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430-451, 2004.
[7] D. A. Reynolds, “Experimental evaluation of features for robust speaker identification,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[8] Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, “Realtime Speaker Identification”, ICSLP2004.
[9] Marco Grimaldi and Fred Cummins, “Speaker Identification using Instantaneous Frequencies”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10] Zhong-Xuan, Yuan & Bo-Ling, Xu & Chong-Zhi, Yu. (1999). “Binary Quantization of Feature vectors for robust text-independent Speaker Identification” in IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[11] H B Kekre, Vaishali Kulkarni, “Speaker Identification by using Vector Quantization”, International Journal of Engineering Science and Technology, May 2010.
[12] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG”, International Journal of Computer Applications, vol. 3, July 2010.
[13] H B Kekre, Vaishali Kulkarni, “Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG KFCG and KMCG”, International Journal of Computer Science and Security, Vol. 4, Issue 4, 2010.
[14] H B Kekre, Vaishali Kulkarni, “Comparative Analysis of Automatic Speaker Recognition using Kekre’s Fast Codebook Generation Algorithm in Time Domain and Transform Domain”, International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[15] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Performance Comparision of Image Retrieval using Row Mean of Transformed Column Image”, International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, 1908-1912.
[16] Dr. H. B. Kekre, Sudeep Thepade, “Edge Texture Based CBIR using Row Mean of Transformed Column Gradient Image”, International Journal of Computer Applications (0975 - 8887), Volume 7, No. 10, October 2010.
[17] Dr. H. B. Kekre, Sudeep D. Thepade, Akshay Maloo, “Eigenvectors of Covariance Matrix using Row Mean and Column Mean Sequences for Face Recognition”, International Journal of Biometrics and Bioinformatics (IJBB), Volume (4): Issue (2).
[18] Dr. H. B. Kekre, Sudeep Thepade, Archana Athawale, “Grayscale Image Retrieval using DCT on Row mean, Column mean and Combination”, Journal of Sci., Engg. & Tech. Mgt., Vol 2 (1), January 2010.
[19] Dr. H. B. Kekre, Dr. T. K. Sarode, Shachi J. Natu, Prachi J. Natu, “Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram”, International Journal of Computer Applications (0975 - 8887), Volume 5, No. 6, August 2010.
[20] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete Cosine Transform”, IEEE Trans. Computers, 90-93, Jan 1974.
[21] N. Ahmed, “How I came up with the Discrete Cosine Transform”, Digital Signal Processing, Vol. 1, 1991.
[22] G. Strang, “The Discrete Cosine Transform,” SIAM Review, Volume 41, Number 1, 1999.
[23] Fino, B.J., and Algazi, V.R., 1976, “Unified Matrix Treatment of the Fast Walsh–Hadamard Transform,” IEEE Transactions on Computers 25: 1142-1146.

AUTHORS PROFILE

Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked for over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech Projects and several B.E./B.Tech Projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in National / International Conferences / Journals to his credit. Recently twelve students working under his guidance have received best paper awards. Recently two research scholars have received the Ph.D. degree from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.


Vaishali Kulkarni has received B.E. in Electronics Engg. from Mumbai University in 1997 and M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. from NMIMS University. She has a teaching experience of more than 8 years. She is Associate Professor in the Telecom Department at MPSTME, NMIMS University. Her areas of interest include Speech processing: Speech and Speaker Recognition. She has 8 papers in National / International Conferences / Journals to her credit.
