									                                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                        Vol. 9, No. 3, 2011

    Performance Comparison of Speaker Identification
          using circular DFT and WHT Sectors
Dr. H B Kekre (1), Vaishali Kulkarni (2), Indraneal Balasubramanian (3), Abhimanyu Gehlot (4), Rasik Srinath (5)
(1) Senior Professor, Computer Dept., MPSTME, NMIMS University.
(2) Associate Professor, EXTC Dept., MPSTME, NMIMS University.
(3, 4, 5) Students, B-Tech EXTC, MPSTME, NMIMS University.
indraneal89@gmail.com, abhimanyu13090@gmail.com, rasik90@gmail.com

Abstract - This paper presents an approach to text dependent speaker identification using two transform techniques, the DFT (Discrete Fourier Transform) and the WHT (Walsh Hadamard Transform). In the first method, the feature vectors are extracted by dividing the complex DFT spectrum into circular sectors and then taking the weighted density count of the number of points in each of these sectors. In the second method, the feature vectors are extracted in the same way from the WHT spectrum. A comparison of the two transforms shows that the accuracy obtained for DFT (80%) is higher than that obtained for WHT (66%).

Keywords - Speaker identification; Circular Sectors; weighted density; Euclidean distance

I.    INTRODUCTION

Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1-4]. Speaker recognition systems are usually classified into two subdivisions, speaker identification and speaker verification [2-5]. Speaker identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [7]. Speaker verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [8]. Speaker identification can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. In a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system [4, 5, 8].

Speaker recognition systems find use in many applications today, including automated call processing in telephone networks and query systems for stock information, weather reports and similar services. However, difficulties in the wide deployment of such systems are a practical limitation that is yet to be overcome [2, 6, 7, 9, 10]. We have proposed speaker identification using power distribution in the frequency domain [11], [12]. We have also proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde Buzo Gray), KFCG (Kekre’s Fast Codebook Generation) and KMCG (Kekre’s Median Codebook Generation) algorithms [13-15], and in the transform domain using DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) and DST (Discrete Sine Transform) [16].

The concept of sectorization has been used for content based image retrieval (CBIR) [17-21]. We have proposed speaker identification using circular DFT sectors [22]. In this paper, we propose speaker identification using WHT (Walsh Hadamard Transform) sectors and compare the results with those of DFT sectors. Fig. 1 shows how a basic speaker identification system operates. A number of speech samples are collected from a variety of speakers, and their features are extracted and stored as reference models in a database. When a speaker is to be identified, the features of his speech are extracted and compared with all of the reference speaker models. The reference model which gives the minimum Euclidean distance from the feature vector of the person to be identified is taken as the most likely model, and that speaker is declared as the person identified.

Figure 1. Speaker Identification System
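The matching step of the system in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout of the reference models and the numeric values are hypothetical and are not taken from the paper.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_speaker(test_features, reference_models):
    """Return the label of the reference model nearest to test_features.

    reference_models maps a speaker label to that speaker's stored
    feature vector (a hypothetical layout chosen for this sketch).
    """
    return min(
        reference_models,
        key=lambda label: euclidean_distance(test_features, reference_models[label]),
    )


# Made-up reference models and test vector, for illustration only.
reference_models = {
    "speaker_a": [1.0, 4.0, 9.0],
    "speaker_b": [2.0, 2.0, 2.0],
}
print(identify_speaker([1.5, 3.5, 8.0], reference_models))
```

In a closed-set system the nearest model is always returned, so every test utterance is assigned to one of the enrolled speakers.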

                                                                     139                                http://sites.google.com/site/ijcsis/
                                                                                                        ISSN 1947-5500
A. Discrete Fourier Transform (DFT)
The DFT transforms time or space based data into frequency-based data. It allows the component frequencies in data to be estimated efficiently from a discrete set of values sampled at a fixed rate [23, 24]. If the speech signal is represented by y(t), then the DFT of the time series of samples y_0, y_1, y_2, ..., y_{N-1} is defined as given by (1):

    $Y_k = \sum_{n=0}^{N-1} y_n \, e^{-j 2 \pi k n / N}$                (1)

where $y_n = y_s(n \Delta t)$; k = 0, 1, 2, ..., N-1; and Δt is the sampling interval.

B. Walsh Hadamard Transform
The Walsh transform, or Walsh-Hadamard transform, is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh-Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. The returned sequency values can be used to estimate the signal frequencies in the original signal. The Walsh-Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and for spread-spectrum analysis [25]. Like the FFT, the Walsh-Hadamard transform has a fast version, the fast Walsh-Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. FWHT_h is a divide and conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix $H_N$ given by (2):

    $H_N = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{pmatrix}$                (2)

The $1/\sqrt{2}$ normalization factors for each stage may be grouped together or even omitted. The sequency ordered, also known as Walsh ordered, fast Walsh-Hadamard transform, FWHT_w, is obtained by computing the FWHT_h as above and then rearranging the outputs.

The rest of the paper is organized as follows: Section II explains the sectorization process, Section III explains the feature extraction using the density of the samples in each of the sectors, Section IV deals with feature matching and the results, and Section V gives the conclusion.

II.    SECTORIZATION OF THE COMPLEX TRANSFORM PLANES
The speech signal has an amplitude range from -1 to +1. It is first converted into positive values by adding +1 to all the sample values, so that the amplitude range of the speech signal becomes 0 to 2. Two methods are used for sectorization, which are described below.

A. DFT Sectorization
The algorithm for DFT sectorization is given below:
1. The DFT of the speech signal is computed. Since the DFT of a real signal is symmetrical, only half of the points in the DFT are considered while drawing the complex DFT plane (i.e. Yreal vs. Yimag).
2. The first point in the DFT is a real number, so it is considered separately while forming the feature vector. The complex plane is therefore drawn only from points 2 to N/2, where N is the number of points in the DFT. Fig. 2 shows the original speech signal and its complex DFT plane for one of the samples in the database.
3. For dividing the complex plane into sectors, the magnitude of the DFT is taken as the radius of the circular sector as in (3):

       $R = \sqrt{Y_{real}^2 + Y_{imag}^2}$                (3)

4. Table I shows the range of the radius taken for dividing the DFT plane into circular sectors.
5. The maximum range of the radius for forming the sectors was found by experimenting on the different samples in the database. Various combinations of the range were tried and the values given in Table I were found to be satisfactory. Fig. 3 shows the seven sectors formed for the complex plane shown in Fig. 2. Different colours have been used to show the different sectors.
6. The seven circular sectors were further divided into four quadrants each as given by Table II. Thus we get 28 sectors for each of the samples. Fig. 4 shows the 28 sectors formed for the sample shown in Fig. 2.

Figure 2. Speech signal and its complex DFT plane (upper panel: amplitude vs. No. of samples)
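The DFT sectorization steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the DFT is a direct O(N^2) sum written only to keep the example self-contained, and three details are assumptions because the paper does not state them: a radius on a shared boundary in Table I goes to the lower sector, a point on an axis goes to the first matching row of Table II, and points with a radius above 256 are ignored.

```python
import cmath

# Upper radius limits of the seven circular sectors (Table I).
RADIUS_BOUNDS = [4, 8, 16, 32, 64, 128, 256]


def dft(samples):
    """Direct O(N^2) DFT following (1); for illustration only."""
    n = len(samples)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * i / n) for i, y in enumerate(samples))
        for k in range(n)
    ]


def dft_plane(samples):
    """Shift samples from [-1, 1] to [0, 2], then return (dc term, points 2..N/2)."""
    spectrum = dft([s + 1.0 for s in samples])
    half = len(spectrum) // 2
    # 1-based points 2..N/2 correspond to 0-based indices 1..N/2-1.
    return spectrum[0], spectrum[1:half]


def radial_sector(z):
    """0-based circular sector index from Table I, or None if the radius exceeds 256."""
    radius = abs(z)
    for index, upper in enumerate(RADIUS_BOUNDS):
        if radius <= upper:  # assumption: boundary values fall in the lower sector
            return index
    return None


def quadrant(z):
    """0-based quadrant index from Table II (first matching row wins on an axis)."""
    if z.real >= 0 and z.imag >= 0:
        return 0
    if z.real <= 0 and z.imag >= 0:
        return 1
    if z.real <= 0 and z.imag <= 0:
        return 2
    return 3


def sector_counts(points, use_quadrants=True):
    """Count the points falling in each of the 7 or 28 sectors."""
    counts = [0] * (28 if use_quadrants else 7)
    for z in points:
        ring = radial_sector(z)
        if ring is None:
            continue  # assumption: points outside the largest radius are skipped
        counts[ring * 4 + quadrant(z) if use_quadrants else ring] += 1
    return counts
```

For a recorded utterance, `sector_counts(dft_plane(samples)[1])` would give the 28 point counts, and `use_quadrants=False` gives the seven-sector variant.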

TABLE I.    RADIUS RANGE OF THE CIRCULAR SECTORS

      Sr. No.       Radius range          Sector          Weighting factor
      1             0≤R≤4                 Sector1         2/256
      2             4≤R≤8                 Sector2         6/256
      3             8≤R≤16                Sector3         12/256
      4             16≤R≤32               Sector4         24/256
      5             32≤R≤64               Sector5         48/256
      6             64≤R≤128              Sector6         96/256
      7             128≤R≤256             Sector7         192/256

Figure 3. Circular Sectors of the complex DFT plane of the speech sample shown in Fig. 2

TABLE II.    DIVISION INTO FOUR QUADRANTS

      Sr. No.       Value                         Quadrant
      1             Xreal≥0 & Ximag≥0             1 (0° - 90°)
      2             Xreal≤0 & Ximag≥0             2 (90° - 180°)
      3             Xreal≤0 & Ximag≤0             3 (180° - 270°)
      4             Xreal≥0 & Ximag≤0             4 (270° - 360°)

Figure 4. Sectorization of DFT plane into 28 sectors for the speech sample shown in Fig. 2

B. WHT Sectorization
The algorithm for Walsh sectorization is given below:
1. The WHT of the speech signal is taken using the FWHT (Fast Walsh Hadamard Transform).
2. The WHT can be represented as (C0, S0, C1, S1, C2, S2, ..., CN-1, SN-1), where C represents a Cal term and S represents a Sal term.
3. The Walsh transform matrix is real, but it can be made complex by multiplying all Sal components by j. The first term, C0, represents the dc value. The complex plane is therefore formed by combining S0 with C1, S1 with C2 and so on. In this case SN-1 is left out. Thus C0 and SN-1 are considered separately.
4. The complex Walsh transform is then divided into circular sectors using the radius given by (4). Again the radial sectors are formed using the radius ranges shown in Table I.

       $R = \sqrt{Y_{cal}^2 + Y_{sal}^2}$                (4)

5. The seven circular sectors were further divided into four quadrants as explained in (A) by using Table II. Thus we get 28 sectors for each of the samples.

III.   FEATURE VECTOR EXTRACTION
For feature vector generation, the number of points in each of the sectors is first counted. The feature vector component for each sector is then calculated according to (5).

Feature vector = ((count / n1) * weighting factor) * 10000                   (5)

For DFT, the first value, i.e. the dc component, is accounted for as in (6). For WHT, C0 is accounted for as given by (6) and SN-1 as given by (7). Overall there are eight components in the feature vector for DFT (one per sector and the first term).
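The pairing in step 3 of the WHT sectorization and the weighted density of (5), together with the first and last terms of (6) and (7), can be sketched as below. The function names are hypothetical. The paper does not define n1, so it is treated here as an assumption: the total number of points in the complex plane, passed in by the caller. Only the seven-sector case is shown, since the paper does not say which weighting factor each quadrant of a sector receives.

```python
import math

# Weighting factors of the seven circular sectors (Table I).
WEIGHTS = [2 / 256, 6 / 256, 12 / 256, 24 / 256, 48 / 256, 96 / 256, 192 / 256]


def wht_complex_plane(coeffs):
    """Split sequency-ordered WHT coefficients (C0, S0, C1, S1, ...).

    Returns (C0, last Sal term, complex points), pairing S0 with C1,
    S1 with C2 and so on, with each Sal term as the imaginary part.
    """
    if len(coeffs) < 2 or len(coeffs) % 2 != 0:
        raise ValueError("expected an even number of coefficients")
    cal = coeffs[2::2]    # C1, C2, ...
    sal = coeffs[1:-1:2]  # S0, S1, ... (the last Sal term is left out)
    points = [complex(c, s) for c, s in zip(cal, sal)]
    return coeffs[0], coeffs[-1], points


def edge_term(value):
    """First or last term of the feature vector, as in (6) and (7)."""
    return math.sqrt(abs(value))


def weighted_density(counts, n1, weights=WEIGHTS, scale=10000):
    """Per-sector feature values as in (5); n1 is an assumed point total."""
    if len(counts) != len(weights):
        raise ValueError("one weighting factor is needed per sector")
    return [(count / n1) * w * scale for count, w in zip(counts, weights)]


def feature_vector(counts, n1, first_value, last_value=None):
    """First term, sector densities and, for WHT only, the last term."""
    features = [edge_term(first_value)] + weighted_density(counts, n1)
    if last_value is not None:
        features.append(edge_term(last_value))
    return features
```

Calling `feature_vector` without `last_value` gives the eight-component DFT vector, and passing the last Sal term gives the nine-component WHT vector.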

Similarly, there are nine components in the feature vector for WHT (one per sector, the first term and the last term) when the seven circular sectors are considered. When 28 sectors are considered there are 29 components in the feature vector for DFT (one per sector and the first term) and 30 components in the feature vector for WHT (one per sector, the first term and the last term).

First term = sqrt (abs (first value of DFT/WHT))                (6)

Last term = sqrt (abs (last value of FWHT))                     (7)

IV.    RESULTS

A. Database description
The speech samples used in this work were recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table III shows the database description. The samples were collected from different speakers. Samples were taken from each speaker in two sessions so that a training model and testing data could be created. Twelve samples per speaker were taken. The samples recorded in one session are kept in the database and the samples recorded in the second session are used for testing.

TABLE III.    DATABASE DESCRIPTION

       Parameter                        Sample characteristics
       Language                         English
       No. of Speakers                  30
       Speech type                      Read speech
       Recording conditions             Normal (a silent room)
       Sampling frequency               8000 Hz
       Resolution                       8 bits per sample

B. Experimentation
This algorithm was tested for text dependent speaker identification. Feature vectors for both the methods described in Section II were calculated as shown in Section III. For testing, the test sample is processed in the same way and its feature vector is calculated. For recognition, the Euclidean distance between the features of the test sample and the features of all the samples stored in the database is computed. The sample in the database for which the Euclidean distance is minimum is declared as the speaker recognized.

C. Accuracy of Identification
The accuracy of the identification system is calculated as the percentage of test samples that are identified correctly.

Fig. 5 shows the results obtained for DFT sectorization. When the complex DFT plane is divided into seven sectors, the maximum accuracy is around 80%, and it decreases as the number of samples in the database is increased (64% for 30 samples). Accuracy increases when the number of sectors into which the complex DFT plane is divided is increased from 7 to 28. With 28 sectors, the maximum accuracy is 80% up to 20 samples, after which it decreases. When the complex plane is further divided into 56 sectors, there is an improvement in accuracy for a smaller number of samples, but as the number of samples is increased the performance is similar to that with 28 sectors.

Figure 5. Accuracy for DFT Sectorization

Fig. 6 shows the results obtained for WHT sectorization. Here also, accuracy improves as the number of sectors is increased from 7 to 28, but further division into 56 sectors does not give any advantage. Overall the results obtained for DFT are better than those obtained for WHT.

Figure 6. Accuracy for WHT Sectorization
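The accuracy figures quoted above are percentages of correct identifications. A small sketch of that calculation follows. It assumes accuracy means the share of test samples whose predicted speaker matches the true speaker; the function name and the labels are hypothetical.

```python
def identification_accuracy(predicted, actual):
    """Percentage of test samples whose predicted speaker matches the true one."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Made-up labels: four of five test samples are identified correctly.
predicted = ["s1", "s2", "s3", "s4", "s1"]
actual = ["s1", "s2", "s3", "s4", "s5"]
print(identification_accuracy(predicted, actual))
```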

V.     CONCLUSION

Speaker identification using the concept of sectorization has been proposed in this paper. The complex DFT and WHT planes have been divided into circular sectors and feature vectors have been calculated using weighted density. Accuracy increases when the 7 circular sectors are divided into 28 sectors for both the transform techniques, but there is no significant improvement when the complex plane is divided further. The results also show that the performance of DFT is better than that of WHT.

REFERENCES

[1]    Lawrence Rabiner, Biing-Hwang Juang and B. Yegnanarayana, “Fundamental of Speech Recognition”, Prentice-Hall, Englewood Cliffs, 2009.
[2]    S. Furui, “50 years of progress in speech and speaker recognition research”, ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2, November 2005.
[3]    D. A. Reynolds, “An overview of automatic speaker recognition technology,” Proc. IEEE Int. Conf. Acoust., Speech,S on Speech and Audio Processing, Vol. 7, No. 1, January 1999. IEEE, New York, NY, U.S.A.
[4]    S. Furui, “Recent advances in speaker recognition”, AVBPA97, pp. 237-251, 1997.
[5]    J. P. Campbell, “Speaker recognition: A tutorial,” Proceedings of the IEEE, vol. 85, pp. 1437-1462, September 1997.
[6]    D. A. Reynolds, “Experimental evaluation of features for robust speaker identification,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 639-643, Oct. 1994.
[7]    Tomi Kinnunen, Evgeny Karpov, and Pasi Fränti, “Realtime Speaker Identification”, ICSLP2004.
[8]    F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP J. Appl. Signal Process., vol. 2004, no. 1, pp. 430-451, 2004.
[9]    Marco Grimaldi and Fred Cummins, “Speaker Identification using Instantaneous Frequencies”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, August 2008.
[10]   Zhong-Xuan, Yuan & Bo-Ling, Xu & Chong-Zhi, Yu. (1999). “Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification” in IEEE Transactions.
[11]   Dr. H B Kekre, Vaishali Kulkarni, “Speaker Identification using Power Distribution in Frequency Spectrum”, Technopath, Journal of Science, Engineering & Technology Management, Vol. 02, No. 1, January 2010.
[12]   Dr. H B Kekre, Vaishali Kulkarni, “Speaker Identification by using Power Distribution in Frequency Spectrum”, ThinkQuest - 2010 International Conference on Contours of Computing Technology, BGIT, Mumbai, 13th-14th March 2010.
[13]   H B Kekre, Vaishali Kulkarni, “Speaker Identification by using Vector Quantization”, International Journal of Engineering Science and Technology, May 2010.
[14]   H B Kekre, Vaishali Kulkarni, “Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG”, International Journal of Computer Applications, vol. 3, July 2010.
[15]   H B Kekre, Vaishali Kulkarni, “Performance Comparison of Automatic Speaker Recognition using Vector Quantization by LBG KFCG and KMCG”, International Journal of Computer Science and Security, Vol. 4, Issue 5, 2010.
[16]   H B Kekre, Vaishali Kulkarni, “Comparative Analysis of Automatic Speaker Recognition using Kekre’s Fast Codebook Generation Algorithm in Time Domain and Transform Domain”, International Journal of Computer Applications, Volume 7, No. 1, September 2010.
[17]   H B Kekre, Dhirendra Mishra, “Performance Comparison of Density Distribution and Sector mean of sal and cal functions in Walsh Transform Sectors as Feature Vectors for Image Retrieval”, International Journal of Image Processing, Volume 4, Issue 3, 2010.
[18]   H B Kekre, Dhirendra Mishra, “CBIR using Upper Six FFT Sectors of Color Images for Feature Vector Generation”, International Journal of Engineering and Technology, Volume 2(2), 2010.
[19]   H B Kekre, Dhirendra Mishra, “Performance Comparison of Four, Eight & Twelve Walsh Transform Sectors Feature Vectors for Image Retrieval from Image Databases”, International Journal of Engineering Science and Technology, Volume 2(5), 2010.
[20]   H B Kekre, Dhirendra Mishra, “Four Walsh Transform Sectors Feature Vectors for Image Retrieval from Image Databases”, International Journal of Computer Science and Information Technologies, Volume 1(2), 2010.
[21]   H B Kekre, Dhirendra Mishra, “Digital Image Search & Retrieval using FFT Sectors of Color Images”, International Journal of Computer Science and Engineering, Volume 2, No. 2, 2010.
[22]   H B Kekre, Vaishali Kulkarni, “Automatic Speaker Recognition using circular DFT Sector”, International Conference and Workshop on Emerging Trends in Technology (ICWET 2011), 25-26 February, 2011.
[23]   Bergland, G. D. "A Guided Tour of the Fast Fourier Transform." IEEE Spectrum 6, 41-52, July 1969.
[24]   Walker, J. S. Fast Fourier Transform, 2nd ed. Boca Raton, FL: CRC Press, 1996.
[25]   Terry Ritter, Walsh-Hadamard Transforms: A Literature Survey, Aug. 1996.

AUTHORS PROFILE

Dr. H. B. Kekre received a B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, an M.Tech (Industrial Electronics) from IIT Bombay in 1960, an M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and a Ph.D. (System Identification) from IIT Bombay in 1970. He worked for over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For the last 13 years he worked as a Professor in the Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor at Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS University, Vile Parle (W), Mumbai, India. He has guided 17 Ph.D.s, 150 M.E./M.Tech projects and several B.E./B.Tech projects. His areas of interest are Digital Signal Processing, Image Processing and Computer Networks. He has more than 300 papers in National / International Conferences / Journals to his credit. Recently twelve students working under his guidance have received best paper awards. Recently two research scholars have received Ph.D. degrees from NMIMS University. Currently he is guiding ten Ph.D. students. He is a member of ISTE and IETE.

Vaishali Kulkarni received a B.E. in Electronics Engg. from Mumbai University in 1997 and an M.E. (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing a Ph.D. at NMIMS University. She has teaching experience of more than 8 years. She is Associate Professor in the Telecom Department at MPSTME, NMIMS University. Her areas of interest include speech processing, in particular speech and speaker recognition. She has 10 papers in National / International Conferences / Journals to her credit.
