                  ROBUST SPEAKER IDENTIFICATION/VERIFICATION
                        FOR TELEPHONY APPLICATIONS


          ENG 499 CAPSTONE PROJECT COURSE


A project report submitted to SIM University in partial fulfillment of the

      requirements for the Bachelor Degree of Electronic Engineering



               STUDENT:            Kyaw Thu Ta (H0605351)
               SUPERVISOR:         Sirajudeen s/o Gulam Razul (Dr)
               PROJECT CODE: Jan09/BEHE/12


        SCHOOL OF SCIENCE AND TECHNOLOGY
                         SIM UNIVERSITY

                           November 2009
Robust Speaker Identification/Verification                      Kyaw Thu Ta (H0605351)
For Telephony Applications

                                    Acknowledgements


First, I would like to thank my project supervisor, Dr Sirajudeen Gulam Razul, for his
guidance and support, especially for being an endless supply of ideas and knowledge
throughout the project. His expert experience in biometric research and his valuable
comments and suggestions have been very useful in solving the problems encountered in
the project.


I gratefully acknowledge the UniSIM project instructors for providing me the chance to study
the exciting and challenging research area of speaker recognition techniques.


I sincerely thank my employer, Seagate Technology, for allowing me to pursue further
studies towards a Bachelor's degree. I am also grateful to my manager, Mr. Daniel Gei, my
engineer, Mr. L K Sea, and my senior, Mr. Derek Chua, for giving me time off during the
course of study.


Finally, my special thanks go to my parents, brothers and sisters for their love, inspiration
and constant moral support during my academic years.





                                          Abstract


Speaker recognition (identification/verification) is the computing task of validating a user's
claimed identity using speaker-specific information contained in speech waves; that is, it
enables access control of various services by voice.


Automatic speaker recognition has developed into an increasingly important technology
required by many speech-aided applications. The main challenge for automatic speaker
recognition is dealing with the variability of the environments and channels in which the
speech was obtained. Gaussian Mixture Model (GMM) based systems for speaker
recognition have shown robust results for several years and are widely used in speaker
recognition applications.


In this project, a Gaussian Mixture Model (GMM) text-independent speaker verification
approach was applied due to its high verification accuracy. The TIMIT database was used for
speaker verification, and a GMM speaker recognition algorithm was converted into a speaker
verification system.


This project investigated the impact of the telephone channel on the speaker verification
system by simulating telephone-quality speech. To obtain better accuracy for telephone-
channel speaker verification, we implemented and applied inverse filtering to the telephone-
quality speech. A comparison was performed among the clean speech simulation, the
telephone-quality speech simulation, which uses speech low-pass filtered at a cut-off
frequency of 3.4 kHz for testing, and the enhanced telephone-quality speech simulation.


One of the effects that can greatly degrade the performance of a speech processing system is
noise. We investigate the performance of the speaker verification system by adding noise to
clean speech at various signal-to-noise ratios (SNR).


The comparison results show that implementing the inverse filter gives results comparable to
training with telephone-quality speech. For the noisy-speech simulation, the lower the SNR,
the greater the impact on the accuracy of the speaker verification system.





                                                       List of Figures
Figure 1.1: Structure of Speaker Recognition System ............................................................... 2
Figure 2.1: Basic structure of Biometric system ........................................................................ 6
Figure 2.2: Basic structure of Speaker recognition systems ...................................................... 8
Figure 2.3: Typical user authentication in Phone banking system .......................................... 12
Figure 3.1: Project Gantt chart ................................................................................................. 15
Figure 4.1: Speech samples of two different persons in TIMIT database ............................... 17
Figure 4.2: General filter bank ................................................................................................. 19
Figure 5.1: Client and impostor scores overlapping illustration .............................................. 26
Figure 5.2: Equal Error Rate (EER) ......................................................................................... 26
Figure 5.3: ROC for clean speech speaker verification system ............................................... 28
Figure 5.4: EER for clean speech speaker verification system ................................................ 29
Figure 5.5: Flow diagram for telephone channel simulation ................................................... 29
Figure 5.6: Frequency response of the Low-Pass Filter .......................................................... 30
Figure 5.7: Clean speech and telephone quality speech .......................................................... 30
Figure 5.8: ROC for telephone quality speech speaker verification system ............................ 31
Figure 5.9: EER for telephone quality speech speaker verification system ............................ 31
Figure 5.10: ROC for telephone quality speech testing with clean database........................... 32
Figure 5.11: EER for telephone quality speech testing with clean database ........................... 33
Figure 5.12: Flow diagram for inverse filtering....................................................................... 33
Figure 5.13: Flow diagram of adaptive filter ........................................................................... 34
Figure 5.14: Frequency response of adaptive filter ................................................................. 34
Figure 5.15: Telephone quality speech and the speech after inverse filtering ......................... 34
Figure 5.16: ROC for speech after inverse filtering the telephone quality speech testing with
                  clean database ..................................................................................................... 35
Figure 5.17: EER for speech after inverse filtering the telephone quality speech testing with
                  clean database ..................................................................................................... 36
Figure 5.18: Flow diagram of adding noise to clean speech.................................................... 36
Figure 5.19: Clean speech and speech with white noise .......................................................... 37
Figure 5.20: ROC for noise speech (SNR 10 dB) testing with clean database ........................ 37
Figure 5.21: EER for noise speech (SNR 10 dB) testing with clean database ........................ 38
Figure 5.22: ROC for noise speech (SNR 30 dB) testing with clean database ........................ 39
Figure 5.23: EER for noise speech (SNR 30 dB) testing with clean database ........................ 40




                                                      List of Tables
Table 1.1: Speaker Verification database arrangements ............................................................ 3
Table 3.1: Detail Project Plan .................................................................................................. 14
Table 4.1: Speaker database arrangement ................................................................................ 16
Table 5.1: Experimental results (testing) for clean speech speaker verification system ......... 27
Table 5.2: Experimental results (final) for clean speech speaker verification system ............. 28
Table 5.3: Experimental results for telephone quality speech speaker verification system .... 30
Table 5.4: Experimental results of speaker verification system for telephone quality speech
              testing with clean database ...................................................................................... 32
Table 5.5: Experimental results of speaker verification system for speech after inverse
              filtering the telephone quality speech with clean database ..................................... 35
Table 5.6: Experimental results of speaker verification system for SNR 10 dB noise speech
              testing with clean database ...................................................................................... 37
Table 5.7: Experimental results of speaker verification system for SNR 15 dB noise speech
              testing with clean database ...................................................................................... 38
Table 5.8: Experimental results of speaker verification system for SNR 20 dB noise speech
              testing with clean database ...................................................................................... 38
Table 5.9: Experimental results of speaker verification system for SNR 25 dB noise speech
              testing with clean database ...................................................................................... 39
Table 5.10: Experimental results of speaker verification system for SNR 30 dB noise speech
              testing with clean database ...................................................................................... 39
Table 5.11: Experimental Results Summary ........................................................................... 40
Table 5.12: Experimental Results Summary for noise speech ................................................ 41





                                                     Table of Contents

Acknowledgements .................................................................................................................... i
Abstract .................................................................................................................................... ii
List of Figures .......................................................................................................................... iii
List of Tables ........................................................................................................................... iv

CHAPTER 1: Introduction ........................................................................................................1
   1.1         Overview of the project .............................................................................................1
   1.2         Project objective.........................................................................................................4
   1.3         Project Scope .............................................................................................................4

CHAPTER 2: Literature Review ...............................................................................................5
   2.1         Review on the operation of biometric system............................................................5
   2.2         Reviews on the operation of Speaker Recognition system ........................................7
   2.3         Feature vector analysis ...............................................................................................9
   2.4         Why simulation over telephone channel? ................................................................11

CHAPTER 3: PROJECT PLAN ..............................................................................................13

CHAPTER 4: Design of the Speaker Verification System ......................................................16
   4.1         Speech database .......................................................................................................16
   4.2         Feature Extraction ....................................................................................................17
        4.2.1        Mel-Frequency Cepstral Coefficients (MFCC) .................................................18
   4.3         Speaker Modeling ....................................................................................................20
        4.3.1        Gaussian Mixture Model (GMM) .....................................................................20
        4.3.2        Maximum Likelihood Training .........................................................................21
   4.4         Speaker Match Score ...............................................................................................22
   4.5         Outlines of speaker verification system algorithms .................................................23

CHAPTER 5: Experimental results .........................................................................................25
   5.1         Performance measure of the speaker verification systems ......................................25
   5.2         Experiment Approach ..............................................................................................27
   5.3         Speaker Verification Experimental results ..............................................................27
        5.3.1        Clean speech training and clean speech testing.................................................27





        5.3.2        Telephone quality speech training and telephone quality speech testing ..........29
        5.3.3        Clean speech training and telephone quality speech testing .............................32
        5.3.4        Clean speech training and speech after inverse filtering telephone quality
                     speech testing ....................................................................................................33
        5.3.5        Clean speech training and speech adding noise to clean speech testing ...........36
   5.4         Comparison of Experimental Results ......................................................................40

CHAPTER 6: Conclusion and Recommendation ....................................................................42
   6.1         Conclusion ...............................................................................................................42
   6.2         Recommendations for future study ..........................................................................43

Critical Review and Reflections ..............................................................................................44
Bibliography ............................................................................................................................46

APPENDICES .........................................................................................................................49
   Appendix A:                Wave Conversion Algorithms..................................................................49
        A.1          Clean Speech Wave Conversion Algorithm ......................................................49
        A.2          Telephone Quality Speech (simulation) Wave Conversion Algorithm .............55
        A.3          Low-Pass Filter Frequency Response Program .................................................56
        A.4          Adaptive Filter training algorithm and its frequency response .........................56
        A.5          Wave Conversion Algorithm by using LMS adaptive filter .............................57
        A.6          Noise speech Conversion Algorithm .................................................................59
   Appendix B:                Speaker Verification System Algorithms ................................................61
        B.1          Speaker Training Module Algorithm ................................................................61
        B.2          Speaker Background Model Algorithm ............................................................62
        B.3          Speaker Client Evaluation Algorithm ...............................................................63
        B.4          Speaker Impostor Evaluation Algorithm ...........................................................65
        B.5          ROC and EER Algorithm ..................................................................................67

Glossary ...................................................................................................................................69





                                          PART 1
CHAPTER 1: Introduction

1.1    Overview of the project

       Every sector in the modern world needs secure and easy access to databases. Many
sectors have used different methods of verifying a user's identity, including memory-based
security (such as passwords) and token-based mechanisms (such as smart cards or ID cards).
These traditional methods of user authentication are normally based on something that can be
forgotten, or something that can be disclosed, lost or stolen. As a result, the system cannot
differentiate between a client and an impostor. Hence they are not dependable identification
systems in the modern world.


       In order to provide secure access to databases, biometric authentication can be
introduced in systems or services to restrict their use to authorized people only. It is normally
based on the user's physiological and behavioral characteristics, such as fingerprints or voice,
which are rather distinct and not transferable. Physiological characteristics include hand or
finger images, facial characteristics and iris patterns. Behavioral characteristics include
speaker verification, dynamic signature verification and keystroke dynamics.


       Voice biometrics can be separated into speaker identification and speaker verification.
Speaker identification is the process of determining from which of the registered speakers a
given utterance comes. Speaker verification is the process of accepting or rejecting the
identity claim of a speaker [23]. In both verification and identification processes, an
additional threshold test can be used to determine whether the identity claim is accepted or
rejected. A high threshold makes it difficult for an impostor to be accepted by the system, but
at the risk of rejecting the customer. Conversely, a low threshold ensures that the customer is
accepted consistently, but at the risk of accepting impostors [23].


        In order to set a threshold at a desired level of customer rejection (false rejection) and
impostor acceptance (false acceptance), it is necessary to know the distributions of customer
and impostor scores. The effectiveness of speaker verification systems can be evaluated
using the receiver operating characteristic (ROC) curve adopted from psychophysics [23].
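As a hedged illustration of how the customer and impostor score distributions determine the ROC curve and the equal error rate (EER), the sketch below sweeps a decision threshold over synthetic scores. The Gaussian score distributions are purely hypothetical placeholders for the scores a real verifier would produce.

```python
import numpy as np

def roc_eer(client_scores, impostor_scores, num_thresholds=1000):
    """Sweep a decision threshold over the score range and return the
    false-rejection rate, false-acceptance rate and equal error rate."""
    scores = np.concatenate([client_scores, impostor_scores])
    thresholds = np.linspace(scores.min(), scores.max(), num_thresholds)
    # False rejection: a client score falls below the threshold.
    frr = np.array([(client_scores < t).mean() for t in thresholds])
    # False acceptance: an impostor score reaches the threshold.
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    # The EER is where the two error curves cross.
    idx = np.argmin(np.abs(far - frr))
    eer = (far[idx] + frr[idx]) / 2.0
    return far, frr, eer

# Toy, well-separated client/impostor score distributions (hypothetical).
rng = np.random.default_rng(0)
clients = rng.normal(2.0, 1.0, 500)
impostors = rng.normal(-2.0, 1.0, 500)
far, frr, eer = roc_eer(clients, impostors)
```

Plotting `far` against `1 - frr` (or against `frr`) over the threshold sweep traces out the ROC curve used to compare the systems in Chapter 5.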




       The fundamental difference between identification and verification is the number of
decision alternatives. In identification, the number of decision alternatives is equal to the size
of the population, whereas in verification there are two decision alternatives, acceptance or
rejection, regardless of the population size. Therefore, speaker identification performance
decreases as the size of the population increases, whereas speaker verification performance
approaches a constant, independent of the population size, unless the distribution of the
physical characteristics of speakers is extremely biased [1].

   [Figure: Speech wave → Feature Extraction → features compared against stored
   reference templates/models for each speaker (built during training); the
   similarity (distance) computed during recognition yields the recognition result.]

                         Figure 1.1: Structure of Speaker Recognition System


       The flow diagram of this project is shown in the figure above. Feature parameters
extracted from the input speech wave are compared with the stored reference templates or
models for each registered speaker. The recognition decision is made according to the
distance (or similarity) values. For speaker verification, input utterances whose distances to
the reference template are smaller than the threshold are accepted as being utterances of the
registered speaker (customer), while input utterances with larger distances are rejected as
being those of a different speaker. For speaker identification, the registered speaker whose
reference template is nearest to the input utterance, among all of the registered speakers, is
selected as the speaker of the input utterance.
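The threshold-based accept/reject rule and the nearest-template rule described above can be sketched as follows; the distance values and the threshold are hypothetical placeholders, not values measured in this project.

```python
import numpy as np

def verify(distance_to_claimed_template, threshold):
    # Accept only if the utterance is close enough to the claimed speaker's template.
    return bool(distance_to_claimed_template < threshold)

def identify(distances_to_all_templates):
    # Pick the registered speaker whose reference template is nearest.
    return int(np.argmin(distances_to_all_templates))

# Hypothetical distances between one input utterance and 4 registered speakers.
distances = np.array([0.82, 0.35, 1.20, 0.97])
threshold = 0.5

accepted = verify(distances[1], threshold)   # verification: claim is speaker 1
best_match = identify(distances)             # identification: nearest of all speakers
```

Note how verification touches only the claimed speaker's distance, while identification must compare against every registered speaker, which is why identification performance degrades as the population grows.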


       In verification, the goal is to determine from a voice sample whether a person is who
he or she claims to be. In speaker identification, the goal is to determine which one of a group
of known voices best matches the input voice sample. Furthermore, in either task the speech
can be constrained to a known phrase (text-dependent) or totally unconstrained (text-
independent). Success in both tasks depends on extracting and modeling the speaker-
dependent characteristics of the speech signal, which can effectively distinguish one talker
from another.


       The speaker recognition system takes a complex (and secret) series of measurements
that result in a mathematical representation of the user's voice. The various technologies used
to process and store voice prints include frequency estimation, hidden Markov models,
Gaussian mixture models, pattern matching algorithms, neural networks, matrix
representation and decision trees.


       A well-known method for speaker verification is the hidden Markov model (HMM),
which can efficiently model statistical variation in spectral features; with this method,
significantly better accuracy can be achieved. Gaussian mixture models (GMM) are normally
used for speaker identification. A GMM system models the speaker identity based on the
interpretation that the Gaussian components represent general speaker-dependent spectral
shapes, and on the capability of Gaussian mixtures to model arbitrary densities.
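As a rough sketch of how a GMM scores an utterance, the snippet below evaluates the log-likelihood of a set of feature vectors under a diagonal-covariance Gaussian mixture. The two-component, two-dimensional toy model is invented purely for illustration and is far smaller than the mixtures trained in this project.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vectors x (N x D) under a diagonal-covariance
    GMM with weights (M,), means (M x D) and variances (M x D)."""
    N, D = x.shape
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # Log of a D-dimensional diagonal Gaussian, evaluated at every frame.
        log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_g = log_norm - 0.5 * np.sum((x - mu) ** 2 / var, axis=1)
        log_probs.append(np.log(w) + log_g)
    # Log-sum-exp over mixture components, then sum over frames.
    stacked = np.stack(log_probs)            # shape (M, N)
    m = stacked.max(axis=0)
    per_frame = m + np.log(np.exp(stacked - m).sum(axis=0))
    return per_frame.sum()

# Toy 2-component model in 2 dimensions (hypothetical parameters).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
x_near = np.zeros((5, 2))       # frames lying on the first component's mean
x_far = np.full((5, 2), 10.0)   # frames far from both components
score_near = gmm_log_likelihood(x_near, weights, means, variances)
score_far = gmm_log_likelihood(x_far, weights, means, variances)
```

In a verifier, scores like these are computed against the claimed speaker's model and a background model, and their difference is compared to the decision threshold.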


       In this project, the Gaussian Mixture Model (GMM) text-independent speaker
verification approach is employed due to its high accuracy. The details of the GMM approach
are presented in Chapter 4, and the experimental results and conclusions are presented in
Chapters 5 and 6 respectively.


       In addition, for recognition accuracy, we use a telephone-channel simulation to
identify the sources of degradation of speech over telephone lines that have the greatest
impact on speech recognition accuracy. A comparison between telephone-quality speech and
speech recovered using the inverse filtering method, as well as the effect of white noise, was
evaluated. The TIMIT database, provided by the project supervisor, was used for the speaker
verification simulation. The voice database was arranged as shown in Table 1.1.

                            Training        Background         Client          Impostor
                                          Model Training     Evaluation      Evaluation
                        Models Samples   Models Samples   Models Samples   Models Samples
       Voice Biometric    40      8        20      10       40      2        42      2

                      Table 1.1: Speaker Verification database arrangements
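The telephone-channel simulation described above band-limits speech to about 3.4 kHz. As a minimal sketch (assuming a simple windowed-sinc FIR design rather than the specific low-pass filter used in this project), speech can be band-limited as follows; the 16 kHz sampling rate matches TIMIT, but the test tones are synthetic stand-ins for real speech.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Windowed-sinc FIR low-pass filter (Hamming window)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / fs                    # normalized cutoff (cycles/sample)
    h = 2 * fc * np.sinc(2 * fc * n)       # ideal low-pass impulse response
    h *= np.hamming(num_taps)              # taper to reduce sidelobe ripple
    return h / h.sum()                     # normalize to unity gain at DC

def simulate_telephone(speech, fs=16000, cutoff_hz=3400):
    """Crudely simulate telephone-quality speech by band-limiting to 3.4 kHz."""
    return np.convolve(speech, lowpass_fir(cutoff_hz, fs), mode="same")

# A 1 kHz tone (in band) should pass; a 6 kHz tone (out of band) should be cut.
fs = 16000
t = np.arange(fs) / fs
inband = simulate_telephone(np.sin(2 * np.pi * 1000 * t), fs)
outband = simulate_telephone(np.sin(2 * np.pi * 6000 * t), fs)
```

A real telephone channel also attenuates frequencies below about 300 Hz and adds channel distortion, which is why the inverse-filtering step investigated in Chapter 5 is needed to recover the original spectrum.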






1.2       Project objective

   •   The main objective of this project is to develop a prototype robust speaker
       recognition system based on spoken user speech, as well as to simulate the speaker
       verification system when speech is transmitted via telephone.
   •   The academic goal of this project is to develop a combination of research, design,
       programming and analysis skills.
   •   Other goals are to improve project management and problem-solving skills.




1.3       Project Scope

The project will include the following main tasks:
1. Conversion of raw speech signals to normal wave signals
2. Evaluation of feature extraction from normal wave signals
3. Evaluation of speaker verification system
4. Evaluation and training of speech using Gaussian Mixture Model (GMM) technique
5. Simulation of speaker verification system when speech is transmitted via telephone
6. Performance comparison of clean speech and telephone quality speech simulation system
7. Performance comparison of telephone quality speech and the speech recovered using the
      inverse filtering method
8. Effect of white noise on the speaker verification system.
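Task 8 above calls for adding white noise at various signal-to-noise ratios. A minimal sketch of noise addition at a target SNR, assuming the clean signal is a NumPy array (a synthetic tone stands in for real speech here), might look like this:

```python
import numpy as np

def add_white_noise(clean, snr_db, rng=None):
    """Add white Gaussian noise to a clean signal at the requested SNR (dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(clean ** 2)
    # Scale the noise power so that 10*log10(signal/noise) equals snr_db.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

# Toy "speech": a 1 kHz tone sampled at 16 kHz (hypothetical stand-in).
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 1000 * t)
noisy = add_white_noise(clean, snr_db=10)
# The measured SNR of the result should be close to the 10 dB target.
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Repeating this at 10, 15, 20, 25 and 30 dB produces the noisy test sets whose verification results are compared in Chapter 5.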





                              CHAPTER 2: Literature Review

2.1     Review on the operation of biometric system

        Biometric recognition (also known as biometrics) refers to the automated recognition
of individuals based on their biological and behavioral traits. Examples of biometric traits
include fingerprint, face, iris, palmprint, retina, hand geometry, voice, signature and gait.


        A biometric system is a computer system that implements biometric recognition
algorithms. A typical biometric system consists of sensing, feature extraction, and matching
modules. Biometric sensors (e.g., fingerprint sensor, digital camera for face) capture or scan
the biometric trait of an individual to produce its digital representation. A quality check is
generally performed to ensure that the acquired biometric sample can be reliably processed
by the subsequent feature extraction and matching modules. The feature extraction module
discards the unnecessary and extraneous information from the acquired samples, and extracts
salient and discriminatory information called features that are generally used for matching.
During matching, the query biometric sample is matched with the reference information
stored in the database to establish the identity associated with the query.


        Generally, a biometric system has two stages of operation: enrollment and
recognition. Enrollment refers to the stage in which the system stores some biometric
reference information about the person in a database. This reference information may be in
the form of a template (features extracted from the biometric sample or parameters of a
mathematical model that best characterizes the extracted features) or the biometric sample
itself (e.g., a face or fingerprint image).


        In many applications, some identity attributes about the person (name, ID number,
etc.) are also stored along with the biometric reference. When no personal identity
information is available (e.g., unknown latent prints lifted from a crime scene, anonymous
authentication applications, etc.), the reference is usually tagged with a system-generated ID
for future recognition. In the recognition stage, the system scans the user‟s biometric trait,
extracts features, and matches them against the reference biometric information stored in the
database. A high similarity score between the query and the reference data results in the user
being authenticated or identified. The figure below shows the basic structure of a biometric
system.







   [Figure: User → Sensor → Pre-processing → Feature Extractor → Template Generator;
   during enrollment the template is stored, while during a test the sample is passed
   to the Matcher, whose decision is returned to the application device.]

                           Figure 2.1: Basic structure of Biometric system


Biometric recognition systems typically provide two different functionalities:


(a) Verification (“Is this the person who he claims to be?”). For example, a person claims that
he is John and offers his fingerprint; the system then either accepts or rejects the claim based
on comparing the offered pattern (query or input) and the enrolled pattern (reference)
associated with the claimed (John) identity.


(b) Identification (“Is this person in the database?”). Given an input biometric sample, the
system determines if this pattern is associated with any one of a usually large number (e.g.,
millions) of enrolled identities. There are two types of identification scenarios. In positive
identification, the person asserts that the biometric system knows him. In negative
identification, the person asserts that the biometric system does not know him. In both
scenarios, the system confirms or refutes the person's assertion by acquiring his biometric
sample and comparing it against all templates in the database [29].






2.2     Reviews on the operation of Speaker Recognition system

        As human beings, we are able to recognize someone just by hearing him or her talk.
Usually, a few seconds of speech are sufficient to identify a familiar voice. The idea of
teaching computers to recognize humans by the sound of their voices is called speaker
recognition. Speaker recognition is the process of automatically recognizing who is speaking
by using speaker-specific information included in speech waves to verify the identities
claimed by people accessing systems; that is, it enables access control of various services by
voice. Applicable services include voice dialing, banking over a telephone network,
telephone shopping, database access services, information and reservation services, voice
mail, security control for confidential information, and remote access to computers.


        The task of speaker recognition is a classical example of a pattern recognition
problem, which in general involves finding patterns within real-world sensor data. All
pattern recognition problems require a training phase. In a speaker authentication system,
for example, valid users of the system need to be enrolled. During the enrollment procedure,
the system “learns” the person it is supposed to recognize; speech samples of the user are
required for this training phase. During the later recognition process, the system compares
another recorded speech signal (also called test data) to the training utterances. The desired
output of the system is the name of one of the training speakers, or a rejection if the test
utterance stems from an unknown person [31].


        The figures below show the basic structures of speaker recognition
(identification/verification) systems. Speaker identification is the process of determining
which registered speaker provided a given utterance. Speaker verification, on the other hand,
is the process of accepting or rejecting the identity claim of a speaker. Most applications in
which a voice is used as the key to confirm the identity of a speaker are classified as
verification [30].








[Figure: (a) Speaker identification: input speech → feature extraction → similarity computed
against the reference template or model of each enrolled speaker → maximum selection →
identification result (speaker ID). (b) Speaker verification: input speech → feature extraction
→ similarity computed against the reference template or model of the claimed speaker (#M)
→ threshold decision → accept or reject.]

                     Figure 2.2: Basic structure of Speaker recognition systems


       Both speaker verification and identification can be classified into text-dependent and
text-independent applications, based on whether or not the person is required to speak pre-
determined words or sentences. Most text-dependent speaker recognition systems use the
concept of Hidden Markov Models (HMMs). These are stochastic models that provide a
statistical representation of the sounds produced by the individual. The HMM represents the
underlying variations and temporal changes found in the speech states using the quality,
duration, intensity dynamics and pitch characteristics [25]. For text-independent speaker
recognition systems, the Gaussian Mixture Model (GMM) is widely used. Like the HMM,
this method uses the voice to create a number of vector states representing the various sound
forms that are characteristic of the physiology and behavior of the individual. Both methods
compare the similarities and differences between the input voice and the stored voice states
to produce a recognition decision.




2.3      Feature vector analysis

         Over the years various speech features have been investigated, among them intensity,
pitch, the short-time spectrum, the LP cepstrum, formants, nasal co-articulation, spectral
correlation, harmonic features and cepstral measures. The most popular speech feature in
recent years has been the cepstrum, and it has been suggested that it is superior for speech
processing applications. Of its variants, the LP cepstrum and the Mel-warped cepstrum are
the most frequently used.


         In attempts to increase the robustness of speech processing systems at the feature
level, researchers have either modified these features or derived new types of feature from
the LP cepstrum or Mel-warped cepstrum [17].


A brief synopsis of some important techniques follows.
       Perceptual linear prediction (PLP)
Perceptual linear prediction is based on the short-term spectrum of speech that has been
modified by a psychophysical spectral transformation. The PLP technique attempts to
simulate some of the properties of hearing, namely:


                  • the critical-band resolution curve,
                  • the equal-loudness curve, and
                  • the intensity-loudness power-law relation.





After these modifications have been applied in the frequency domain, LP coefficients are
calculated to form a new speech feature. While this feature improves the robustness of
speech recognition systems, it is still (just like other features based on the short-time
spectrum) susceptible to noise and to changes in the frequency response of the
communication channel [17].


    Linear predictive coding


   A commonly used method is linear prediction or linear predictive coding (LPC). LPC
assumes that speech can be modeled as the output of a linear, time-varying system excited by
either periodic pulses (to model voiced speech, which is generated by the vibration of the
vocal cords) or random noise (to model unvoiced speech, generated when the vocal cords are
relaxed).


   LPC performs spectral analysis on windows of speech using an all-pole modeling
constraint. Within each window, the sample at time t, s(t), is assumed to be expressible as a
linear combination of the past p samples plus an excitation term, G u(t):

    s(t) = \sum_{i=1}^{p} a_i \, s(t - i) + G u(t)    --------- (2.1)

where the a_i are assumed to be constant over the window of speech, G is the gain and u(t)
is the normalized excitation. Expressing equation (2.1) in the z-domain,

    S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z)    ------------- (2.2)

where S is the z-transform of s and likewise for U. Rearranging (2.2) gives

    H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}    ------------- (2.3)

which corresponds to the transfer function of a digital time-varying all-pole filter.





        The parameter p is called the LPC analysis order. The output of the LPC analysis on
a window of speech is the vector of p coefficients (known as the autoregressive coefficients),
a_1 … a_p, that specify the all-pole model. The solution for these LPC coefficients is
obtained by minimizing the mean squared error between the model and the signal and
solving the resultant set of p simultaneous equations. The autoregressive coefficients can be
further transformed into the cepstral coefficients of the all-pole model [9].
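As a concrete illustration, the normal equations can be solved with a few lines of Python/NumPy (a minimal sketch of the autocorrelation method; the project itself used MATLAB, and the AR(2) test signal below is purely synthetic):

```python
import numpy as np

def lpc(signal, p):
    """Order-p LPC via the autocorrelation (Yule-Walker) method:
    minimize the mean squared prediction error by solving the
    p simultaneous normal equations R a = r."""
    n = len(signal)
    r = np.array([signal[:n - k] @ signal[k:] for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Sanity check on a synthetic AR(2) signal with known coefficients
rng = np.random.default_rng(0)
s = np.zeros(5000)
e = rng.standard_normal(5000)
for t in range(2, 5000):
    s[t] = 1.3 * s[t - 1] - 0.4 * s[t - 2] + e[t]
a_hat = lpc(s, 2)   # estimate close to the true coefficients [1.3, -0.4]
```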




2.4     Why simulation over telephone channel?

        Although much of today's banking is carried out over the Internet, there are still those
among us who prefer to use the now rather old-fashioned telephone banking facilities that
banks and building societies still offer their clients.


        Many would argue that telephone banking is safer than Internet banking given the
recent increase in Internet and identity fraud. However, weak authentication measures
continue to be used for phone banking.


        This lack of effective and strong authentication for phone banking will lead to
fraudsters increasingly targeting these services. Nowadays, fraudsters call customers and con
them into disclosing personal data over the phone; this data is then used to access accounts
via telephone banking services. Alternatively, voice over Internet Protocol (VoIP)
technology is used to con customers into disclosing personal data.


        At present, banks that offer phone banking services have to employ highly trained
staff for banking transactions. Improvements in technology make automated phone banking
services possible. A popular choice for remote authentication is speaker recognition, owing
to its ease of integration and the ready availability of devices for collecting speech samples
over the telephone network. Figure 2.3 shows the typical authentication process in a phone
banking system.





[Figure: the user calls the bank and says the password; the speaker recognition module
determines who is speaking and matches the voice against the claimed user; on a match
("Yes") the account is accessed, otherwise ("No") the call is rejected.]

                  Figure 2.3: Typical user authentication in Phone banking system


       Some of the significant advantages of automated phone banking over traditional
phone banking services are reduced cost (instead of hiring people, an automated system
authenticates and verifies whether the user is who he claims to be), fewer human errors, ease
of use, and much more effective authentication than a traditional human interface, where an
impostor (fraudster) can easily authorize a transaction if he has the personal information of
the user.


       However, recognition accuracy is lower for speech over the telephone network than
for speech carefully recorded in a quiet environment. In this project, we describe a series of
simulations that attempt to evaluate the effects of several causes of telephone-channel
degradation, to determine which of these impairments have the greatest impact on
recognition accuracy, as well as how to recover data as close to the original recording as
possible by studying the Least Mean Squares (LMS) algorithm of the adaptive filter.
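Since the LMS adaptive filter is central to the channel-compensation study, a minimal sketch of the LMS update is given below (Python/NumPy; the filter order, step size and the two-tap channel are illustrative choices, not taken from the project):

```python
import numpy as np

def lms(x, d, p=2, mu=0.05):
    """Least Mean Squares adaptive filter: adapts a length-p FIR filter w
    so that filtering the input x tracks the desired signal d.
    Returns the error signal and the final filter weights."""
    w = np.zeros(p)
    e = np.zeros(len(x))
    for n in range(p - 1, len(x)):
        xn = x[n - p + 1:n + 1][::-1]   # current and past p-1 input samples
        y = w @ xn                      # adaptive filter output
        e[n] = d[n] - y                 # estimation error
        w += 2 * mu * e[n] * xn         # stochastic gradient-descent update
    return e, w

# Illustration: identify an unknown 2-tap channel from input/output data.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
h = np.array([0.5, 0.25])               # hypothetical channel
d = np.convolve(x, h)[:len(x)]          # channel output (noise-free)
_, w = lms(x, d)                        # w converges towards h
```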





CHAPTER 3: PROJECT PLAN
Project tasks are divided into nine sections.
1. Project Proposal and Approval Process
2. Literature Search
3. Preparing for Initial Report (TMA01)
4. Evaluation of feature extraction from raw speech signal
5. Programming Speaker Verification (SV) system
6. Evaluation and training speech using GMM technique
7. Testing of SV system over telephony applications
8. Preparing for Final Report (Thesis)
9. Preparing for oral Presentation
• In Task 1, we choose one of the 10 proposed projects and submit it for approval. It
    took us about 7 days to choose, after which the project committee allocated the
    proposed projects; approval took about 8 days.
• Since literature research is one of the most important steps in understanding the
    project, 31 days are allotted to Task 2.
• Since preparation of the initial report partially depends on Task 2, Tasks 2 and 3 were
    carried out at the same time.
• We set out to complete the practical project work by 31 Aug 2009. The combined
    duration of Tasks 4, 5, 6 and 7 is 183 days. Since the main objective is to develop the
    speaker identification/verification system, Task 4 is allotted 37 days, Task 5 26 days
    and Task 6 48 days.
• For Task 8, 61 days are allotted to the preparation of the final report, as it is the
    portrayal of our whole project work and carries 40% of the capstone project score. It
    starts on 1 Sep 09, ends at the submission date of 9 Nov 09, and is carried out
    concurrently with Task 7.
• For Task 9, preparation for the oral presentation starts 3 weeks before the writing of
    the report finishes. There are 22 days available for this task.




    Detailed Project Plan and Gantt chart are attached below.

     Robust Speaker Identification/Verification for Telephony Applications Project Plan

Tasks Description                                        Start Date   End Date    Duration (Days)
1. Project Proposal and Approval Process                 2-Jan-09     16-Jan-09        15
   1.1 Prepare and submit PPA form                       2-Jan-09     8-Jan-09          7
   1.2 Approval of proposed project                      9-Jan-09     16-Jan-09         8
2. Literature Search                                     17-Jan-09    16-Feb-09        31
   2.1 Research on IEEE online journals, relevant
       reference books and former student thesis
       reports                                           17-Jan-09    30-Jan-09        14
   2.2 Analyze and study relevant books and journals     31-Jan-09    16-Feb-09        17
3. Preparation of initial report (TMA01)                 17-Feb-09    26-Feb-09        10
4. Evaluation of feature extraction from raw speech
   signal                                                27-Feb-09    6-Apr-09         37
   4.1 Review and decide which method to be used         27-Feb-09    6-Mar-09          7
   4.2 Evaluation of feature extraction method           7-Mar-09     6-Apr-09         30
5. Programming of Speaker Verification (SV) system       7-Apr-09     2-May-09         26
   5.1 Study of existing MATLAB code for SV system       7-Apr-09     15-Apr-09         9
   5.2 Create and modify code for the SV process         16-Apr-09    24-Apr-09         9
   5.3 Evaluation of SV system performance               25-Apr-09    2-May-09          8
6. Evaluation and training of speech using GMM
   technique                                             3-May-09     19-Jun-09        48
   6.1 Review and evaluate the chosen method             3-May-09     25-May-09        23
   6.2 Programming and evaluation of the chosen method   26-May-09    2-Jun-09          8
   6.3 Evaluation of training performance                3-Jun-09     19-Jun-09        17
7. Testing of SV system over telephony applications      20-Jun-09    31-Aug-09        72
8. Preparation of final report                           1-Sep-09     31-Oct-09        61
   8.1 Writing skeleton of final report                  1-Sep-09     7-Sep-09          8
   8.2 Writing literature search                         8-Sep-09     15-Sep-09         8
   8.3 Writing introduction of report                    16-Sep-09    22-Sep-09         8
   8.4 Writing main body of report                       23-Sep-09    14-Oct-09        22
   8.5 Writing conclusion and further study              15-Oct-09    20-Oct-09         5
   8.6 Finalizing and amendments of report               21-Oct-09    31-Oct-09        10
9. Preparation for oral presentation                     1-Nov-09     25-Nov-09        22
   9.1 Review the whole project and decide on
       presentation content                              1-Nov-09     15-Nov-09        12
   9.2 Prepare poster for presentation                   16-Nov-09    25-Nov-09        10

Resources: library resources (reference books); web resources (IEEE journals, past theses,
source code, related reference books); school lab facility and personal computer; MATLAB
software; speaker identification/verification references.

                                     Table 3.1: Detail Project Plan





                               Robust speaker identification/verification project Gantt chart by week




                                                        Figure 3.1: Project Gantt chart



               CHAPTER 4: Design of the Speaker Verification System

4.1       Speech database

        A number of speech databases are available, each with its own characteristics. The
major differences between the databases are:
       • Speech quality
       • Speech bandwidth
       • Transmission channel
       • Recording conditions
       • Variation between recording sessions
       • Length of utterance
      The database used in our experiments is the TIMIT acoustic-phonetic speech corpus [17].
The speech was recorded at Texas Instruments (TI) and transcribed at the Massachusetts
Institute of Technology (MIT); hence the name “TIMIT”. TIMIT was designed to provide
speech data for the acquisition of acoustic-phonetic knowledge and for the development and
evaluation of automatic speaker recognition systems. The main advantage of the TIMIT
database is the high quality of the speech, which makes it ideal for testing a technique
without the interference of noise and channel variations.


      The TIMIT database contains a total of 6300 sentences: 10 sentences spoken by each of
630 speakers from 8 major dialect regions in the United States. A subset of 102 speakers,
each speaking 10 different sentences, was utilized as given by our supervisor. Of the 102
speakers, 40 speakers were used for training and client evaluation, and another 40 speakers
(different from training) for impostor evaluation. A further 20 speakers, with 10 samples
each, were used for the background model. Although 10 samples are available for each
speaker, we only used 8 samples for client speaker training. For client and impostor
evaluation, 2 samples (different from those used in client training) per speaker were used.
Table 4.1 shows the speaker database arrangement.
                 Background        Training          Client Evaluation   Impostor Evaluation
 Biometrics      Samples       Models   Samples     Models   Samples     Models   Samples
 Speaker         20            40       8           40       2           42       2

                             Table 4.1: Speaker database arrangement


      Speaker files were converted from the TIMIT format to the standard .wav format using
the convert_Wav and readsph functions written by Mr. Mike Brookes, Department of
Electrical and Electronic Engineering, Imperial College. Figure 4.1 shows speech samples of
two different persons.




                Figure 4.1: Speech samples of two different persons in TIMIT database



4.2      Feature Extraction

         One of the most important aspects of any speech processing system is the extraction
of feature vectors from the speech signal. Feature selection is the process of mapping the
original measurements into a more effective space. Digitised speech usually consists of large
amounts of data. While all this information is required to represent the speech waveform, the
speech process itself can be represented with far less data, because the characteristics of
speech change slowly compared to the waveform. The goal of feature extraction is to
compress the available information so as to select the best features to represent the speech
signal for the application at hand.


         One of the first considerations in a speaker recognition system is the selection of
which features to use. Every person has a natural sound quality due to his or her voice pitch.
However, pitch detection has proven challenging, and reliance on it can allow impostors to
gain access by changing their own pitch. Another problem with pitch is that it cannot be
reliably measured in some speech sounds, for example consonants. As such, many front-end
algorithms do not use pitch as a specific feature.




           Various features have been developed in the past for speaker recognition, such as
linear prediction coefficients (LPC), cepstral coefficients, mel-frequency cepstral coefficients
(MFCC), etc. The most well-known feature representation, MFCC, was adopted in this
project.




4.2.1 Mel-Frequency Cepstral Coefficients (MFCC)

        In this project, Mel-Frequency Cepstral Coefficients are used. MFCCs are
coefficients that represent audio based on human perception, and they have had great success
in speaker recognition applications. They are derived from the Fourier transform of the audio
clip. In this technique the frequency bands are positioned logarithmically, whereas in the
plain Fourier transform they are linearly spaced. Because its frequency bands are positioned
logarithmically, the MFCC approximates the human auditory response more closely than
other representations, and the resulting coefficients allow better processing of the data.


        The calculation of the Mel cepstrum is the same as that of the real cepstrum, except
that the Mel cepstrum's frequency scale is warped to correspond to the Mel scale. The Mel
scale is based on studies of the pitch, or frequency, perceived by human listeners; its unit is
the mel.


        To obtain the MFCC coefficients, the input speech signal is windowed and the
Discrete Fourier Transform is taken to convert it into the frequency domain. In the frequency
domain, the log magnitude at each of the mel frequencies is acquired; with the help of a filter
bank, a mel-scaling (mel-warping) is then carried out. The filter bank is commonly
implemented using triangular overlapping filters that are linearly spaced from 0 to 1 kHz and
then non-linearly placed according to the mel-scale approximation [27].




Figure 4.2 below shows the general form of the filter bank.


                                    Figure 4.2: General filter bank

The mel scale is approximated by

    Mel(f) = 2595 \log_{10}\left( 1 + \frac{f}{700} \right)    ----------------------------- (4.1)

where Mel(f) is the frequency in mels and f is the input frequency in hertz.


         The center frequencies of the triangular filters are set at the mel-scale frequencies,
with the low input frequencies (less than 1 kHz) given a higher profile than the higher
frequencies. The resultant signal from the filtering is then transformed into the cepstral
domain using an inverse DFT, usually implemented with a Discrete Cosine Transform. The
signal in the cepstral domain is measured in quefrencies. The lower-order coefficients are
selected as the feature vector, avoiding the higher coefficients, which include the pitch. The
coefficients are then uniformly scaled and used as the output feature vector for that speech
frame.
In this project, MFCC is obtained by the following sequences of processing:
   1. Window the data by using a hamming window
   2. Shift it into FFT (Fast Fourier Transform) order
   3. Find the magnitude of the FFT
   4. Convert the FFT data into filter bank outputs
   5. Find the log base 10
   6. Find the cosine transform to reduce dimensionality
   The filter bank is implemented using 13 linearly-spaced filters followed by 27 log-spaced
filters. 16 MFCCs were extracted from each speaker in this project.
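The six processing steps above can be sketched as follows (Python/NumPy; a simplified illustration in which, unlike the project's 13 linear plus 27 log-spaced filters, the triangular filters are spaced uniformly on the mel scale):

```python
import numpy as np

def mel(f):
    """Hz -> mel, per equation (4.1)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=16):
    """Compute n_ceps MFCCs for one speech frame of sampling rate fs."""
    n = len(frame)
    windowed = frame * np.hamming(n)                  # 1. Hamming window
    spectrum = np.abs(np.fft.rfft(windowed))          # 2-3. FFT magnitude
    # 4. triangular mel filter bank (uniformly spaced on the mel scale here)
    mel_pts = np.linspace(0.0, mel(fs / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.round(hz_pts * n / fs).astype(int)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, hi):
            w = (k - lo) / max(mid - lo, 1) if k < mid else (hi - k) / max(hi - mid, 1)
            fbank[i] += w * spectrum[k]
    logfb = np.log10(fbank + 1e-10)                   # 5. log base 10
    # 6. DCT (type II) to decorrelate and keep the first n_ceps coefficients
    idx = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), idx + 0.5) / n_filters)
    return basis @ logfb
```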



4.3     Speaker Modeling

        After a waveform is converted to a sequence of vectors to reduce the data rate of the
signal, a model of the speaker must be created. The model is used to obtain an utterance
score, which indicates the correspondence between the utterance and the model.
The utterance scores are processed using hypothesis testing to obtain the final decisions.
There are numerous choices such as neural networks, Support Vector Machines (SVM),
Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) as well as
combinations of the various approaches. While some models are capable of exploiting
dependency in the sequence of feature vectors from the target speakers, the approaches based
on GMM treat the sequence of feature vectors as independent random vectors.


        The major factors that influence the choice are the amount of data available for the
speaker models, the nature of the verification problem (whether it is text-independent or
text-dependent) and the level of performance that is required. The current state-of-the-art
speaker model is the Gaussian Mixture Model (GMM). Since the TIMIT speech database is
text-independent, we adopted the GMM, which is one of the most widely used speaker
models.



4.3.1 Gaussian Mixture Model (GMM)
        The Gaussian Mixture Model (GMM) is a density estimator and is one of the most
commonly used types of classifier. The standard training method for GMM models is to use
MAP adaptation of the means of the mixture components based on speech from a target
speaker. The mathematical form of an m-component Gaussian mixture for D-dimensional
input vectors is

    P(x | M) = \sum_{i=1}^{m} a_i \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}}
               \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)    ------------- (4.2)

where P(x | M) is the likelihood of x given the mixture model M. The mixture model
consists of a weighted sum over m unimodal Gaussian densities, each parameterized by a
mean vector \mu_i and covariance matrix \Sigma_i. The coefficients a_i are the mixture
weights, which are constrained to be positive and must sum to one. The parameters of a
Gaussian mixture model, a_i, \mu_i and \Sigma_i for i = 1 … m, may be estimated using the
maximum likelihood criterion via the iterative Expectation-Maximization (EM) algorithm.
In general, fewer than ten iterations of the EM algorithm will provide sufficient parameter
convergence.


         The complete Gaussian mixture PDF is represented by the mean vectors, covariance
matrices and mixture weights of all the component densities [2]. These parameters are
collectively represented by the notation

    \lambda = \{ a_i, \mu_i, \Sigma_i \},  i = 1, …, m    …………………… (4.3)

where \lambda is the GMM for each speaker.
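Equation (4.2) can be evaluated directly; below is a small Python/NumPy sketch for a full-covariance mixture (illustrative only, not the project's MATLAB implementation):

```python
import numpy as np

def gmm_likelihood(x, weights, means, covs):
    """Evaluate P(x | M) from equation (4.2) for an m-component
    Gaussian mixture; x is a length-D vector, covs are DxD matrices."""
    D = len(x)
    total = 0.0
    for a, mu, cov in zip(weights, means, covs):
        diff = x - mu
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov))
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)  # Mahalanobis term
        total += a * np.exp(expo) / norm                 # weighted density
    return total

# A single standard-normal component in 1-D, evaluated at its mean,
# gives the familiar value 1/sqrt(2*pi).
p = gmm_likelihood(np.array([0.0]), [1.0], [np.array([0.0])], [np.eye(1)])
```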



4.3.2 Maximum Likelihood Training
         The next task is to estimate the parameters of the GMM, \lambda, which best match
the distribution of the training feature vectors given by the speech of the speaker. There are
several available techniques for GMM parameter estimation; the most popular is maximum
likelihood (ML) estimation. The basic idea of this method is to find the model parameters
which maximize the likelihood of the GMM. For a given set of T training vectors
X = \{x_1, …, x_T\}, the GMM likelihood can be written as

    p(X | \lambda) = \prod_{t=1}^{T} p(x_t | \lambda)    -------------- (4.4)

         ML parameter estimates can be obtained iteratively using a special case of the
expectation-maximization (EM) algorithm. The basic idea is, beginning with an initial model
\lambda, to estimate a new model \bar{\lambda} such that p(X | \bar{\lambda}) \ge p(X | \lambda).
The new model then becomes the initial model for the next iteration, and this process is
repeated until some convergence threshold is reached [21].
         On each iteration, reestimation formulas are used: mixture weights are recalculated.

         ai 
                1 T
                              
                   p i xt ,  . --------------------- (4.5)
                T t 1




Means are recalculated as:

μ̄_i = [ ∑_{t=1}^{T} p(i | x_t, λ) x_t ] / [ ∑_{t=1}^{T} p(i | x_t, λ) ]                -------------------- (4.6)

Variances are recalculated as:

σ̄_i² = [ ∑_{t=1}^{T} p(i | x_t, λ) (x_t − μ̄_i)² ] / [ ∑_{t=1}^{T} p(i | x_t, λ) ]                ----------- (4.7)


The a posteriori probability for acoustic class i is given by:

p(i | x_t, λ) = a_i b_i(x_t) / ∑_{k=1}^{M} a_k b_k(x_t)                ----------------------- (4.8)


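The re-estimation equations (4.5)–(4.8) can be sketched in code. The project's experiments were run in MATLAB 7.1; the following NumPy version is only an illustrative sketch of one EM step for a diagonal-covariance GMM, not the project's implementation:

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM re-estimation step for a diagonal-covariance GMM,
    following equations (4.5)-(4.8): X is (T, D); weights is (M,);
    means and variances are (M, D)."""
    T = X.shape[0]
    M = weights.shape[0]
    # b_i(x_t): component Gaussian densities, evaluated in the log
    # domain for numerical stability
    log_b = np.empty((T, M))
    for i in range(M):
        diff = X - means[i]
        log_b[:, i] = (-0.5 * np.sum(diff ** 2 / variances[i], axis=1)
                       - 0.5 * np.sum(np.log(2.0 * np.pi * variances[i])))
    # Eq. (4.8): a posteriori probability p(i | x_t, lambda)
    log_num = np.log(weights) + log_b
    post = np.exp(log_num - log_num.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n_i = post.sum(axis=0)                          # effective counts per class
    new_weights = n_i / T                           # Eq. (4.5)
    new_means = (post.T @ X) / n_i[:, None]         # Eq. (4.6)
    new_vars = (post.T @ X ** 2) / n_i[:, None] - new_means ** 2  # Eq. (4.7)
    return new_weights, new_means, np.maximum(new_vars, 1e-6)
```

Repeating `em_step` until the likelihood stops improving realizes the iterative ML training described above.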

4.4     Speaker Match Score

        The approach to speaker verification is to apply a likelihood ratio test to the input
utterance. In this phase, multi-Gaussian log-likelihood ratio tests are performed, and
f(X | λ_C), the likelihood that the utterance was produced by the claimed speaker C, is
computed as

ln f(X | λ_C) = (1/N) ∑_{t=1}^{N} ln f(x_t | λ_C)
              = (1/N) ∑_{t=1}^{N} ln ∑_{i=1}^{M} [ m_Ci / ((2π)^{D/2} |Σ_Ci|^{1/2}) ] exp{ −(1/2) (x_t − μ_Ci)ᵀ Σ_Ci⁻¹ (x_t − μ_Ci) }                ------------ (4.9)

where m_Ci is the i-th mixture weight, μ_Ci the mean vector and Σ_Ci the covariance matrix
of the claimed speaker model, λ_C.

The likelihood that the utterance was not produced by the claimed speaker is determined from
a collection of background speaker models. With a set of B background models, this
likelihood is calculated as follows:

ln f(X | λ_C̄) = (1/B) ∑_{b=1}^{B} ln f(X | λ_b)
              = (1/B) ∑_{b=1}^{B} (1/N) ∑_{t=1}^{N} ln ∑_{i=1}^{M} [ m_bi / ((2π)^{D/2} |Σ_bi|^{1/2}) ] exp{ −(1/2) (x_t − μ_bi)ᵀ Σ_bi⁻¹ (x_t − μ_bi) }                ---------- (4.10)

where m_bi is the i-th mixture weight, μ_bi the mean vector and Σ_bi the covariance matrix
of the b-th background speaker model, λ_b.

        For speaker verification, a subset of background speakers must be chosen from the set
of available speaker models. The two main issues in choosing background speakers are the
number of speakers and the selection criterion. The ideal choice would be to use all the
speakers as background models, but for larger speaker sets this is not computationally
feasible, as it would require very long and complex computation. Therefore, a trade-off
between the background size and computational efficiency must be made. In this project, 20
models with 10 sample speeches, different from the training data, are used as the background
model.
The match score is obtained as follows:

Score = ln f(X | λ_C) − ln f(X | λ_C̄)                ---------------- (4.11)

where f(X | λ_C) is the likelihood that the utterance was produced by the claimed speaker C,
and f(X | λ_C̄) is the likelihood that the utterance was not produced by the claimed speaker.

        The threshold is set to create a trade-off between rejecting true claimants, known as
the false rejection error, and accepting false claimants, known as the false acceptance error.
If the match score is higher than the set threshold, the claimed speaker is accepted as
genuine; otherwise the speaker is treated as an impostor.

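Equations (4.9)–(4.11) and the threshold decision can be sketched as follows. This is an illustrative NumPy sketch, not the project's MATLAB code; the model tuples `(weights, means, variances)` for diagonal-covariance GMMs are an assumed representation:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood ln f(X | lambda) as in Eq. (4.9),
    for a diagonal-covariance GMM; X is (N, D)."""
    log_b = np.stack(
        [-0.5 * np.sum((X - m) ** 2 / v + np.log(2.0 * np.pi * v), axis=1)
         for m, v in zip(means, variances)], axis=1)        # (N, M)
    log_mix = np.log(weights) + log_b
    mx = log_mix.max(axis=1, keepdims=True)                 # log-sum-exp
    frame_ll = mx[:, 0] + np.log(np.exp(log_mix - mx).sum(axis=1))
    return frame_ll.mean()

def match_score(X, claimed_model, background_models):
    """Eq. (4.11): claimed-speaker log-likelihood minus the averaged
    background log-likelihood of Eq. (4.10)."""
    ll_claimed = gmm_log_likelihood(X, *claimed_model)
    ll_background = np.mean([gmm_log_likelihood(X, *m)
                             for m in background_models])
    return ll_claimed - ll_background

def verify(score, threshold):
    """Accept the claimed identity when the match score exceeds the
    set threshold; otherwise treat the speaker as an impostor."""
    return score > threshold
```

In use, `match_score` is evaluated once per test utterance and compared against the tuned threshold via `verify`.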
4.5    Outlines of speaker verification system algorithms

        In our project, there are four key algorithms for the speaker verification system,
namely the training, background training, client evaluation and impostor evaluation
algorithms.


Speaker training algorithm is constructed as follows:
1. Load speakers from training database
2. Find and extract features (MFCC) for each training speaker sample
3. Find the mixture weights, means and variances for each training speaker model
4. Save the training mixture weights, means and variances for use in the verification
   algorithms


Speaker background model algorithm is constructed as follows:
1. Load speakers from background database
2. Find and extract features (MFCC) for each background speaker sample
3. Find the mixture weights, means and variances for a background speaker model
4. Save the background mixture weights, means and variances for use in the verification
   algorithms


Client evaluation algorithm is constructed as follows:
1. Load speakers from client evaluation database
2. Find and extract features (MFCC) for each client speaker sample
3. Find the likelihood of background model using client feature, background mixture
   weights, means and variances
4. Find the likelihood of client model using client feature, training mixture weights, means
   and variances
5. Find the client score = likelihood of client model - likelihood of background model
6. Determine the number of correctly accepted clients using the threshold
7. Determine the false rejection rate, FRR %
8. Save client score and FRR %


Impostor evaluation algorithm is constructed as follows:
1. Load speakers from impostor evaluation database
2. Find and extract features (MFCC) for each impostor speaker sample
3. Find the likelihood of background model using impostor feature, background mixture
   weights, means and variances
4. Find the likelihood of impostor model using impostor feature, training mixture weights,
   means and variances
5. Find the impostor score = likelihood of impostor model - likelihood of background model
6. Determine the number of wrongful acceptances using the threshold
7. Determine the false acceptance rate, FAR%
8. Save impostor score and FAR%


Full algorithms are attached in the appendix B.



                           CHAPTER 5: Experimental results

5.1    Performance measure of the speaker verification systems

        The performance of a speaker recognition system is usually expressed in terms of the
false acceptance rate (FAR), the false rejection (non-match) rate (FRR) and the genuine
acceptance rate (GAR). The FAR measures the percentage of invalid users who are incorrectly
accepted as genuine users; it is the ratio of the number of false acceptances to the number
of impostor tests. Conversely, the FRR measures the percentage of valid users who are
rejected as impostors; it is the ratio of the number of false rejections to the number of
verification tests. GAR is used as an alternative to FRR, and the Total Error Rate (TER) is
the sum of the false rejection and false acceptance rates.


        A client score measures the similarity between the trained and test samples of the
same person. An impostor score measures the similarity between samples from different
persons. In theory, client scores should always be higher than impostor scores. If that is
the case, a single threshold that separates the two groups of scores could be used to
distinguish clients from impostors [28]. For instance, if a client score is less than the
threshold, the verification system counts it as a false rejection; if an impostor score is
greater than the threshold, it is counted as a false acceptance.


In this project, FRR, FAR, GAR and TER are determined as follows:

FRR (%) = (number of false rejections / number of client tests) × 100%                ………………. (5.1)

GAR (%) = 100% − FRR (%)                ………………. (5.2)

FAR (%) = (number of false acceptances / number of impostor tests) × 100%                ………………. (5.3)

TER (%) = FRR (%) + FAR (%)                ………………. (5.4)

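Equations (5.1)–(5.4) can be computed directly from the saved client and impostor score arrays. A minimal NumPy sketch (illustrative only; the project's evaluation ran in MATLAB):

```python
import numpy as np

def verification_rates(client_scores, impostor_scores, threshold):
    """Compute FRR, GAR, FAR and TER (in %) per equations (5.1)-(5.4).
    A client score below the threshold is a false rejection; an impostor
    score at or above it is a false acceptance."""
    clients = np.asarray(client_scores, dtype=float)
    impostors = np.asarray(impostor_scores, dtype=float)
    frr = 100.0 * np.sum(clients < threshold) / clients.size       # Eq. (5.1)
    gar = 100.0 - frr                                              # Eq. (5.2)
    far = 100.0 * np.sum(impostors >= threshold) / impostors.size  # Eq. (5.3)
    ter = frr + far                                                # Eq. (5.4)
    return frr, gar, far, ter
```

For example, `verification_rates([4.1, 2.9, 3.8, 3.5], [1.2, 3.6], 3.0)` yields FRR 25%, GAR 75%, FAR 50% and TER 75%.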

        Depending on the choice of threshold value, the distributions of the client and
impostor scores overlap, as shown in figure 5.1.








                   Figure 5.1: Client and impostor scores overlapping illustration



        If the score distributions overlap, the FAR and FRR curves intersect at a certain
point, as shown in figure 5.2. This intersection point is known as the Equal Error Rate
(EER). The EER of a biometric system gives a threshold-independent performance measure: the
lower the EER value, the higher the accuracy of the verification system.

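Operationally, the EER can be located by sweeping a candidate threshold over the observed scores until FRR and FAR are closest. A small illustrative sketch (the helper name and score arrays are assumptions, not the project's code):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep candidate thresholds and return (eer_percent, threshold) at
    the point where FRR and FAR are closest, as in Figure 5.2. A score
    below the threshold is rejected."""
    clients = np.asarray(client_scores, dtype=float)
    impostors = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([clients, impostors]))
    best_gap, best = np.inf, (100.0, thresholds[0])
    for th in thresholds:
        frr = 100.0 * np.mean(clients < th)       # Eq. (5.1)
        far = 100.0 * np.mean(impostors >= th)    # Eq. (5.3)
        if abs(frr - far) < best_gap:
            best_gap = abs(frr - far)
            best = ((frr + far) / 2.0, th)
    return best
```

With `client_scores = [3, 4, 5, 6]` and `impostor_scores = [1, 2, 3, 4]`, the sweep finds an EER of 25% at a threshold of 4.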



                                 Figure 5.2: Equal Error Rate (EER)






5.2       Experiment Approach

        Evaluation of clean speech speaker verification, simulation of telephone quality
speech speaker verification, simulation of speaker verification after adaptive (inverse)
filtering of the telephone quality speech, and the effect of white noise on clean speech were
all experimented with in MATLAB 7.1. As mentioned above, the speaker verification system was
tested using the Gaussian Mixture Model approach with the TIMIT database.


        For the simulation of telephone quality speech speaker verification, the uncompressed
and converted clean speech is passed through a low-pass filter with a cut-off frequency of
3.4 kHz. For the next experiment, the adaptive filter was trained with reference to the clean
speech in order to obtain the filter coefficients. After passing through the adaptive filter,
the telephone quality speech was tested again for speaker verification. In addition, the
impact of white noise on speaker verification was evaluated. The results of each experiment
are discussed in the following sections.



5.3       Speaker Verification Experimental results

        In this project, the speaker identification/verification system is tested as follows:
               Clean speech training and clean speech testing
               Telephone quality speech training and telephone quality speech testing
               Clean speech training and telephone quality speech testing
               Clean speech training and testing with speech obtained by inverse filtering
                the telephone quality speech
               Clean speech training and testing with clean speech corrupted by added noise

5.3.1 Clean speech training and clean speech testing


Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR% | FAR%   | TER%
   1     |            1                |             10               |            1             |            81              |   3.2     |  0   | 3.2967 | 3.2967
   2     |           20                |             10               |           25             |            62              |   2.78    |  5   | 4.8611 | 9.8611
   3     |           40                |             10               |           40             |            42              |   2.5     | 3.75 | 4.8077 | 8.5577

      Table 5.1: Experimental results (testing) for clean speech speaker verification system





Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR% | FAR%   | TER%
   4     |            1                |             10               |            1             |            81              |   4       |  0   | 2.4691 | 2.4691
   5     |           20                |             10               |           25             |            62              |   2.18    |  5   | 4.0323 | 9.0323
  6-1    |           40                |             10               |           40             |            40              |   3.1     | 2.5  | 2.5    | 5
  6-2    |           40                |             10               |           40             |            42              |   3.1     | 2.5  | 2.81   | 5.31

      Table 5.2: Experimental results (final) for clean speech speaker verification system


        The experiments showed that the threshold value is directly proportional to the FRR
and inversely proportional to the FAR, and that the Total Error Rate decreases as the number
of samples used for training or testing increases. After fine-tuning the threshold setting on
numerous occasions, the threshold value was set where the false rejection rate and the false
acceptance rate are almost equal. The results for clean speech testing are shown in Tables
5.1 and 5.2.




                     Figure 5.3: ROC for clean speech speaker verification system












                     Figure 5.4: EER for clean speech speaker verification system


        Figure 5.4 illustrates the intersection of FRR and FAR for the clean speech speaker
verification system. The EER is about 2.5% at a threshold value of 3.1082. Hence, the total
error rate is 4.881%.




5.3.2 Telephone quality speech training and telephone quality speech testing




      Clean speech signal  →  Low-pass filter (cut-off 3.4 kHz)  →  Simulated telephone quality speech


                        Figure 5.5: Flow diagram for telephone channel simulation


        In order to approximate real-life telephone applications, a low-pass filter was
designed to generate the simulated telephone quality speech. An IIR Butterworth low-pass
filter of order 40 was used. The frequency response of the low-pass filter is shown in figure
5.6. The details of the filter design can be seen in Appendix A.
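The channel simulation can be sketched with SciPy's filter-design routines. This is an illustrative sketch, not the project's MATLAB design from Appendix A; the 16 kHz sampling rate is an assumption (TIMIT speech), and second-order sections are used because a 40th-order IIR filter is numerically fragile in transfer-function form:

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 16000.0  # assumed sampling rate of the clean speech (Hz)

# 40th-order IIR Butterworth low-pass filter with a 3.4 kHz cut-off,
# realized as second-order sections for numerical stability
SOS = butter(40, 3400.0, btype='low', fs=FS, output='sos')

def simulate_telephone(clean_speech):
    """Remove frequency components above 3.4 kHz, as in Figure 5.5."""
    return sosfilt(SOS, clean_speech)
```

Passing each clean utterance through `simulate_telephone` produces the telephone quality test material used in the experiments above.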








                          Figure 5.6: Frequency response of the Low-Pass Filter




                          Figure 5.7: Clean speech and telephone quality speech


        After the clean speech is passed through the low-pass filter, the frequency
components higher than 3.4 kHz are removed, as shown in figure 5.7.

Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR%    | TER%
   1     |            1                |             10               |            1             |            81              |   2.5     |  0    | 20.9877 | 20.9877
   2     |           20                |             10               |           25             |            62              |   3.15    | 22.5  | 24.1935 | 56.6935
   3     |           40                |             10               |           40             |            42              |   3.6     | 16.25 | 15.4762 | 31.7262

  Table 5.3: Experimental results for telephone quality speech speaker verification system

        With the simulation of the telephone quality speech speaker verification system, it
can be seen that the Total Error Rate (TER) has increased quite significantly compared to
Table 5.2.






             Figure 5.8: ROC for telephone quality speech speaker verification system








             Figure 5.9: EER for telephone quality speech speaker verification system


        Figure 5.9 shows the EER for the telephone quality speech speaker verification
system: the EER is around 15.85% at a threshold of 3.591.





5.3.3 Clean speech training and telephone quality speech testing

Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR% | FAR%    | TER%
   1     |            1                |             10               |            1             |            81              |   0.5     |  0   | 33.9506 | 33.9506
   2     |           20                |             10               |           25             |            62              |   3.49    | 60   | 60.4839 | 120.4839
   3     |           40                |             10               |           40             |            42              |  13.45    | 47.5 | 47.619  | 91.119

 Table 5.4: Experimental results of speaker verification system for telephone quality speech
                                     testing with clean database

        In Table 5.4, the simulation uses the clean database for the training phase. As
telephone quality speech samples were used for the client and impostor tests, the total error
rate increases drastically compared to Table 5.3.


        Figures 5.10 and 5.11 illustrate the ROC and EER of the telephone quality speech
testing with the clean database.




              Figure 5.10: ROC for telephone quality speech testing with clean database












              Figure 5.11: EER for telephone quality speech testing with clean database


        Figure 5.11 illustrates the intersection of FRR and FAR for the simulation of
telephone quality speech testing with the clean database. The EER is very high, about
47.5806% at a threshold value of 13.4725. Hence, the total error rate is 91.119%.



5.3.4 Clean speech training and speech after inverse filtering telephone quality
       speech testing


  Simulated telephone quality speech  →  Inverse filter (trained by adaptive filter)  →  Desired speech (approximate clean speech)

                           Figure 5.12: Flow diagram for inverse filtering



        The purpose of the inverse filter is to recover better recognition accuracy after the
clean speech has passed through the simulated telephone channel. In place of an exact inverse
filter, an adaptive filter is used, trained with the Least Mean Square (LMS) algorithm as
shown in figure 5.13. The details of the adaptive filter training can be seen in Appendix A.
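The LMS training loop of Figure 5.13 can be sketched as follows. This is an illustrative NumPy version, not the project's MATLAB code from Appendix A; the tap count and step size `mu` are hypothetical values:

```python
import numpy as np

def lms_train(x, d, n_taps=32, mu=0.01):
    """Sketch of LMS training of the adaptive (inverse) filter in Figure
    5.13: x is the telephone-quality input, d the clean (desired) speech.
    Returns the learned filter coefficients."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)        # holds x(n), x(n-1), ..., x(n-n_taps+1)
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        y = w @ buf               # filter output y(n)
        e = d[n] - y              # error e(n) = x_desired(n) - y(n)
        w += 2 * mu * e * buf     # LMS coefficient update
    return w
```

Once trained, the coefficients `w` are applied to the telephone quality speech to approximate the clean speech before verification.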





        Clean speech ──────────────────────────► x(n) ──(+)──► e(n)
                                                          │(−)
        Telephone speech ──► Adaptive filter ──► y(n) ────┘
                            Figure 5.13: Flow diagram of adaptive filter
        After the training phase of the adaptive filter, the frequency response of the filter
is as shown in figure 5.14.




                          Figure 5.14: Frequency response of adaptive filter




            Figure 5.15: Telephone quality speech and the speech after inverse filtering




Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR% | FAR%    | TER%
   1     |            1                |             10               |            1             |            81              |   0.05    |  0   | 24.6914 | 24.6914
   2     |           20                |             10               |           25             |            62              |   6.03    | 27.5 | 27.4194 | 54.9194
   3     |           40                |             10               |           40             |            42              |   2.01    | 20   | 19.0476 | 39.0476

   Table 5.5: Experimental results of speaker verification system for speech after inverse
                     filtering the telephone quality speech with clean database

        After passing through the inverse filter, the speaker verification accuracy is better
than in Table 5.4, but the total error rate remains higher than that of the clean speech
speaker verification in Table 5.2.




           Figure 5.16: ROC for speech after inverse filtering the telephone quality speech
                                      testing with clean database












           Figure 5.17: EER for speech after inverse filtering the telephone quality speech
                                 testing with clean database
        Figure 5.17 illustrates the intersection of FRR and FAR for the simulation of speech
after inverse filtering the telephone quality speech, tested with the clean database. The EER
is still high, about 20% at a threshold value of 2.01. Hence, the total error rate is
39.0476%.



5.3.5 Clean speech training and speech adding noise to clean speech testing


                  Clean speech signal + White noise  →  Noisy speech signal


                     Figure 5.18: Flow diagram of adding noise to clean speech


        To understand the effect of white noise, white noise is added to the clean speech,
with the signal-to-noise ratio (SNR) measured at the input. Experiments were carried out at
various SNRs (10 dB, 15 dB, 20 dB, 25 dB, 30 dB). Figure 5.19 shows the power spectra of the
clean speech and of the speech with white noise.
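The noise-addition step of Figure 5.18 can be sketched as follows; this is an illustrative NumPy sketch in which Gaussian white noise is scaled to hit a target SNR, not the project's MATLAB code:

```python
import numpy as np

def add_white_noise(speech, snr_db, rng=None):
    """Add white Gaussian noise to the clean speech so that the
    signal-to-noise ratio of the result equals snr_db, with SNR defined
    from the signal and noise powers."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(speech ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise
```

Calling `add_white_noise(clean, 10.0)` and so on produces the 10–30 dB SNR test material used in the experiments below.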






                          Figure 5.19: Clean speech and speech with white noise

Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR%  | TER%
   1     |           20                |             20               |           20             |            20              |   2.39    | 55    | 55    | 110
   2     |           40                |             20               |           40             |            40              |   2.35    | 46.25 | 46.25 | 92.5

 Table 5.6: Experimental results of speaker verification system for SNR 10 dB noise speech
                                      testing with clean database


        After adding noise to the clean speech, it can be seen from Table 5.6 that the total
error rate is much higher than that of the clean speech speaker verification system.




             Figure 5.20: ROC for noise speech (SNR 10 dB) testing with clean database












              Figure 5.21: EER for noise speech (SNR 10 dB) testing with clean database


        Figure 5.21 shows the EER for the noise speech (SNR 10 dB) speaker verification
system: the EER is around 46.25% at a threshold of 2.351.

        Next, experiments at SNRs of 15 dB, 20 dB, 25 dB and 30 dB were carried out. The
results of the speaker verification system are shown in Tables 5.7, 5.8, 5.9 and 5.10.



Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR%  | TER%
   3     |           20                |             20               |           20             |            20              |   1.77    | 52.5  | 52.5  | 105
   4     |           40                |             20               |           40             |            40              |   2.61    | 38.75 | 38.75 | 77.5

 Table 5.7: Experimental results of speaker verification system for SNR 15 dB noise speech
                                     testing with clean database



Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR% | TER%
   5     |           20                |             20               |           20             |            20              |   2.26    | 50    | 50   | 100
   6     |           40                |             20               |           40             |            40              |   2.64    | 33.75 | 35   | 68.75

 Table 5.8: Experimental results of speaker verification system for SNR 20 dB noise speech
                                     testing with clean database





Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR%  | TER%
   7     |           20                |             20               |           20             |            20              |   3.228   | 45    | 47.5  | 92.5
   8     |           40                |             20               |           40             |            40              |   3.43    | 26.25 | 26.25 | 52.5

 Table 5.9: Experimental results of speaker verification system for SNR 25 dB noise speech
                                     testing with clean database



Test No: | Training Model (20 samples) | Background Model (2 samples) | Client Model (2 samples) | Impostor Model (2 samples) | Threshold | FRR%  | FAR%  | TER%
   9     |           20                |             20               |           20             |            20              |   3.26    | 37.5  | 37.5  | 75
  10     |           40                |             20               |           40             |            40              |   4.08    | 11.25 | 11.25 | 22.5

Table 5.10: Experimental results of speaker verification system for SNR 30 dB noise speech
                                     testing with clean database

        From Tables 5.6 to 5.10, it can be concluded that the higher the SNR, the better the
result of the speaker verification system. Figures 5.22 and 5.23 show the ROC and EER of the
speaker verification system with the clean database and noise speech at SNR 30 dB.




             Figure 5.22: ROC for noise speech (SNR 30 dB) testing with clean database











                Figure 5.23: EER for noise speech (SNR 30 dB) testing with clean database
        Figure 5.23 shows the EER for the noise speech (SNR 30 dB) speaker verification
system: the EER is around 11.25% at a threshold of 4.08.



5.4    Comparison of Experimental Results



                                   Experimental Results Summary

No: | Simulation Methods                                                              | Threshold | FRR (%) | FAR (%) | TER (%)
 1  | Clean Speech                                                                    |   3.1     |  2.5    |  2.81   |  4.881
 2  | Telephone Quality Speech                                                        |   3.6     | 16.25   | 15.4762 | 31.7262
 3  | Clean Database with Telephone Quality Speech                                    |  13.45    | 47.5    | 47.619  | 91.119
 4  | Clean Database with speech after inverse filtering the telephone quality speech |   2.01    | 20      | 19.0476 | 39.0476

                          Table 5.11: Experimental Results Summary


       In result summary table 5.11, the performances of simulation of speaker verification
methods were compared by the total error rate %. Among the simulation methods, the clean
speech speaker verification is the best with the better total error rate (%) of about 27%, 87%
and 35% respectively. From the telephone quality simulation results, it can be found that


degradation of the speech signal over telephone lines has a significant impact on speaker
verification. Table 5.11 also shows that training with telephone-quality speech gives better
verification accuracy than training with the clean database.


       To improve the verification accuracy, an inverse filter was used to recover a signal
close to the clean speech after the telephone-channel simulation. After inverse filtering the
telephone-quality speech, the total error rate drops significantly, by about 52 percentage
points relative to testing the clean database with telephone-quality speech. However, the
result is not much different from training with telephone-quality speech: from Table 5.11,
the inverse filtering method gives a slightly higher total error rate, by about 8 percentage points.
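The total error rates compared above are computed as the sum of the false rejection and false acceptance rates at a chosen decision threshold. A minimal sketch of this calculation (in Python rather than the project's MATLAB; the score values are made up for illustration):

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Compute FRR, FAR and TER (in %) at a given decision threshold.

    A trial is accepted when its score exceeds the threshold, so client
    scores at or below it are false rejections and impostor scores above
    it are false acceptances.
    """
    frr = 100.0 * sum(s <= threshold for s in client_scores) / len(client_scores)
    far = 100.0 * sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far, frr + far  # TER is simply FRR + FAR

# Hypothetical match scores, for illustration only
clients = [5.2, 4.7, 2.9, 6.1]
impostors = [1.2, 3.5, 2.0, 0.8]
frr, far, ter = error_rates(clients, impostors, threshold=3.1)
```

The TER column of Tables 5.11 and 5.12 is this FRR + FAR sum evaluated at the threshold listed for each experiment.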


                     Experimental Results Summary for Noisy Speech

 No:  Simulation Method                                Threshold   FRR (%)   FAR (%)   TER (%)
  1   Clean database with noisy speech at SNR 10 dB      2.35      46.25     46.25     92.5
  2   Clean database with noisy speech at SNR 15 dB      2.61      38.75     38.75     77.5
  3   Clean database with noisy speech at SNR 20 dB      2.64      33.75     35        68.75
  4   Clean database with noisy speech at SNR 25 dB      3.43      26.25     26.25     52.5
  5   Clean database with noisy speech at SNR 30 dB      4.08      11.25     11.25     22.5
  6   Clean database with clean speech testing           3.1        2.5       2.5       5


                 Table 5.12: Experimental Results Summary for noisy speech


       After noise was added to the clean speech, the total error rate is notably higher than
when testing with clean speech. Among the simulation methods, testing the clean database
with noisy speech at SNR 10 dB gives the highest total error rate. The total error rates at
SNR 10 dB, 15 dB, 20 dB, 25 dB and 30 dB exceed the clean-speech total error rate by about
87%, 72%, 63%, 47% and 17% respectively. From the noisy-speech result summary in
Table 5.12, it was observed that the higher the signal-to-noise ratio, the better the accuracy
of the speaker verification.
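Noisy test speech at a prescribed SNR, as used in Table 5.12, can be generated by scaling white Gaussian noise so that the signal-to-noise power ratio matches the target. A sketch of the idea (Python/NumPy rather than the project's MATLAB; the sine-tone input is purely illustrative):

```python
import numpy as np

def add_noise(speech, snr_db, rng=np.random.default_rng(0)):
    """Add white Gaussian noise to a signal at a target SNR (dB).

    The noise is scaled so that the ratio of signal power to noise
    power satisfies SNR_dB = 10*log10(Ps/Pn).
    """
    noise = rng.standard_normal(len(speech))
    p_signal = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Illustration with a synthetic 440 Hz tone instead of a TIMIT utterance
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clean, snr_db=30)
```

Repeating the call with `snr_db` set to 10, 15, 20 and 25 produces the remaining test conditions of Table 5.12.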





                   CHAPTER 6: Conclusion and Recommendation

6.1    Conclusion

       In this project, in order to come closer to real-life telephone applications, various
circumstances were simulated and tested with a spoken speech database (TIMIT) to assess
the accuracy of the speaker verification system. Firstly, speaker-distinctive features were
extracted as MFCC feature vectors, computed in the cepstral domain, for use in speaker
verification. A GMM approach was used for training and evaluating the client and impostor
models. As a first experiment, a clean-speech speaker verification system was implemented
to obtain match scores for the system. Telephone-quality speech was then simulated, and the
speaker verification system was tested with both the clean database and the telephone-quality
speech database.
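The verification decision behind these experiments is typically a log-likelihood ratio: the average log-likelihood of the test MFCC frames under the claimed client GMM minus that under a background model. A toy NumPy sketch with diagonal covariances (the mixture parameters here are invented for illustration, not the project's trained models):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of feature vectors X (frames x dims)
    under a diagonal-covariance Gaussian mixture model."""
    logps = []
    for w, m, v in zip(weights, means, variances):
        # log of a diagonal Gaussian density for every frame
        lp = -0.5 * np.sum(np.log(2 * np.pi * v) + (X - m) ** 2 / v, axis=1)
        logps.append(np.log(w) + lp)
    # log-sum-exp over mixture components, averaged over frames
    L = np.vstack(logps)
    mx = L.max(axis=0)
    return np.mean(mx + np.log(np.exp(L - mx).sum(axis=0)))

# Toy 2-component models in a 2-D feature space (illustrative only)
w = np.array([0.5, 0.5])
client_means = np.array([[0.0, 0.0], [2.0, 2.0]])
ubm_means = np.array([[5.0, 5.0], [7.0, 7.0]])
var = np.ones((2, 2))

X = np.zeros((10, 2))              # test frames lying near the client model
score = gmm_loglik(X, w, client_means, var) - gmm_loglik(X, w, ubm_means, var)
```

A positive score favours the client hypothesis; comparing it against a threshold gives the accept/reject decision whose FRR/FAR trade-off is reported in the tables above.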


       From the results, the accuracies of the other simulated-speech verification
experiments fall far behind that of clean-speech verification. After the simulated telephone
channel, verification accuracy drops considerably. This is because the speech bandwidth was
reduced to 3.4 kHz, well below the 8 kHz available at the original 16 kHz sampling rate. It
indicates that useful frequency components above 3.4 kHz are lost. To recover these
frequency components, an inverse filter was designed and tested on the speaker verification
system. The accuracy obtained with inverse filtering is much better than that of testing the
clean database with telephone-quality speech, but it gives about the same performance as
training on the telephone-quality speech database.


       In the next experiment, noise was added and its effect on the performance of speaker
verification was measured. It was observed that the lower the signal-to-noise ratio, the
greater the impact on the accuracy of the system. Verification at SNR 30 dB gives the best
performance among the noisy-speech tests, but its accuracy is lower than that of clean-speech
verification, as expected.


       The experimental results of the clean-speech simulation achieved in this project are
comparable to those of current state-of-the-art speaker verification systems. Thus, it can be
concluded that the project has accomplished its objectives.






6.2       Recommendations for future study


       In the field of speaker recognition, the study of robust feature techniques is ongoing.
Had it not been for time constraints, we would have liked to investigate the results under
further circumstances and try out different feature techniques.


To find out the accuracy of the speaker verification system on a noisy telephone channel:
•   Adding white noise to the simulated telephone-quality speech.
•   Implementing a noise cancellation filter in the speaker verification system.


To further enhance the performance of the speaker verification system:
•   Evaluate feature extraction with the delta MFCC approach.
•   Evaluate feature extraction with the LP cepstrum approach.
•   Use a discriminative classifier such as the Support Vector Machine (SVM) for speaker
    verification in place of the GMM approach.
•   Compress the training data using vector quantization (VQ) techniques.
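As an illustration of the first enhancement listed, delta MFCCs append a regression-based estimate of the frame-to-frame derivative of each cepstral coefficient. A sketch of the standard formula (Python/NumPy, assuming a precomputed MFCC matrix; the project itself would implement this in MATLAB):

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients of a (frames x coeffs) feature matrix using
    the standard regression formula over +/-N neighbouring frames."""
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        # difference between the frame n steps ahead and n steps behind
        out += n * (padded[N + n:N + n + len(features)]
                    - padded[N - n:N - n + len(features)])
    return out / denom

mfcc = np.arange(20, dtype=float).reshape(5, 4)  # dummy 5-frame MFCC matrix
d = delta(mfcc)
```

The delta matrix is usually concatenated column-wise with the static MFCCs before GMM training, doubling the feature dimension.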


To find out the performance of speaker verification over Internet communications:
•   Evaluate a Voice over Internet Protocol (VoIP) quality speech simulation, taking into
    consideration microphone quality, packet-based transmission and reception of data, and
    packet loss over the communication channel.





                                           Part 2:
                             Critical Review and Reflections
       Even though much research has been carried out on speaker recognition, speaker
identification/verification was a new and challenging subject for me. Therefore, at the
beginning of the project, literature research on the overview of biometric authentication and
speaker recognition was carried out. Finding out the origin and nature of this topic was never
easy. With the help of the project workshop provided by the school, searching for reference
materials was relatively easy. The Lee Kong Chian reference library, the Ngee Ann
Polytechnic library, IEEE technical papers and the World Wide Web helped me greatly in
my literature research. However, it was difficult to understand most of the research papers at
first reading. After spending more than a month on literature research, my understanding of
the project and my technical-paper reading skills had distinctly improved.


       As the next assignment, I prepared my project initial report, which consists of project
objectives, proposed approaches and methods to be employed, project management, an
investigation of the project background, and a skills review. In order to complete the project
objectives successfully, the proposed approaches were systematically analysed and selected
at the initial stage. The project plan was also scheduled in detail.


       For the approach to the project work, firstly, the TIMIT files were uncompressed and
converted to playable wave files. Then the evaluations of the speaker verification system
were implemented. Theories and algorithms of the GMM recognition approach were also
studied. Even though we had learnt the basics of MATLAB programming throughout the
course of my study, the MATLAB code for the algorithms and data flow was far more
complex to understand. After trial and error as well as practice, my knowledge of the
algorithms gradually improved over time.


       The existing algorithms were then modified to suit our speaker database and
implemented to obtain better accuracy, namely a total error rate below 10%. After that, the
wave files were converted to telephone-quality speech and evaluated on the speaker
verification system. This was followed by inverse filtering of the telephone-quality speech
to recover an approximation of the clean speech, which was then tested. The effect of
Gaussian white noise on clean speech was also evaluated on the speaker verification system.


        As I am new to the field of speaker recognition, many obstacles were encountered
during the project. The first problem was uncompressing the raw speech files of the TIMIT
database, but after referring to an earlier project provided by the project supervisor, the
uncompressing and conversion of the raw speech was completed within the time frame. The
adjustment of the source code for the speaker verification system was finished on time, but
the achievable total error rate was around 10%, which was not a satisfactory accuracy. After
increasing the background model training from 10 models to 20 models, the total error rate
was roughly halved on the speaker verification system with 40 client models.


        For the telephone-quality speech simulation, we first tested the accuracy by training
with telephone-quality speech and later with clean speech; no major problems were faced in
that evaluation. In the next evaluation, to recover the clean speech after the telephone
simulation, we constructed an inverse filter. For the inverse filter to be stable, the initial
low-pass filter must be minimum phase, that is, all of its poles and zeros must lie within the
unit circle. However, the IIR low-pass filter we constructed did not satisfy this condition, so
its direct inverse was unstable. As a result, the project was delayed by about two weeks
while the correct method was found. After taking advice from my project supervisor,
Dr Sirajudeen Gulam Razul, the problem was solved by using an adaptive filter to train the
coefficients of the inverse filter.
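The adaptive approach can be sketched as follows: an FIR filter is trained with the LMS rule so that its output on the channel-filtered speech tracks the (optionally delayed) clean reference. The sketch below (Python/NumPy, not the project's actual code) uses a simple two-tap minimum-phase channel in place of the telephone-channel filter; the tap count, step size and delay are illustrative:

```python
import numpy as np

def lms_inverse(filtered, clean, taps=16, mu=0.02, delay=0):
    """Train an FIR inverse filter w with the LMS rule so that
    (w * filtered)[n] approximates clean[n - delay]."""
    w = np.zeros(taps)
    for n in range(taps - 1, len(filtered)):
        x = filtered[n - taps + 1:n + 1][::-1]  # newest sample first
        e = clean[n - delay] - w @ x            # instantaneous error
        w += mu * e * x                         # LMS coefficient update
    return w

# Illustration: learn to undo a minimum-phase 2-tap channel
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
channel_out = np.convolve(clean, [1.0, 0.5])[:len(clean)]
w = lms_inverse(channel_out, clean)
recovered = np.convolve(channel_out, w)[:len(clean)]
```

For a channel that is not minimum phase, such as the Butterworth telephone-channel simulation above, a nonzero decision delay is usually needed so that a causal FIR filter can approximate the inverse.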


        For speaker verification on noisy speech, the difficulty faced was constructing test
speech with SNRs of 10 dB, 15 dB, 20 dB, 25 dB and 30 dB. After referring back to a lab
experiment from adaptive signal processing, the noisy speech at the different SNRs was
constructed. From the experimental results, it can be seen that raising the decision threshold
trades a lower FAR for a higher FRR, and that telephone-quality speech is more error-prone
than the clean-speech simulation.


From this project, I have learnt new skills such as drawing a Gantt chart and plotting an ROC
curve. My existing skills in MATLAB programming, research, analysis, problem solving,
project and time management, and technical report writing improved significantly. In short,
the project has equipped me with technical and critical thinking skills.





                                      Bibliography
[1] Sadaoki Furui, “Digital Speech Processing, Synthesis, and Recognition”, Second
    Edition, Revised and Expanded, Marcel Dekker, Inc., 2001.
[2] Douglas A. Reynolds, Richard C. Rose, “Robust Text-Independent Speaker
    Identification Using Gaussian Mixture Speaker Models”, IEEE Transactions on Speech
    and Audio Processing, VOL. 3, No. 1, January 1995
[3] Ji Ming, Timothy J. Hazen, James R. Glass and Douglas A. Reynolds, “Robust Speaker
    Recognition in Noisy Conditions”, IEEE Transactions on Audio, Speech, and Language
    Processing, VOL. 15, July 2007
[4] A. Martin and M. Przybocki, “The NIST Speaker Recognition Evaluation Series,”
    National Institute of Standards and Technology [Online].
    Available: http://www.nist.gov/speech/tests/spk
[5] Joaquin Gonzalez-Rodriguez, Javier Ortega-Garcia, Cesar Martin and Luis Hernandez,
    “Increasing Robustness in GMM Speaker Recognition Systems for Noisy and
    Reverberant Speech with Low Complexity Microphone Arrays” [Online]. Available:
    http://www.asel.udel.edu/icslp/cdrom/vol3/869/a869.pdf
[6] Vaclav Matyas and Zdenek Riha, “Biometric Authentication – Security and Usability”,
    Faculty of Informatics, Masaryk University Brno, Czech Republic [Online].
    Available: http://www.fi.muni.cz/usr/matyas/cms_matyas_riha_biometrics.pdf
[7] Gintaras Barisevicius, “Text-Independent Speaker Verification”, Department of
    Software Engineering, Kaunas University of Technology, Kaunas, Lithuania [Online].
    Available:http://www.speech.kth.se/~rolf/NGSLT/gslt_papers_2004/barisevicius_term_
    paper.pdf
[8] Alexandre Preti, Nicolas Scheffer and Jean-Francois Bonastre, “Discriminant
    Approaches for GMM Based Speaker Detection Systems”, University of Avignon,
    Avignon, France.
[9] Vincent Wan, “Speaker Verification using Support Vector Machines”, Department of
    Computer Science, University of Sheffield, United Kingdom. [Online]. Available:
    http://www.dcs.shef.ac.uk/~vinny/docs/pdf/thesis.pdf
[10] Yiying Zhang, David Zhang, and Xiaoyan Zhu, “A Novel Text-Independent Speaker
    Verification Method Based on the Global Speaker Model”, IEEE Transactions on
    Systems, Man And Cybernetics—PART A: Systems and Humans, VOL. 30, NO. 5,
    September 2000




[11] Bing Xiang, “Text-Independent Speaker Verification with Dynamic Trajectory Model”,
    IEEE Signal Processing Letters, VOL. 10, NO. 5, May 2003
[12] Bing Xiang and Toby Berger, “Efficient Text-Independent Speaker Verification with
    Structural Gaussian Mixture Models and Neural Network”, IEEE Transactions on
    Speech and Audio Processing, VOL. 11, NO. 5, September 2003
[13] Vincent Wan and Steve Renals, “Speaker Verification Using Sequence Discriminant
    Support Vector Machines”, IEEE Transactions on Speech and Audio Processing, VOL.
    13, NO. 2, March 2005
[14] Nengheng Zheng, Tan Lee and P. C. Ching, “Integration of Complementary Acoustic
    Features for Speaker Recognition”, IEEE Signal Processing Letters, VOL. 14, NO. 3,
    March 2007
[15] Benoit G. B. Fauve, Driss Matrouf, Nicolas Scheffer, Jean-Francois Bonastre, John S. D.
    Mason, “State-of-the-Art Performance in Text-Independent Speaker Verification
    Through Open-Source Software”, IEEE Transactions on Audio, Speech and Language
    Processing, VOL. 15, NO. 7, September 2007
[16] Julie A. Jacko and Andrew Sears, “The Human Computer Interaction Handbook:
    Fundamentals, Evolving Technologies, and Emerging Applications”, Lawrence Erlbaum
    Associates, 2003.
[17] Jan Dool, “Investigation of the Impact of High Frequency Transmitted Speech on
    Speaker Recognition”, Master of Electronic Engineering report, University of
    Stellenbosch, pp. 36-45, March 2002.
[18] Xu Hefeng, “Biometric Fusion Studies”, Master of Science in Signal Processing report,
    Nanyang Technological University, 2006.
[19] Pedro J. Moreno and Richard M. Stern, “Sources of Degradation of Speech Recognition
    In The Telephone Network”, Department of Electrical and Computer Engineering,
    School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
[20] Qin Jin, “Robust Speaker Recognition”, Doctor of Philosophy in Language and
    Information Technology thesis, Carnegie Mellon University, 2007.
[21] J. Kamarauskas, “Speaker Recognition using Gaussian Mixture Models”, Institute of
    Mathematics and Informatics.
[22] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[23] http://www.scholarpedia.org/article/Speaker_recognition
[24] http://www.biometrics.org/html/introduction.html
[25] http://www.biometrics.gov/ReferenceRoom/Introduction.aspx




[26] http://www.globalsecurity.org/security/systems/biometrics.htm
[27] http://www.lsv.uni-saarland.de/dsp_ss05_chap9.pdf
[28] http://www.bioid.com/sdk/docs/About_EER.htm
[29] http://www.scholarpedia.org/article/Biometric_authentication
[30] http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html
[31]. http://www.speaker-recognition.org/





                                     APPENDICES

                      Appendix A: Wave Conversion Algorithms

A.1    Clean Speech Wave Conversion Algorithm
%****** Program for uncompressing and converting TIMIT '.wav' files ******%
close all
clear all

InputPath = 'E:\FYP\Testing\DR2\MWVW0';
InputName = 'SA1.wav';
OutPutPath= 'E:\FYP\Testingout\MWVW0';
OutPutName= 'SA1out.wav';
Rate = 16000;
Amplifier = 10;
SUCCEED = Convert_Wav(InputPath, InputName, OutPutPath, OutPutName, Rate, Amplifier);
------------------------------------------------------------
%Convert_Wav Function used in Conversion of training Data to wave output%
%Convert from TIMIT 'In_FileName' to Standard .wav named 'OUT_FileName'
%SUCCEED == 1 means OK
function SUCCEED = Convert_Wav(InputPath, InputName, OutPutPath, OutPutName, Rate, Amplifier)
SUCCEED = 0;
In_FileName = sprintf('%s\\%s', InputPath, InputName);
Sound = readsph(In_FileName);
OUT_FileName = sprintf('%s\\%s', OutPutPath, OutPutName);
wavwrite(Sound * Amplifier, Rate, OUT_FileName);
SUCCEED = 1;
------------------------------------------------------------
function [y,fs,ffx]=readsph(filename,mode,nmax,nskip)
%READSPH Read a SPHERE/TIMIT format sound file
% [Y,FS,FFX]=READSPH(FILENAME,MODE,NMAX,NSKIP)
%
% Input Parameters:
%
%   FILENAME gives the name of the file (with optional .SPH extension) or alternatively
%            can be the FFX output from a previous call to READSPH having the 'f' mode option
%   MODE     specifies the following (*=default):
%
%    Scaling: 's'   Auto scale to make data peak = +-1 (use with caution if reading in chunks)
%             'r'   Raw unscaled data (integer values)
%             'p' * Scaled to make +-1 equal full scale
%             'o'   Scale to bin centre rather than bin edge (e.g. 127 rather than 127.5 for 8 bit values)
%                   (can be combined with n+p,r,s modes)
%             'n'   Scale to negative peak rather than positive peak (e.g. 128.5 rather than 127.5 for 8 bit values)
%                   (can be combined with o+p,r,s modes)
%   Format:   'l'   Little endian data (Intel,DEC) (overrides indication in file)
%             'b'   Big endian data (non Intel/DEC) (overrides indication in file)



%   File I/O: 'f'      Do not close file on exit
%               'd'    Look in data directory: voicebox('dir_data')
%
%   NMAX       maximum number of samples to read (or -1 for unlimited [default])
%   NSKIP      number of samples to skip from start of file
%              (or -1 to continue from previous read when FFX is given instead of FILENAME [default])
%
% Output Parameters:
%
%   Y          data matrix of dimension (samples,channels)
%   FS         sample frequency in Hz
%   FFX        Cell array containing
%
%      {1}      filename
%      {2}      header information
%          {1} first header field name
%          {2} first header field value
%      {3}      format string (e.g. NIST_1A)
%      {4}(1) file id
%          (2) current position in file
%          (3) dataoff    byte offset in file to start of data
%          (4) order byte order (l or b)
%          (5) nsamp number of samples
%          (6) number of channels
%          (7) nbytes     bytes per data value
%          (8) bits number of bits of precision
%          (9) fs     sample frequency
%          (10) min value
%          (11) max value
%          (12) coding: 0=PCM,1=uLAW + 0=no compression,10=shorten,20=wavpack,30=shortpack
%          (13) file not yet decompressed
%      {5}      temporary filename
%
%   If no output parameters are specified, header information will be printed.
%   To decode shorten-encoded files, the program shorten.exe must be in the same directory as this m-file
%

%      Copyright (C) Mike Brookes 1998
%      Version: $Id: readsph.m,v 1.3 2005/02/21 15:22:14 dmb Exp $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%
%   This program is free software; you can redistribute it and/or modify
%   it under the terms of the GNU General Public License as published by
%   the Free Software Foundation; either version 2 of the License, or
%   (at your option) any later version.
%
%   This program is distributed in the hope that it will be useful,
%   but WITHOUT ANY WARRANTY; without even the implied warranty of
%   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
%   GNU General Public License for more details.
%
%   You can obtain a copy of the GNU General Public License from



%   ftp://prep.ai.mit.edu/pub/gnu/COPYING-2.0 or by writing to
%   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%

persistent BYTEORDER
codes={'sample_count'; 'channel_count';
'sample_n_bytes';'sample_sig_bits'; 'sample_rate'; 'sample_min';
'sample_max'};
codings={'pcm'; 'ulaw'};
compressions={',embedded-shorten-'; ',embedded-wavpack-'; ',embedded-shortpack-'};
if isempty(BYTEORDER) BYTEORDER='l'; end
if nargin<1 error('Usage: [y,fs,hdr,fidx]=READSPH(filename,mode,nmax,nskip)'); end
if nargin<2 mode='p';
else mode = [mode(:).' 'p'];
end
k=find((mode>='p') & (mode<='s'));
mno=all(mode~='o');                         % scale to input limits not output limits
sc=mode(k(1));
if any(mode=='l') BYTEORDER='l';
elseif any(mode=='b') BYTEORDER='b';
end
if nargout
    ffx=cell(5,1);
    if ischar(filename)
         if any(mode=='d')
              filename=fullfile(voicebox('dir_data'),filename);
         end
         fid=fopen(filename,'rb',BYTEORDER);
         if fid == -1
              fn=[filename,'.sph'];
              fid=fopen(fn,'rb',BYTEORDER);
              if fid ~= -1 filename=fn; end
         end
         if fid == -1
              error(sprintf('Can''t open %s for input',filename));
         end
         ffx{1}=filename;
    else
         if iscell(filename)
              ffx=filename;
         else
              fid=filename;
         end
    end

    if isempty(ffx{4});
        fseek(fid,0,-1);
        str=char(fread(fid,16)');
        if str(8) ~= 10 | str(16) ~= 10 fclose(fid); error(sprintf('File does not begin with a SPHERE header')); end
        ffx{3}=str(1:7);
        hlen=str2num(str(9:15));
        hdr={};
        while 1
            str=fgetl(fid);
            if str(1) ~= ';'
                [tok,str]=strtok(str);


                   if strcmp(tok,'end_head') break; end
                   hdr(end+1,1)={tok};
                   [tok,str]=strtok(str);
                   if tok(1) ~= '-' error('Missing ''-'' in SPHERE header');
end
                   if tok(2)=='s'
                        hdr(end,2)={str(2:str2num(tok(3:end))+1)};
                   elseif tok(2)=='i'
                        hdr(end,2)={sscanf(str,'%d',1)};
                   else
                        hdr(end,2)={sscanf(str,'%f',1)};
                   end
            end
        end
        i=find(strcmp(hdr(:,1),'sample_byte_format'));
        if ~isempty(i)
            bord=char('b'+('l'-'b')*(hdr{i,2}(1)=='0'));
            if bord ~= BYTEORDER & mode~='b' & mode ~='l'
                 BYTEORDER=bord;
                 fclose(fid);
                 fid=fopen(filename,'rb',BYTEORDER);
            end
        end
        i=find(strcmp(hdr(:,1),'sample_coding'));
        icode=0;                  % initialize to PCM coding
        if ~isempty(i)
            icode=-1;                     % unknown code
            scode=hdr{i,2};
            nscode=length(scode);
            for j=1:length(codings)
                 lenj=length(codings{j});
                 if strcmp(scode(1:min(nscode,lenj)),codings{j})
                     if nscode>lenj
                          for k=1:length(compressions)
                              lenk=length(compressions{k});
                              if strcmp(scode(lenj+1:min(lenj+lenk,nscode)),compressions{k})
                                  icode=10*k+j-1;
                                  break;
                              end
                          end
                     else
                          icode=j-1;
                     end
                     break;
                 end
            end
        end

          info=[fid; 0; hlen; double(BYTEORDER); 0; 1; 2; 16; 1 ; 1; -1;
icode];
          for j=1:7
              i=find(strcmp(hdr(:,1),codes{j}));
              if ~isempty(i)
                  info(j+4)=hdr{i,2};
              end
          end
          if ~info(5)
              fseek(fid,0,1);
              info(5)=floor((ftell(fid)-info(3))/(info(6)*info(7)));
          end



         ffx{2}=hdr;
         ffx{4}=info;
    end
    info=ffx{4};
    if nargin<4 nskip=info(2);
    elseif nskip<0 nskip=info(2);
    end

    ksamples=info(5)-nskip;
    if nargin>2
        if nmax>=0
            ksamples=min(nmax,ksamples);
        end
    end

    if ksamples>0
        fid=info(1);
        if icode>=10 & ~length(ffx{5})
             fclose(fid);
             dirt=voicebox('dir_temp');
             [fnp,fnn,fne,fnv]=fileparts(filename);
             filetemp=fullfile(dirt,[fnn fne fnv]);
             if exist(filetemp)   % need to explicitly delete old file since shorten makes read-only
                  doscom=['del /f ' filetemp];
                  if dos(doscom) % run the program
                      error(sprintf('Error running DOS command: %s',doscom));
                  end
             end
             if floor(icode/10)==1                 % shorten
                  [fnp,fnn,fne,fnv]=fileparts(mfilename('fullpath'));
                  doscom=[fullfile(fnp,'shorten.exe') ' -x -a ' num2str(info(3)) ' ' filename ' ' filetemp];
                  %                     fprintf(1,'Executing: %s\n',doscom);
                  if dos(doscom) % run the program
                      error(sprintf('Error running DOS command: %s',doscom));
                  end
             else
                  error('unknown compression format');
             end
             ffx{5}=filetemp;
             fid=fopen(filetemp,'r',BYTEORDER);
             if fid<0 error(sprintf('Cannot open decompressed file %s',filetemp)); end
             info(1)=fid;                              % update fid
        end
        info(2)=nskip+ksamples;
        pk=pow2(0.5,8*info(7))*(1+(mno/2-all(mode~='n'))/pow2(0.5,info(8))); % use modes o and n to determine effective peak
        fseek(fid,info(3)+info(6)*info(7)*nskip,-1);
        nsamples=info(6)*ksamples;
        if info(7)<3
             if info(7)<2
                  y=fread(fid,nsamples,'uchar');
                  y=y-128;
             else
                  y=fread(fid,nsamples,'short');
             end
        else
             if info(7)<4


                         y=fread(fid,3*nsamples,'uchar');
                         y=reshape(y,3,nsamples);
                         y=[1 256 65536]*y-pow2(fix(pow2(y(3,:),-7)),24);
                  else
                         y=fread(fid,nsamples,'long');
                  end
              end
              if sc ~= 'r'
                  if sc=='s'
                      if info(10)>info(11)
                           info(10)=min(y);
                           info(11)=max(y);
                      end
                      sf=1/max(max(abs(info(10:11))),1);
                  else sf=1/pk;
                  end
                  y=sf*y;
              end
              if info(6)>1 y = reshape(y,info(6),ksamples).'; end
       else
              y=[];
       end

       if mode~='f'
           fclose(fid);
           info(1)=-1;
           if length(ffx{5})
               doscom=['del /f ' ffx{5}];
               if dos(doscom) % run the program
                    error(sprintf('Error running DOS command: %s',doscom));
               end
               ffx{5}=[];
           end
       end
       ffx{4}=info;
       fs=info(9);
else
    [y1,fs,ffx]=readsph(filename,mode,0);
    info=ffx{4};
    if ~isempty(ffx{1}) fprintf(1,'Filename: %s\n',ffx{1}); end
    fprintf(1,'Sphere file type: %s\n',ffx{3});
    fprintf(1,'Duration = %ss: %d channel * %d samples @ %sHz\n',sprintsi(info(5)/info(9)),info(6),info(5),sprintsi(info(9)));
end





A.2    Telephone Quality Speech (simulation) Wave Conversion Algorithm
% Conversion of clean speech to telephone quality speech
close all
clear all

fs = 16000;
Amplifier = 10;

%define the filter order and cut-off frequency
fc = 3400; % Cut-off frequency (Hz)
order = 40; % Filter order

datadir = 'C:\FYP\LPF\Original_data';
lpodir='C:\FYP\LPF\LPF_data';
dirstr= datadir;

[rows, cols] = size(dirstr);
if rows == 0 & cols == 0
    return;
end

dirs = dir(char(dirstr));
[dirrows, dircols] = size(dirs);
for i = 1:dirrows
    if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
        files = dir([char(dirstr), '/', dirs(i).name]);
        [rows, cols] = size(files);

         %Create the same folder structure as the original data
         newdir=([char(lpodir),'\',dirs(i).name]);
         mkdir(newdir);

        for j = 1:rows
            if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                [r,c] = size(files(j).name);
                ext = files(j).name(c-3:c);
                if strcmp(files(j).name(c-3), '.') == 1 & strcmp(ext, '.wav') == 1  % process only .wav files

                    %Define the directory to load the file for conversion
                    lpdir =([char(dirstr), '\', dirs(i).name,'\',files(j).name]);

                        [X,fs]=wavread(lpdir);

                        %Create the Butterworth filter coefficients
                        % [0:pi] maps to [0:1] here
                        [B,A] = butter(order,2*fc/fs,'low');

                        %Filter the original waveform with the Butterworth
                        %filter coefficients
                        Y = filter(B,A,X);

                        %Define the directory to write the wave after passing
                        %through low-pass filter
                        lpodir1=([char(lpodir),'\',dirs(i).name]);



                              lpfile=(['LPF_',char(files(j).name)]);
                              lpfo=([char(lpodir1),'\',lpfile]);

                              wavwrite(Y,fs,lpfo);

                        end
                  end
            end
      end
end



A.3     Low-Pass Filter Frequency Response Program
%******Low-pass filter frequency response******%
close all
clear all

%Design IIR low-pass filter
order = 40; % Filter order
fs = 16000;
fc = 3400; % Cut-off frequency (Hz)
[B,A] = butter(order,2*fc/fs,'low'); % [0:pi] maps to [0:1] here

%Analyse the frequency response of the Butterworth LPF
[H1,F1] = freqz(B,A,1024,fs);

%The following statement plot the frequency response of the LPF
figure(3);
plot(F1,20*log10(abs(H1)+eps)), grid, zoom on;
axis([min(F1), max(F1), -50, 3]); %set the axis of plotting
title('Lowpass Filter cutoff at approximately 3.4 kHz');
ylabel('Amplitude response in [dB]');
xlabel('Frequency range in [Hz]');




A.4     Adaptive Filter training algorithm and its frequency response
% Training of Adaptive filter using LMS algorithm
close all
clear all

%Clean speech for reference at LMS algo
lpdir='C:\FYP\Test_program\ori\1\SA1out.wav';
[X,fs]=wavread(lpdir);

%Design IIR low-pass filter
order = 40;     % Filter order
fc = 3400;      % Cut-off frequency (Hz)
[B,A] = butter(order,2*fc/fs,'low');                  % [0:pi] maps to [0:1] here
Y = filter(B,A,X);

totallength=size(X,1);
sysorder=20;
N=53863 ;       % Take 53863 points for training

%begin of algorithm



w     = zeros ( sysorder     , 1 ) ;

for  jj = 1:200
    for n = sysorder : N
        u = Y(n:-1:n-sysorder+1) ;
        y(n)= w' * u;
        %y(n)=Y(n)+y(n);
        e(n) = X(n) - y(n) ;
        % Start with a large mu to speed convergence, then reduce it to settle on the correct weights
        if n < 20
             mu=0.32;
        else
             mu=0.15;
        end
        w = w + mu * e(n) * u ;
    end
end


figure(1)
hold on
plot(X)
plot(y,'r');
zoom on;
title('System output') ;
xlabel('Samples')
ylabel('True and estimated output')

figure (2)
semilogy((abs(e))) ;
title('Error curve') ;
xlabel('Samples')
ylabel('Error value')

% Plot the spectrum of the adaptive filter
Nfft=4096;
w=w';
Window=hanning(Nfft);       % Generate the analysis window using a Hanning window
[Sxx,I]=psd(w,Nfft,fs,Window);

figure (4)
plot(I,20*log10(Sxx));grid on;
title('Frequency response of the adaptive filter');
xlabel('Frequency range in [Hz]');
ylabel('Amplitude response in [dB]');

save adapt_coeff.mat w;
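For readers outside MATLAB, the training loop above is a plain LMS system-identification update, w <- w + mu*e[n]*u[n]. The following is an illustrative Python/NumPy sketch of the same rule (the function name and the toy 3-tap system are my own, not from the report); it recovers a known FIR filter from its input and output:

```python
import numpy as np

def lms_identify(x, d, order, mu):
    """Adapt FIR weights w so that w filtering x tracks d (LMS rule)."""
    w = np.zeros(order)
    err = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n:n - order:-1]        # newest 'order' input samples, newest first
        y_n = w @ u                  # current filter output
        err[n] = d[n] - y_n          # error against the desired signal
        w = w + mu * err[n] * u      # LMS weight update
    return w, err

# Toy system identification: recover a known 3-tap FIR filter (my own example)
rng = np.random.default_rng(0)
h = np.array([0.6, -0.3, 0.1])       # "unknown" system
x = rng.standard_normal(20000)       # white excitation
d = np.convolve(x, h)[:len(x)]       # desired signal = system output
w, err = lms_identify(x, d, order=3, mu=0.05)
print(np.round(w, 2))                # weights converge towards h
```

In the report's script the desired signal is the clean speech X and the filter input is the low-pass-filtered speech Y, so the converged weights act as an equalizer for the simulated telephone channel.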



A.5    Wave Conversion Algorithm by using LMS adaptive filter
% Conversion algorithm for filtering telephone speech with adaptive filter
close all
clear all

load adapt_coeff.mat;




%Define the FIR high-pass filter using the coefficients obtained from
%adaptive filtering (adapt_coeff.mat is saved by the training script adapt.m)
hd=dfilt.dffir(w);

datadir = 'C:\FYP\12_LPF_ori_wave_data\LPF_data';
adodir='C:\FYP\12_LPF_ori_wave_data\ADF_data';
dirstr= datadir;

[rows, cols] = size(dirstr);
if rows == 0 & cols == 0
    return;
end

dirs = dir(char(dirstr));
[dirrows, dircols] = size(dirs);
for i = 1:dirrows
    if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
        files = dir([char(dirstr), '/', dirs(i).name]);
        [rows, cols] = size(files);

            %Making exact folder like original data
            newdir=([char(adodir),'\',dirs(i).name]);
            mkdir(newdir);

        for j = 1:rows
            if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                [r,c] = size(files(j).name);
                ext = files(j).name(c-3:c);
                if strcmp(ext, '.wav') == 1   % process only .wav files

                    %Define the directory to load the file for conversion
                    addir = [char(dirstr), '\', dirs(i).name, '\', files(j).name];

                              [Y,fs]=wavread(addir);

                              y1 = filter(hd,Y);   % filter with the adaptive (high-pass) filter

                              %Define the directory to write the wave after passing
                              %through the adaptive filter
                              adodir1=([char(adodir),'\',dirs(i).name]);
                              adfile=(['ADF_',char(files(j).name)]);
                              adfo=([char(adodir1),'\',adfile]);

                              wavwrite(y1,fs,adfo);
                        end
                  end
            end
      end
end





A.6     Noisy Speech Conversion Algorithm
%Add white noise to clean speech

close all
clear all

SNR=10; % define various SNR

datadir = 'C:\FYP\LPF_ori_wave_data\Original_data';
cpndir='C:\FYP\noise_ori_data\C10dB';
dirstr= datadir;

[rows, cols] = size(dirstr);
if rows == 0 & cols == 0
    return;
end

dirs = dir(char(dirstr));
[dirrows, dircols] = size(dirs);
for i = 1:dirrows
    if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
        files = dir([char(dirstr), '/', dirs(i).name]);
        [rows, cols] = size(files);

            %Making exact folder like original data
            newdir=([char(cpndir),'\',dirs(i).name]);
            mkdir(newdir);

        for j = 1:rows
            if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                [r,c] = size(files(j).name);
                ext = files(j).name(c-3:c);
                if strcmp(ext, '.wav') == 1   % process only .wav files

                    %Define the directory to load the file for conversion
                    datadir = [char(dirstr), '\', dirs(i).name, '\', files(j).name];

                              [X,fs]=wavread(datadir);

                              [SNR_S,CWn]=addwhitenoise(X,SNR);

                              %Define the directory to write the wave after adding
                              %white noise
                              cpndir1=([char(cpndir),'\',dirs(i).name]);
                              cpnfile=(['C10_',char(files(j).name)]);
                              cpno=([char(cpndir1),'\',cpnfile]);

                              wavwrite(CWn,fs,cpno);

                        end
                  end
            end
      end
end




function [SNR_S,CWn]=addwhitenoise(X,SNR)
%
% Add white Gaussian noise to X at the specified SNR (dB)
% Compute the simulated SNR for confirmation
%
N=length(X);
PowerX = std(X)^2;
MagN   = sqrt(PowerX*10^(-SNR/10));
Nk     = MagN/sqrt(2)*(randn(size(1:N)));
Nk     = Nk.';
SNR_S = 10*log10(std(X)^2/std(Nk)^2);
CWn    = X+Nk;
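The same scaling can be sketched in Python: draw Gaussian noise and scale it so the signal-to-noise ratio hits a requested value in dB. This is an illustrative version with my own names; note that the MATLAB function above additionally divides the noise magnitude by sqrt(2), so its realized SNR_S comes out about 3 dB above the requested SNR, which the printed confirmation value makes visible:

```python
import numpy as np

def add_white_noise(x, snr_db, rng=None):
    """Add white Gaussian noise to x so that the SNR equals snr_db (in dB)."""
    if rng is None:
        rng = np.random.default_rng()
    power_x = np.var(x)                                     # signal power
    noise_std = np.sqrt(power_x * 10.0 ** (-snr_db / 10.0)) # target noise std
    noise = noise_std * rng.standard_normal(len(x))
    snr_actual = 10.0 * np.log10(np.var(x) / np.var(noise)) # empirical check
    return x + noise, snr_actual

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)      # one second of a 440 Hz tone
noisy, snr_actual = add_white_noise(clean, 10, rng)
print(round(snr_actual, 1))              # close to the requested 10 dB
```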






                 Appendix B: Speaker Verification System Algorithms

B.1    Speaker Training Module Algorithm
%%%%% Speaker Training Module %%%%%

%References :
% [1] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
% [2] Speaker recognition demonstration using MATLAB, Swiss Federal
%     Institute of Technology, Lausanne, Anil Alexander and Andrzej Drygajlo

clear all;
close all;

traindir = 'C:\FYP\3_Noise_Test_data\Test30\Train_purpose';

Training_num = 0;
n = 0;
No_of_Gaussians = 32;                 % Frequently encountered model orders are between 16 and 32 Gaussian components
fs = 16000;

dirstr= traindir;

[rows, cols] = size(dirstr);
if rows == 0 & cols == 0
    return;
end

dirs = dir(char(dirstr));
[dirrows, dircols] = size(dirs);
for i = 1:dirrows
    if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
        files = dir([char(dirstr), '/', dirs(i).name]);
        [rows, cols] = size(files);
        training_features = [];

        for j = 1:rows
            if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                [r,c] = size(files(j).name);
                ext = files(j).name(c-3:c);
                if strcmp(ext, '.wav') == 1   % process only .wav files

                         % files(j).name
                         speech = wavread([char(dirstr), '/', dirs(i).name, '/', files(j).name]);

                         n=n+1;

                    % Calculate the mel cepstrum of the signal using the melcepst function from voicebox
                    training_features = [training_features mfcc(speech,fs)];

                   end


            end
        end
        Training_num = Training_num + 1;
        [mu_train(:,:,Training_num),sigma_train(:,:,Training_num),c_train(:,:,Training_num)] = gmm_estimate(training_features,No_of_Gaussians);
    end
end

save Train_purposeT30.mat mu_train sigma_train c_train



B.2    Speaker Background Model Algorithm
%%%%% Speaker Background Model %%%%%

%References :
% [1] Biometric Fusion Study, NTU Master Thesis Report, Xu Hefeng
% [2] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
% [3] Speaker recognition demonstration using MATLAB, Swiss Federal
%     Institute of Technology, Lausanne, Anil Alexander and Andrzej Drygajlo

clear all;
close all;

traindir = 'C:\FYP\3_Noise_Test_data\Test30\BG_model';
%code = training(t);

BG_num = 0;
%BG_model = 0;
No_of_Gaussians = 32;                 % Frequently encountered model orders are between 16 and 32 Gaussian components
fs = 16000;

dirstr= traindir;

[rows, cols] = size(dirstr);
if rows == 0 & cols == 0
    return;
end

dirs = dir(char(dirstr));
[dirrows, dircols] = size(dirs);
for i = 1:dirrows
    if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
        files = dir([char(dirstr), '/', dirs(i).name]);
        [rows, cols] = size(files);
        BG_features = [];
        for j = 1:rows
            if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                [r,c] = size(files(j).name);
                ext = files(j).name(c-3:c);
                if strcmp(ext, '.wav') == 1   % process only .wav files

                        % files(j).name
                        speech = wavread([char(dirstr), '/', dirs(i).name, '/', files(j).name]);



                          BG_num = BG_num +1;

                    % Calculate the mel cepstrum of the signal using the melcepst function from voicebox
                    BG_features = [BG_features mfcc(speech,fs)];


                    end
              end
        end
        %BG_model = BG_model + 1;
        [mu_BG(:,:),sigma_BG(:,:),c_BG(:,:)] = gmm_estimate(BG_features,No_of_Gaussians);
    end
end

save BG_modelT30.mat mu_BG sigma_BG c_BG;


B.3    Speaker Client Evaluation Algorithm
% Speaker Client Evaluation Program

clear all;
close all;

Num_training = 40;           % no of subjects in the client set
S_evaluation = 2;            % no of samples per subject in the client evaluation set
S_training = 8;              % no of samples per subject in the client training set
n = Num_training*S_training;
Num_client = 40;             % no of client folders (e.g. 25 if there are 25 people)
Client_num = 0;
client_sample = 0;
correct_no = 0;

load Train_purposeT30.mat mu_train sigma_train c_train;
load BG_modelT30.mat mu_BG sigma_BG c_BG;

chdir_client = {'62','65','70','73','74','26','76','28','80','84','87','32',...
                '33','34','35','36','37','38','39','40','41','42','43','44',...
                '45','46','47','48','49','50','51','52','53','54','55','56',...
                '57','58','59','60'};

for idir = 1:Num_client
    % get the directory
    dirstr_speaker = strcat('C:\FYP\3_Noise_Test_data\Test30\Train_test_purpose\',chdir_client(idir));

      [rows, cols] = size(dirstr_speaker);
      if rows == 0 & cols == 0



            return;
      end

    dirs = dir(char(dirstr_speaker));
    [dirrows, dircols] = size(dirs);
    for i = 1:dirrows
        if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
             files = dir([char(dirstr_speaker), '/', dirs(i).name]);
             [rows, cols] = size(files);
             for j = 1:rows
                 if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                     [r,c] = size(files(j).name);
                     ext = files(j).name(c-3:c);
                     if strcmp(ext, '.wav') == 1   % process only .wav files
                         client_sample = client_sample + 1;
                         [s1, fs1] = wavread([char(dirstr_speaker), '/', files(j).name]);

                        % Calculate the mel cepstrum of the signal using the melcepst function from voicebox
                        client_features = mfcc(s1,fs1);

                        % Score the features against the background model
                        [BGlYM,BGlY] = lmultigauss(client_features,mu_BG(:,:),sigma_BG(:,:),c_BG(:,:));
                        BG_client_score(:,client_sample) = mean(BGlY);

                        for kk = 1:Num_training
                            [lYM,lY] = lmultigauss(client_features,mu_train(:,:,kk),sigma_train(:,:,kk),c_train(:,:,kk));
                            lY1(kk,client_sample) = mean(lY);
                            client_score(kk,client_sample) = lY1(kk,client_sample) - BG_client_score(:,client_sample);

                                    end
                              end
                        end
                  end
            end
      end
end

for jj = 1:client_sample
    if ( (max(client_score(:,jj))) > 4.08 )
        correct_no = correct_no + 1;
    end
end

frr_eva = ((client_sample - correct_no) / (client_sample)) *100;
frr_eva

save Client_eva_T30.mat client_score frr_eva;





B.4     Speaker Impostor Evaluation Algorithm
% Speaker Impostor Evaluation Program

clear   all;
close   all;
clear   dirs;
clear   files;
clear   impo_score;

Num_training= 40;       % no of subjects in the client set
S_evaluation=2;         % no of samples per subject in the client evaluation set
S_training=8;           % no of samples per subject in the client training set
n = Num_training*S_training;
imposter_sample = 0;
C_imposter=40;          % no of subjects in the impostor evaluation set
%S_imposter=10;         % no of samples per subject in the impostor evaluation/testing set
wrong_accept= 0;
correct_num = 0;

load Train_purposeT30.mat mu_train sigma_train c_train;
load BG_modelT30.mat mu_BG sigma_BG c_BG;


chdir_impo = {'61','21','63','64','22','66','67','68','69','23','71','72',...
              '24','25','75','27','77','78','79','29','81','82','83','30',...
              '85','86','31','88','89','90','91','92','93','94','95','96',...
              '97','98','99','100'};

for idir = 1:C_imposter
    % get the directory
    dirstr_impo = strcat('C:\FYP\3_Noise_Test_data\Test30\Imposter\',chdir_impo(idir));

      [rows, cols] = size(dirstr_impo);
      if rows == 0 & cols == 0
          return;
      end

    dirs = dir(char(dirstr_impo));
    [dirrows, dircols] = size(dirs);
    for i = 1:dirrows
        if strcmp(dirs(i).name, '.') == 0 & strcmp(dirs(i).name, '..') == 0
            files = dir([char(dirstr_impo), '/', dirs(i).name]);
            [rows, cols] = size(files);
            for j = 1:rows
                if strcmp(files(j).name,'.')==0 & strcmp(files(j).name,'..')==0
                    [r,c] = size(files(j).name);
                    ext = files(j).name(c-3:c);




                        if strcmp(ext, '.wav') == 1   % process only .wav files
                             imposter_sample = imposter_sample + 1;
                             [s1, fs1] = wavread([char(dirstr_impo), '/', files(j).name]);

                        % Calculate the mel cepstrum of the signal using the melcepst function from voicebox
                        impo_features = mfcc(s1,fs1);
                        [BGYM,BGY] = lmultigauss(impo_features,mu_BG(:,:),sigma_BG(:,:),c_BG(:,:));
                        BG_impo_score(:,imposter_sample) = mean(BGY);

                        for kk = 1:Num_training
                            [YM,Y] = lmultigauss(impo_features,mu_train(:,:,kk),sigma_train(:,:,kk),c_train(:,:,kk));
                            Y1(kk,imposter_sample) = mean(Y);
                            impo_score(kk,imposter_sample) = Y1(kk,imposter_sample) - BG_impo_score(:,imposter_sample);
                        end
                    end
                end
            end
        end
    end
end

for jj = 1:imposter_sample
    if ((max(impo_score(:,jj)))> 4.08)
        wrong_accept = wrong_accept + 1;
    end
end

far_eva = wrong_accept/(imposter_sample)*100;
far_eva

save Imposter_eva_T30.mat impo_score far_eva;





B.5    ROC and EER Algorithm
%Plotting ROC and EER Program

clear all;
close all;

load Client_eva_T30.mat client_score frr_eva;
load Imposter_eva_T30.mat impo_score far_eva;

Correct_no=0;
wrong_accept=0;
Num_training= 40;
Num_client = 40;
Num_impostor = 40;
client_sample=40*2;
imposter_sample=40*2;

for kk = 1 : client_sample
    if (max(client_score(:,kk)) > 4.08)
        Correct_no = Correct_no + 1;
    end
end

for ii = 1: imposter_sample
    if (max(impo_score(:,ii)) > 4.08)
        wrong_accept = wrong_accept + 1;
    end
end

SPR_frr = (client_sample - Correct_no) / (client_sample) *100;
SPR_frr
SPR_far = wrong_accept/(imposter_sample)*100;
SPR_far

%Draw ROC
C_S = max(client_score);
I_S = max(impo_score);
C_S1=max(C_S);
I_S1=min(I_S);
l1 = length(C_S); l2 = length(I_S);
tt = [I_S1:0.02:C_S1];          % sweep thresholds from min(I_S) to max(C_S) in steps of 0.02
m=length(tt);
fac1=zeros(1,m); frjt=zeros(1,m);
for i=1:m
    for j=1:l2
        if I_S(j)> tt(i)
            fac1(i)=fac1(i)+1;
        end
    end
    for k=1:l1
        if C_S(k)< tt(i)
            frjt(i)=frjt(i)+1;
        end
    end
end
Speaker_fac=100*(fac1/(l2));



Speaker_frjt=100*(frjt/l1);
Speaker_gar = 100 - Speaker_frjt;

figure (1)
plot (tt,(Speaker_fac),'r',tt,(Speaker_frjt),'b');
grid on, zoom on;
legend('FAR','FRR');
xlabel('Threshold')
ylabel('FAR / FRR (%)')
title('Speaker Verification System (Threshold vs FAR & FRR)')

figure (2)
subplot(2,1,1)
plot(Speaker_fac,Speaker_gar,'-b');
xlabel('False acceptance rate (%)')
ylabel('Genuine Acceptance rate (%)')
title('Receiver Operating Characteristic')
subplot(2,1,2)
plot(Speaker_fac,Speaker_frjt,'-r');
xlabel('False Acceptance Rate (%)')
ylabel('False Rejection Rate (%)')

save roc_T30.mat Speaker_fac Speaker_frjt Speaker_gar;
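The sweep above produces FAR and FRR curves but stops short of extracting the EER itself; the EER sits where the two curves cross. An illustrative Python sketch of reading it off raw client and impostor score arrays (all names and the synthetic score distributions are my own, not from the report):

```python
import numpy as np

def compute_eer(client_scores, impostor_scores, step=0.02):
    """Sweep a decision threshold and return (EER %, threshold) where FAR ~ FRR."""
    thresholds = np.arange(impostor_scores.min(), client_scores.max(), step)
    far = np.array([(impostor_scores > t).mean() * 100 for t in thresholds])
    frr = np.array([(client_scores < t).mean() * 100 for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))   # closest crossing of the two curves
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Synthetic, well-separated score distributions (toy numbers)
rng = np.random.default_rng(0)
clients = rng.normal(6.0, 1.0, 80)        # genuine scores (e.g. 40 speakers x 2)
impostors = rng.normal(2.0, 1.0, 80)      # impostor scores
eer, thr = compute_eer(clients, impostors)
print(round(eer, 1), round(thr, 2))       # EER of a few percent, threshold near 4
```

Taking the midpoint of FAR and FRR at the nearest crossing is a common convention when the discrete sweep does not land exactly on the point where they are equal.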





                                           Glossary

 Biometrics
    A general term used alternatively to describe a characteristic or a process.
    As a characteristic: A measurable biological (anatomical and physiological) and
    behavioral characteristic that can be used for automated recognition.
    As a process: Automated methods of recognizing an individual based on measurable
    biological and behavioral characteristics.


 Authentication
    It is the process of confirming the correctness of the claimed identity.


 Template
    A digital representation of an individual's distinct characteristics, representing
    information extracted from a biometric sample.


 Enrollment
    The process of collecting a biometric sample from an end user, converting it into a
    biometric reference and storing it in the biometric system's database for later comparison.


 Extraction
    It is the process of converting a captured biometric sample into biometric data so that it
    can be compared to a reference.


 Features
    Distinctive mathematical characteristics derived from a biometric sample; used to
    generate a reference.


 Database
    A collection of one or more computer files. For biometric systems, these files could
    include biometric sensor readings, templates and related end user information.






 Threshold
    It is a user setting for biometric systems operating in the verification or identification
    tasks. The acceptance or rejection of biometric data depends on the match score
    falling above or below the threshold. The threshold is adjustable so that the biometric
    system can be more or less strict, depending on the requirements of any given biometric
    application.


 Decision
    The resultant action taken based on a comparison of a similarity score and the system's
    threshold.


 False Acceptance Rate
    A statistic used to measure biometric performance when operating in the verification task.
    The percentage of times a system produces a false accept, which occurs when an
    individual is incorrectly matched to another individual's existing biometric.


 False Rejection Rate
    The percentage of times a system produces a false reject, which occurs when an
    individual is not matched to his or her own existing biometric template.


 Receiver Operating Characteristics
    A method of showing measured accuracy performance of a biometric system. A
    verification ROC compares false acceptance rate versus genuine acceptance rate.


 Equal Error Rate
    A statistic used to show biometric performance, typically when operating in the
    verification task. The EER is located where the false acceptance rate and false
    rejection rate are equal.




                                 ~~~~~~~~END OF THE REPORT~~~~~~~~

