Speaker recognition 1 by harish1991


									               Fundamentals Behind “Speaker Recognition”
       A human identity recognition system based on voice analysis could
have countless applications. Automatic Speaker Recognition (ASR) is one
such system.

       Automatic Speaker Recognition is a system that can recognize a person
based on his/her voice. This is achieved by implementing complex signal
processing algorithms that run on a digital computer or a processor.

This application is analogous to the fingerprint recognition system or other
biometric recognition systems that are based on certain characteristics of a
person.

There are several occasions when we want to identify a person from a given
group of people even when the person is not present for physical examination.
For example, when a person converses on a telephone, all we have is the
person's voice for analysis. It then makes sense to develop a recognition system
based on voice.

Speaker recognition has typically been classified as either a verification or an
identification task. Speaker verification is usually the simpler of the two since it
involves comparing the input signal with a single stored reference pattern.
The verification task therefore only requires the system to verify whether the
speaker is the person he/she claims to be. Speaker identification is more
complex because the test speaker must be compared against a number of
reference speakers to determine whether a match can be made. The input signal
must not only be examined to see whether it came from a known speaker; the
individual speaker must also be identified.

The identification of speakers remains a difficult task for a number of reasons.
First, acquiring a consistent speech signal is hampered by variation in the
voice input from a speaker and by environmental factors. Both the volume and
pace of speech can vary from one test to another. Also, unless initially
constrained, an extensive vocabulary or unstructured grammar can affect
results. Background noise must also be kept to a minimum so that a changing
environment does not distract the speaker or alter the final voicing of a word or
sentence. As a result, many restrictions and clarifications have been placed on
speaker and speech recognition systems.

One such restriction involves using a closed set for speaker recognition. A closed
set implies that only speakers within the original stored set will be asked to be
identified. An open set would allow the extra possibility of a test speaker not
coming from the initially trained set of speakers, thereby requiring the system to
recognize the speaker as not belonging to the original set. An open set system
may also have the task of learning a new speaker and placing him or her within
the original set for future reference.
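
The difference between a closed set and an open set can be sketched in a few lines of Python. This is only an illustration: the speaker names, the two-dimensional feature vectors, the Euclidean distance and the rejection threshold are all assumptions made for the example, not values from a real system.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(test_vector, references, reject_threshold=None):
    """Return the name of the closest enrolled speaker.

    Closed set: reject_threshold is None, so an enrolled speaker is always returned.
    Open set: if even the best distance exceeds reject_threshold, return None
    to signal that the speaker does not belong to the enrolled set.
    """
    best_name = min(references, key=lambda name: euclidean(test_vector, references[name]))
    best_distance = euclidean(test_vector, references[best_name])
    if reject_threshold is not None and best_distance > reject_threshold:
        return None
    return best_name

# Hypothetical reference vectors for two enrolled speakers.
references = {"alice": [1.0, 2.0], "bob": [4.0, 6.0]}
stranger = [10.0, 10.0]

closed_result = identify(stranger, references)                       # forced choice
open_result = identify(stranger, references, reject_threshold=2.0)   # may reject
```

In the closed-set call the unknown speaker is forced onto the nearest enrolled speaker, while the open-set call can answer that nobody matches.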

Another common restriction involves using a text-dependent speaker recognition
system. This type of system requires the speaker to utter a specific word or
phrase, which is compared against the original set of like phrases. Text-independent
recognition, which for most cases is more complex and difficult to perform,
identifies the speaker regardless of the text or phrase spoken.

Once an utterance, or signal, has been recorded, it is usually necessary to
process it to get the voiced signal in a form that makes classification and
recognition possible. Various methods have included the use of power spectrum
values, cepstrum coefficients, linear predictive coding, and a nearest neighbour
distance algorithm. Tests have also shown that although cepstrum coefficients
and linear predictive coding have given better results for conventional template
and statistical classification methods, power spectrum values have performed
better when using neural networks during the final recognition stages.

Various methods have also been used to perform the classification and
recognition of the processed speech signal. Statistical methods utilizing Hidden
Markov Models, linear vector quantifiers, or classical techniques such as
template matching have produced encouraging, yet limited, success. Recent
deployments using neural networks, while producing varied success rates, have
offered more options regarding the types of inputs sent to the networks, as well
as the ability to learn speakers in both an offline and an online manner.
Although back-propagation networks have traditionally been used, more
sophisticated networks, such as an ART 2 network, have also been implemented.

ASR can be broadly classified into four types:

              1.      Text-independent identification
              2.      Text-independent verification
              3.      Text-dependent identification
              4.      Text-dependent verification

Speaker identification is a procedure by which a speaker is identified from a
group of 'n' people. It should be noted that a totally new speaker not belonging to
the group could wrongly be identified as someone from within the group.

       Speaker verification is a procedure by which the identity claimed by a
speaker is accepted or rejected.

A fundamental requirement for any ASR system is gathering reference samples
and finding certain features of the voice that are characteristic of a person.
These feature vectors are then stored. When a new test sample is made
available, the references are either searched to find the closest match (in the
case of identification), or a distance measure is checked against a threshold (in
the case of verification).

The next aspect to be considered is text dependency. In a text-independent
situation, the reference utterance and the test utterance are not the same. This
type of recognition system finds its applications in criminology. In a
text-dependent situation, the reference utterance and the test utterance are the
same, which gives us a higher degree of accuracy. This type of recognition system
has applications where security is a matter of concern, such as access to a
building, a lab, a computer, etc.
                                      System Configuration
Figures 1 and 2 show the identification system and the verification system
configuration, respectively.

The first part of the system consists of the data acquisition hardware that
acquires the speech, performs some signal conditioning, digitizes it and gives it
to the computer/processor.

The second part consists of core signal processing and system identification
techniques to extract speaker specific features. These features are stored and
are used at a later time for the actual identification/verification test. At this stage,
the system is ready for identification or verification.

Now, when the test sample is uttered by one of the members of the group, the
speech is digitized and the features are extracted. For identification, the distances
between this test vector and all the reference vectors are measured and the closest
reference vector is picked as the match. This vector corresponds to the person
whom the system claims to have identified. For verification, the person
claims his/her identity. The distance between the corresponding reference vector
and the test vector is then computed. If the measured distance is less than a set
threshold, the verification system accepts the speaker; if not, it rejects the
speaker.

[Figure 1 block diagram; surviving block labels: voice sample, ADC with signal
conditioning, algorithm to select, measurement of distance, reference vectors,
decision making, identity of person]

Figure 1: Speaker Identification

[Figure 2 block diagram; surviving block labels: claims his/her, conditioning,
features, the speaker, threshold comparison, person verified or not]

Figure 2: Speaker Verification
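
The verification decision described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the stored reference vectors, the Euclidean distance measure and the threshold value are hypothetical and chosen only to make the accept/reject rule concrete.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(claimed_identity, test_vector, references, threshold):
    """Accept the claim only if the test vector lies within the threshold
    of the reference vector stored for the claimed identity."""
    reference = references[claimed_identity]
    return distance(test_vector, reference) < threshold

# Hypothetical stored reference vectors.
references = {"alice": [1.0, 2.0], "bob": [4.0, 6.0]}

genuine = verify("alice", [1.2, 1.9], references, threshold=1.0)    # close to alice
impostor = verify("alice", [3.9, 6.2], references, threshold=1.0)   # sounds like bob
```

Only one comparison is needed per test, against the reference of the claimed identity, which is why verification is the lighter of the two tasks.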

                        How the System Works

The voice input to the microphone produces an analogue speech signal. An
analogue-to-digital converter (ADC) converts this speech signal into binary words
that are compatible with a digital computer. The converted binary version is then
stored in the system and compared with previously stored binary representations
of words and phrases.

The computer searches the previously stored speech patterns and compares the
current input speech with them one at a time. When a match occurs,
recognition is achieved. The spoken word, in binary form, is written on a video
screen or passed along to a natural language understanding processor for
additional analysis.

Since most recognition systems are speaker-dependent, it is necessary to train a
system to recognize the dialect of each new user. During training, the computer
displays a word and the user reads it aloud. The computer digitizes the user's
voice and stores it. The speaker has to read aloud about 1,000 words. Based on
these samples, the computer can predict how the user utters some words that
are likely to be pronounced differently by different people.

The block diagram of a speaker-dependent word recognizer is shown in Fig. 4.
The user speaks into the microphone, which converts the sound into an electrical
signal. The analogue electrical signal from the microphone is fed to an amplifier
provided with automatic gain control (AGC) to produce an amplified output signal
in a specific optimum voltage range, even when the input signal varies from
feeble to loud.
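
The effect of gain control can be illustrated with a very simplified Python sketch. A real AGC amplifier adapts its gain continuously in hardware; the function below is a stand-in that rescales one block of samples to a fixed peak, and the name apply_agc and the target peak of 1.0 are assumptions made for the example.

```python
def apply_agc(samples, target_peak=1.0):
    """Scale a block of samples so that its largest magnitude equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to amplify
    gain = target_peak / peak
    return [s * gain for s in samples]

feeble = [0.01, -0.02, 0.015]   # quiet speaker
loud = [2.0, -4.0, 3.0]         # loud speaker

feeble_out = apply_agc(feeble)
loud_out = apply_agc(loud)
```

Both outputs end up in the same range, so the later stages see a comparable signal level regardless of how loudly the user spoke.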

The analogue signal, representing a spoken word, contains many individual
frequencies of various amplitudes and different phases, which when blended
together take the shape of a complex waveform as shown in Fig. 3. A set of filters
is used to break this complex input signal into its component parts. Bandpass
filters (BPF) pass frequencies only within a certain frequency range, rejecting all
other frequencies.

Generally, about sixteen filters are used; a simple system may contain a
minimum of three filters. The greater the number of filters used, the higher the
probability of accurate recognition.
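
The idea of splitting a signal into frequency bands can be sketched in software. Note that this is a substitute technique: instead of analogue or switched-capacitor bandpass filters, the sketch computes a discrete Fourier transform and sums the energy falling into each band. The sample rate, the test tone and the three band edges are assumptions chosen for illustration.

```python
import cmath
import math

def band_energies(samples, sample_rate, band_edges):
    """Return the signal energy in each (low_hz, high_hz) band.

    A plain DFT is evaluated bin by bin; each bin's energy is added to
    the band whose range contains the bin frequency.
    """
    n = len(samples)
    energies = [0.0] * len(band_edges)
    # Bins up to the Nyquist frequency cover all distinct frequencies of a real signal.
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for i, (low, high) in enumerate(band_edges):
            if low <= freq < high:
                energies[i] += abs(coeff) ** 2
    return energies

sample_rate = 8000   # assumed sampling rate in Hz
n = 400
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]  # 1 kHz tone
bands = [(0, 500), (500, 1500), (1500, 4000)]  # a three-filter bank
energies = band_energies(tone, sample_rate, bands)
```

Almost all of the energy of the 1 kHz tone lands in the middle band. With more and narrower bands, the energy pattern describes the sound in finer detail, which is the reason a larger filter bank helps recognition.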

Presently, switched capacitor digital filters are used because these can be
custom-built in integrated circuit form. These are smaller and cheaper than active
filters using operational amplifiers.

The filter output is then fed to the ADC to translate the analogue signal into a
digital word. The ADC samples the filter outputs many times a second. Each sample
represents a different amplitude of the signal as shown in Fig. 4.

Evenly spaced vertical lines represent the amplitude of the audio filter output at
the instant of sampling. Each value is then converted to a binary number
proportional to the amplitude of the sample. A central processor unit controls the
input circuits that are fed by the ADCs. A large RAM stores all the digital values
in a buffer area. This digital information, representing the spoken word, is now
accessed by the CPU to process it further.
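
Sampling and conversion to binary numbers can be sketched as follows. The 100 Hz test tone, the 800 Hz sampling rate and the 8-bit word length are assumptions chosen to keep the example small; they are not taken from a real recognizer.

```python
import math

def sample_and_quantize(signal, num_samples, sample_rate, bits):
    """Sample signal(t) at evenly spaced instants and return integer codes.

    The signal is assumed to lie in [-1, 1]; each sample is mapped onto
    the range 0 .. 2**bits - 1, proportional to its amplitude.
    """
    levels = 2 ** bits
    codes = []
    for i in range(num_samples):
        t = i / sample_rate
        value = max(-1.0, min(1.0, signal(t)))  # clip to the converter's input range
        codes.append(round((value + 1.0) / 2.0 * (levels - 1)))
    return codes

def tone(t):
    """A 100 Hz sine wave standing in for one filter output."""
    return math.sin(2 * math.pi * 100 * t)

codes = sample_and_quantize(tone, num_samples=8, sample_rate=800, bits=8)
binary_words = [f"{code:08b}" for code in codes]  # what would be stored in RAM
```

The positive peak of the wave maps to the largest code and the negative peak to zero; the list of binary words is the kind of data the buffer in RAM would hold.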

Normal speech has a frequency range of 200 Hz to 7 kHz. Recognizing speech
from a telephone call is more difficult as the channel has a bandwidth limitation
of 300 Hz to 3.3 kHz.

As explained earlier, the spoken words are processed by the filters and ADCs.
The binary representation of each of these words becomes a template or
standard, against which future words are compared. These templates are
stored in the memory. Once the storing process is completed, the system can go
into its active mode and is capable of identifying spoken words.

As each word is spoken, it is converted into binary equivalent and stored in RAM.
The computer then starts searching and compares the binary input pattern with
the templates.

It is to be noted that even if the same speaker speaks the same text, there are
always slight variations in amplitude or loudness of the signal, pitch, frequency
difference, time gap, etc. For this reason, there is never a perfect match
between the template and the binary input word. The pattern matching process
therefore uses statistical techniques and is designed to look for the best fit.

The values of the binary input word are subtracted from the corresponding values
in the templates. If both values are the same, the difference is zero and there is a
perfect match. If not, the subtraction produces some difference or error. The
smaller the error, the better the match. When the best match occurs, the word is
identified and displayed on the screen or used in some other manner.
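
The subtraction-based matching described above can be sketched in Python. The template values and the two-word vocabulary are invented for illustration, and the sketch assumes the input word and the templates already have the same length (that is, they are aligned in time).

```python
def match_error(input_word, template):
    """Sum of absolute differences between corresponding values."""
    return sum(abs(a - b) for a, b in zip(input_word, template))

def best_match(input_word, templates):
    """Return (word, error) for the template with the smallest error."""
    word = min(templates, key=lambda w: match_error(input_word, templates[w]))
    return word, match_error(input_word, templates[word])

# Hypothetical stored templates (digitized filter outputs for each word).
templates = {"stop": [12, 40, 33, 8], "start": [30, 22, 5, 41]}
spoken = [11, 42, 30, 9]  # a new utterance, never identical to a template

word, error = best_match(spoken, templates)
```

The error is not zero even for the right word, so the recognizer picks the smallest error rather than waiting for an exact match.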

The search process takes a considerable amount of time as the CPU has to
make many comparisons before recognition occurs. This necessitates use of
very high-speed processors. A large RAM is also required because, even though a
spoken word may last only a few hundred milliseconds, it is
translated into many thousands of digital words.

It is important to note that words and templates must be aligned correctly
in time before the similarity score is computed. This process, termed
dynamic time warping, recognizes that different speakers pronounce the same
words at different speeds and elongate different parts of the same word.
This is important for speaker-independent recognizers.
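
The time-alignment step can be sketched with the textbook dynamic time warping recurrence. The sketch works on one-dimensional sequences with an absolute-difference cost; the sequences are made-up values, and a real recognizer would align multi-dimensional feature frames instead.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j]; each step
    may advance in a, in b, or in both, which lets one sequence stretch.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

template = [1, 3, 4, 3, 1]
slow = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]   # same shape, spoken at half speed
other = [4, 1, 1, 4, 4]                 # a different word

same_word = dtw_distance(template, slow)
different_word = dtw_distance(template, other)
```

The slowly spoken version aligns with the template at zero cost even though it is twice as long, while the different word keeps a clearly larger distance.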

Continuous speech recognizers are far more difficult to build than word
recognizers. You can speak complete sentences to the computer. The input will
be recognized and, when processed by NLP, understood.

Such recognizers employ sophisticated, complex techniques to deal with
continuous speech, because when one speaks continuously, most of the words
slur together and it is difficult for the system to know where one word ends and
the other begins. Unlike word recognizers, the information spoken is not
recognized instantly by this system.

        Artificial Intelligence for Speech Recognition

When you dial the telephone number of a big company, you are likely to hear the
sonorous voice of a cultured lady who responds to your call with great courtesy
saying “Welcome to company X. Please give me the extension number you
want”. You pronounce the extension number, your name, and the name of
person you want to contact. If the called person accepts the call, the connection
is given quickly. This is artificial intelligence where an automatic call-handling
system is used without employing any telephone operator.

Artificial intelligence involves two basic ideas. First, it involves studying the
thought processes of human beings. Second, it deals with representing those
processes via machines (like computers, robots, etc.).

AI is the behaviour of a machine which, if performed by a human being, would be
called intelligent. It makes machines smarter and more useful, and is less
expensive than natural intelligence.

Natural language processing (NLP) refers to artificial intelligence methods of
communicating with a computer in a natural language like English. The main
objective of an NLP program is to understand input and initiate action.

The input words are scanned and matched against internally stored known
words. Identification of a key word causes some action to be taken. In this way,
one can communicate with the computer in one's language. No special
commands or computer language are required. There is no need to enter
programs in a special language for creating software.
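
The key-word scheme described above can be sketched as a simple lookup. The table of key words and the action names are hypothetical, and real NLP programs do far more than match single words.

```python
# Hypothetical table of known key words and the actions they trigger.
ACTIONS = {
    "stop": "halt_conveyor",
    "start": "run_conveyor",
    "status": "report_status",
}

def interpret(sentence):
    """Scan the words of a sentence and return the action for the first
    known key word, or None if no key word is found."""
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # ignore trailing punctuation
        if word in ACTIONS:
            return ACTIONS[word]
    return None

action = interpret("Please stop the conveyor now.")
no_action = interpret("Hello there")
```

The user speaks an ordinary sentence; only the recognised key word decides which action is taken.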

                        Speaker independency

The speech quality varies from person to person. It is therefore difficult to build
an electronic system that recognises everyone's voice. By limiting the system to
the voice of a single person, the system becomes not only simpler but also more
reliable. The computer must be trained to the voice of that particular individual.
Such a system is called a speaker-dependent system.

Speaker-independent systems can be used by anybody, and can recognize any
voice, even though the characteristics vary widely from one speaker to another.
Most of these systems are costly and complex. Also, these have very limited

It is important to consider the environment in which the speech recognition
system has to work. The grammar used by the speaker and accepted by the
system, noise level, noise type, position of the microphone, and speed and
manner of the user's speech are some factors that may affect the quality of
speech recognition.

                          Environmental influence:-

Real applications demand that the performance of the recognition system be
unaffected by changes in the environment. However, it is a fact that when a
system is trained and tested under different conditions, the recognition rate drops
unacceptably. We need to be concerned about the variability present when
different microphones are used in training and testing, specifically during the
development of procedures. Such care can significantly improve the accuracy of
recognition systems that use desktop microphones.

Acoustical distortions can degrade the accuracy of recognition systems.
Obstacles to robustness include additive noise from machinery, competing
talkers, reverberation from surface reflections in a room, and spectral shaping by
microphones and the vocal tracts of individual speakers. These sources of
distortions fall into two complementary classes: additive noise and distortions
resulting from the convolution of the speech signal with an unknown linear

A number of algorithms for speech enhancement have been proposed. These
include the following:

       1.     Spectral subtraction of DFT coefficients
       2.     MMSE techniques to estimate the DFT coefficients of corrupted speech
       3.     Spectral equalization to compensate for convolutional distortions
       4.     Combined spectral subtraction and spectral equalization

Although relatively successful, all these methods depend on the assumption of
independence of the spectral estimates across frequencies. Improved
performance can be obtained with an MMSE estimator in which correlation among
frequencies is modeled explicitly.
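
The first method in the list, spectral subtraction, can be sketched as follows. This is one common power-domain variant, assumed here for illustration; the magnitude values and the noise estimate are invented, and in practice the noise spectrum would be estimated from speech-free stretches of the recording.

```python
import math

def spectral_subtraction(noisy_magnitudes, noise_magnitudes, floor=0.0):
    """Subtract an estimated noise power from the noisy power in each bin.

    Every frequency bin is treated independently of its neighbours, and
    negative results are clamped to the floor.
    """
    cleaned = []
    for noisy, noise in zip(noisy_magnitudes, noise_magnitudes):
        power = noisy ** 2 - noise ** 2
        cleaned.append(math.sqrt(max(power, floor)))
    return cleaned

noisy = [5.0, 1.0, 13.0]          # DFT magnitudes of the corrupted frame
noise_estimate = [3.0, 2.0, 5.0]  # assumed noise magnitudes per bin
cleaned = spectral_subtraction(noisy, noise_estimate)
```

Because each bin is processed on its own, the sketch also shows the independence assumption that the paragraph above points out as a limitation.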

                            Speaker-specific features:-

Speaker identity correlates with the physiological and behavioral characteristics
of the speaker. These characteristics are found both in the vocal tract and in
the voice source, as well as in the dynamic features spanning
several segments.

The most common short-term spectral measurements currently used are the
cepstral coefficients derived from Linear Predictive Coding (LPC) and their
regression coefficients. A spectral envelope reconstructed from a truncated set of
cepstral coefficients is much smoother than one reconstructed from LPC
coefficients. Therefore, it provides a more stable representation from one
repetition to another of a particular speaker's utterances. As for the regression
coefficients, typically the first- and second-order coefficients are extracted at every
frame period to represent the spectral dynamics.

These coefficients are derivatives of the time function of the cepstral coefficients
and are called the delta and delta-delta cepstral coefficients respectively.
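
The delta and delta-delta coefficients can be computed with a short regression over neighbouring frames. The sketch below uses a commonly seen regression formula with a window of two frames on each side and repeats the edge frames at the boundaries; the window width, the padding rule and the toy input are assumptions made for illustration.

```python
def delta(frames, width=2):
    """First-order regression (delta) coefficients for a list of frames.

    Each frame is a list of coefficients. For frame t and each coefficient,
    the slope is sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..width.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    norm = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, width + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]  # repeat last frame
                behind = frames[max(t - k, 0)][d]              # repeat first frame
                acc += k * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out

# Toy input: one coefficient per frame, rising by 1.0 every frame.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
deltas = delta(frames)          # first-order (delta) coefficients
delta_deltas = delta(deltas)    # second-order (delta-delta) coefficients
```

For the steadily rising input, the delta at the middle frame is 1.0 (the slope per frame) and the delta-delta there is zero, since the slope itself is not changing around that frame.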

[Figure 3 block diagram; surviving block labels: speaker, speech recognition
device, commands to computers, input to other CBISs, robots, expert systems,
NLP, understanding]

Figure 3

                              Applications

One of the main benefits of a speech recognition system is that it lets the user do
other work simultaneously. The user can concentrate on observation and manual
operations, and still control the machinery by voice input commands.

Consider a material-handling plant, where a number of conveyors are employed
to transport various grades of materials to different destinations. Nowadays, only
one operator is employed to run the plant. He has to keep a watch on various
meters, gauges, indication lights, analyzers, overload devices, etc. from the
central control panel. If something goes wrong, he has to run to physically
push the stop button. How convenient it would be if a conveyor or a number of
conveyors could be stopped automatically by simply saying "stop".

Another major application of speech processing is in military operations. Voice
control of weapons is an example. With reliable speech recognition equipment,
pilots can give commands and information to the computers by simply speaking
into their microphones - they don't have to use their hands for this purpose.

Another good example is a radiologist scanning hundreds of X-rays,
ultrasonograms and CT scans while simultaneously dictating conclusions to a speech
recognition system connected to word processors. The radiologist can focus his
attention on the images rather than writing the text.

Voice recognition could also be used on computers for making airline and hotel
reservations. A user simply needs to state his needs: to make a reservation,
cancel a reservation, or make enquiries about a schedule.

[Figure 4 block diagram; surviving block labels: BPF and ADC (four parallel
pairs), input circuit, RAM (digitised speech), search and pattern matching
program, CPU, output]

Figure 4: Speaker-dependent word recognizer

                              Conclusion

Speaker recognition technology has many uses. It helps physically challenged
skilled persons, who can do their work with it without pushing any buttons.
ASR technology is also used in military weapons and in research centres.
Nowadays this technology is also used by CID officers to trap criminals.

Contents
  1. INTRODUCTION
  3. CLASSIFICATION OF “ASR”
  4. SYSTEM CONFIGURATION
  5. HOW THE SYSTEM WORKS
  7. SPEAKER INDEPENDENCY
  8. ENVIRONMENTAL INFLUENCE
  9. SPEAKER SPECIFIC FEATURES
  10. APPLICATIONS
  11. CONCLUSION
  12. BIBLIOGRAPHY
