CHALLENGES IN AUTOMATIC SPEECH RECOGNITION

Abstract: Speech recognition is an important area of research because it offers an improved form of human-machine interaction. This paper gives an overview of the discovery and development of speech recognition systems, of research toward accurate speech recognition, and of the problems that need to be overcome to build accurate speech recognition systems. The disciplines involved in Automatic Speech Recognition, such as Signal Processing, Physics (acoustics), Pattern Recognition, Communication and Information Theory, Linguistics, Physiology, Computer Science and Psychology, are discussed briefly. The problems in Automatic Speech Recognition, namely recognition units, variability and ambiguity, are discussed in detail by classifying them into many subgroups. The classifications of the presently available Automatic Speech Recognition systems are discussed. The future challenges of Automatic Speech Recognition are discussed briefly.

Speech recognition continues to be a challenging field for researchers. The successful speech recognizer is free from the constraints of speakers, vocabularies, ambiguities and environment. A lot of effort has been made in this direction, but complete accuracy is still far from reach. The task is not an easy one due to the interdisciplinary nature of the problem.

1. INTRODUCTION.

Speech recognition got its first jump start at AT&T's Bell Labs in 1936, when researchers developed the first electronic speech synthesizer. More than three decades later, Threshold Technology introduced the first commercial speech recognition product in the early 1970s, the VIP100 system. While the system performed some speech recognition, it could only recognize a very limited number of discrete words from a specific user.

The problem with these early efforts was that they focused primarily on the end product of speech: the words people were attempting to generate. By the mid-1980s, speech recognition programmers were building on Chomsky's ideas of grammatical structure, and using more powerful hardware to implement statistical phoneme-chain recognition routines. It was not until the dramatic increase in processing power of the 1980s and 1990s that technology companies could seriously consider the possibility of speech recognition. Further, many of the major companies collaborated to create application-programming standards. This millennium has brought a wide array of new speech recognition products and research areas.

2. AUTOMATIC SPEECH RECOGNITION.

For decades, people have been dreaming of an "intelligent machine" which can master natural speech. Automatic Speech Recognition (ASR) is one subsystem of this "machine", the other subsystem being Speech Understanding (SU). The goal of ASR is to transcribe natural speech, while the goal of SU is to understand the meaning of the transcription.

Automatic Speech Recognition means different things to different people. At one end of the spectrum is the voice-operated alarm clock which ceases ringing when the word 'stop' is shouted at it, and at the other end is the automatic dictating machine which produces a typed manuscript in response to the human voice, or the expert system which provides answers to spoken questions. A practical speech recognizer falls somewhere between these two extremes.

3. DISCIPLINES INVOLVED IN ASR.

Automatic recognition of speech has been a goal of research for more than four decades. Its interdisciplinary nature adds to the difficulty of performing research in this area, and most researchers tend to apply a monolithic approach.

3.1. Signal Processing.

The process of extracting relevant information from the speech signal in an efficient and robust manner.

3.2. Physics (acoustics).

The science of understanding the relationship between the physical speech signal and physiological mechanisms (the human vocal tract mechanism).

3.3. Pattern Recognition.

The set of algorithms used to cluster data to create one or more prototypical patterns and to match a pair of patterns.

3.4. Communication & Information Theory.

The procedures for estimating parameters of statistical models; the methods for detecting the presence of particular speech patterns; the set of modern coding and decoding algorithms.

3.5. Linguistics.

The relationship between sounds (phonology), words in a language (syntax), meaning of spoken words (semantics), and sense derived from meaning (pragmatics).

3.6. Physiology.

Understanding the higher order mechanisms within the human central nervous system that account for speech production and perception in human beings.

3.7. Computer Science.

The study of efficient algorithms for implementing ASR in software or hardware.

3.8. Psychology.

The science of understanding the factors that enable the technology to be used by human beings in practical tasks.

3.9. Problems of ASR.

Significant progress in ASR has been achieved by increasing the awareness of the problems of speech recognition and by applying various techniques to attempt to solve these problems.

3.10. Recognition units.

The first step in solving the recognition problem is to decide which units to recognize. Possible candidates are words, syllables, demisyllables, diphones and phonemes.

4. THE WORDS.

The word is a basic unit in speech recognition. The meaning of an utterance can only be deduced after the words of which it is composed have been recognized. The word is so basic that a whole class of speech recognizers, discrete utterance or isolated word recognizers, has been designed for the sole purpose of identifying spoken words. However, these devices require each word to be spoken in isolation; this is not the way in which sentences are normally produced.

One of the problems of employing the word as the recognition unit is the number of words in the language. This is of the order of 100,000. For recognition to take place some representation of each word needs to be stored. This implies that a large, though not impossible, amount of storage is required. Nevertheless there are a number of domains of man-machine interaction where the vocabulary can be restricted to a much smaller number of words. In these situations the word is often employed as the recognition unit.

Another problem encountered in using words as the recognition units is in determining where one word ends and the next begins. There is often no acoustic evidence of word boundaries. In fact, in the pronunciation of continuous speech, co-articulation effects take place across word boundaries, altering the acoustic manifestation of each word and obscuring the boundaries.

4.1. The syllables.

Instead of using words as the recognition units, smaller units such as the syllable may be considered. The syllable is attractive as it has a fixed structure. It consists of an initial consonant or consonant cluster, a medial vowel or diphthong, and a final consonant or consonant cluster, CiVCf. The vowel is obligatory but the consonants are optional. The intensity of every consonant is less than that of the vowels, so an algorithm can be devised for segmenting the speech stream into syllables. Problems arise, however, with strings of consonants, as it is often difficult to decide whether a consonant is part of the final consonant cluster of the last syllable or part of the initial
consonant cluster of the next syllable.

4.2. The demisyllables.

A very significant reduction in the number of units can be achieved by employing demisyllables instead of syllables. A demisyllable consists of half a syllable, from the beginning of the syllable to the middle of the vowel, CiV, or from the middle of the vowel to the end of the syllable, VCf. A syllable can be segmented into demisyllables by splitting it at the point of maximum intensity.

4.3. The diphones.

Another possible recognition unit is the diphone. Diphones have been found useful as the unit for speech synthesis, but the number required, 1000-2000, is similar to that of demisyllables. The problem of segmentation, deciding where one ends and the next begins, however, is much more difficult with diphones than with demisyllables.

4.4. The phoneme.

The other possible recognition unit is the phoneme. The advantage of the phoneme is the small number (approx. 40-60 phonemes). However, phonemes have a number of contextual variations known as allophones, and there are some 100-200 of these. Even so, the small numbers involved make the phoneme, or phone, an attractive recognition unit.

The problem with phoneme recognition units is segmentation. Co-articulation effects modify the acoustic manifestation of each phoneme. Except in certain cases where a voiced phoneme is followed by a voiceless one, or vice versa, it is impossible to tell where one phoneme ends and the next begins.

4.5. Variability.

There are a great number of factors which cause variability in speech. These include the speaker, the context, the speaking rate, the environment (extraneous sound and vibration) and the transducer employed.

4.6. The speaker.

The speech signal is very dependent on the physical characteristics of the speaker. The size of the vocal tract increases during childhood, and this gives rise to different formant frequencies for productions of the same vowel at different ages. The vocal tracts of men are, on average, about 30% longer than those of women. This again gives rise to different formant frequencies. Age and sex of the speaker cause great variations in the fundamental frequencies of speech sounds.

People from different parts of the country and different social and economic backgrounds speak with different dialects. This variability is even greater with people speaking a second language. The competence with which it is spoken depends on the motivation, intelligence, and perceptual and motor skills of the speaker, and also on the age at which it was learned.

Even the same speaker uttering the same words on different occasions shows some variability. When a person first encounters a speech recognizer he will be in an unfamiliar situation. However, it is on this occasion that the machine will be trained to recognize his voice. At subsequent meetings he will be more relaxed, so he will address the machine in a less formal manner. This may cause the recognition accuracy to decline.

5. THE CONTEXT.

The production of each word still exhibits variability, even when a familiar situation has been reached. Co-articulation effects cause each word to be pronounced differently depending upon context. The articulators anticipate the beginning of the next word whilst the end of the present word is still being produced. Words are also pronounced differently depending on their position in the sentence and their degree of stress.

5.1. The speaking rate.

Another source of variability is speaking rate. The tempo of speech varies widely depending upon the situation, the topic being discussed, and the emotional state of the speaker. Unfortunately the duration of all sounds in fast speech is not reduced proportionally compared with their duration in slow speech. In fast speech, pauses are eliminated and steady sounds, such as vowels, are compressed, whilst the duration of some consonants remains almost constant.

The amplitude of speech signals depends on the amount of vocal effort employed and the distance of the microphone from the mouth. The vocal effort affects the shape of the glottal pulse, and thus the intensity and frequency of the speech signal. The distance between the microphone and the mouth can vary with a hand-held microphone, but can be kept approximately constant by means of a microphone on a boom attached to a headset.

5.2. The environment.

The background sound in many circumstances is an uncontrollable variable. If this sound is constant and always present, such as the hum from the cooling fan in a computer, the level can be measured and its effect subtracted from the speech signals. If the background noise level is variable, it is important that the signal should be made as high as possible. This is usually achieved by holding the microphone close to the mouth, and by using a directional, noise-cancelling microphone.

5.3. The reverberation.

The speech signal may be distorted by reverberation. As well as the direct path from the mouth to the microphone, there will be other acoustic paths due to reflections from objects such as walls and furniture. These paths will be longer than the direct path, and so will add delayed and distorted versions of the signal to the original. Working in an anechoic chamber could eliminate reverberation, but this is not usually practical. It should be noted that the introduction of extra items of equipment or the presence of other bodies might distort the signal.

5.4. The transducer.

The transducer used for converting the acoustic signal into an electrical signal may introduce distortion. If the same microphone is always used, this will not cause variability. If different microphones are used, however, as will be the case when a speech recognizer is used via the telephone system, the characteristics of the different microphones and their associated transmission channels will introduce variability.

5.5. Ambiguity.

A further problem for a speech recognizer is that of ambiguity. This becomes important when the system is required to perform some action as a result of the signals which it has received.

5.6. Homophones.

There are a number of words which have different spellings and meanings, but which nevertheless sound alike. For example, consider the words 'to', 'too' and 'two'. In applications such as a speech-driven word processor, homophones present problems. These problems cannot be resolved at the acoustic or phonetic levels. Recourse must be had to higher levels of linguistic analysis.

5.7. Overlapping classes.

The first and second formant frequencies (F1 and F2) are able to identify most vowels. However, if we plot a graph of F1 versus F2 for a large number of vowels spoken by a variety of speakers, the points plotted do not form separate areas. For example, the vowel /a/ spoken by one person may have identical formant frequencies to a vowel /ɜ/ spoken by another.

5.8. Syntactic ambiguity.

Even if the phoneme sequence can be recognized and correctly segmented into words, there may still be ambiguity of meaning until all the words are grouped into appropriate syntactic units.

5.9. Word boundaries.

Another problem of ambiguity concerns the location of word boundaries. Occasionally a sequence of phonemes occurs which has one interpretation with the word boundary inserted at one location, and another meaning with it inserted at another location. This may involve shifting the boundary by a single phoneme, such as /greɪteɪp/, which may be interpreted as 'grey tape' or 'great ape', or it may mean moving the word boundary by a whole syllable; for example, /laɪthaʊskiːpə/ may mean 'light housekeeper' or 'lighthouse keeper'.

6. CLASSIFICATION OF ASR SYSTEMS.

6.1. Vocabulary Size Restrictions.

Small: 100-300 words
Medium: 1000 words
Large: 10k-50k words

6.2. Speaking Style Restrictions.

Connected Word
Continuous Word

6.3. Speaker Dependence.

Speaker-dependent
Speaker-independent (Closed-speaker)
Speaker-independent

7. RELEVANT ISSUES OF ASR DESIGN.

Environment: Type of noise; signal/noise ratio; working conditions
Transducer: Microphone; telephone
Channel: Band amplitude; distortion; echo
Speakers: Speaker-dependence, speaker-independence; sex; age; physical state
Speech Styles: Voice tone (quiet, normal, shout); production (isolated words, continuous speech, read or spontaneous speech); speed (slow, ...)
Vocabulary: Characteristics of available training data; specific or generic vocabulary

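
One consequence of the word-boundary ambiguity discussed in Section 5.9 is that a recognizer working from a phoneme sequence has to consider every way of dividing that sequence into words. The sketch below illustrates this with the 'grey tape' / 'great ape' example. The lexicon, the ARPAbet-style phoneme symbols and the function name are assumptions made for illustration only; they are not taken from any particular recognizer.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: phoneme tuples (ARPAbet-style symbols chosen
# for illustration) mapped to the word they pronounce.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("g", "r", "ey"): "grey",
    ("g", "r", "ey", "t"): "great",
    ("t", "ey", "p"): "tape",
    ("ey", "p"): "ape",
}


def segmentations(phonemes: Tuple[str, ...]) -> List[List[str]]:
    """Return every way of splitting `phonemes` into lexicon words."""
    if not phonemes:
        return [[]]  # exactly one way to segment nothing
    results: List[List[str]] = []
    # Try each prefix as a candidate first word, then segment the rest.
    for end in range(1, len(phonemes) + 1):
        word = LEXICON.get(phonemes[:end])
        if word is not None:
            for rest in segmentations(phonemes[end:]):
                results.append([word] + rest)
    return results


readings = segmentations(("g", "r", "ey", "t", "ey", "p"))
for reading in readings:
    print(" ".join(reading))
```

Both readings are equally valid at the phonetic level, so choosing between them needs the higher levels of linguistic analysis mentioned in Section 5.6.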
8. THE FUTURE CHALLENGES OF ASR.
Looking ahead, the ultimate challenge for
designers is to develop a system that can match the
ability of humans to recognize languages. According
to Microsoft's Whisper project on Speech research,
their goal is to "develop a general purpose, speaker-
independent continuous speech recognition engine
that can recognize unrestricted text and is effective
for command and control, dictation, and
conversational systems." While this ambitious goal
appears lofty, it may not be that far away.
Further research will continue to improve statistical
models for analyzing speech, not only by improving
mathematical algorithms that adapt to the unique style
of the speaker, but also by better control of the varying
environments from which a speaker might use the
application. Researchers are expending a fair amount
of resources to strengthen the underlying statistics
behind their language models. However, speech
recognition devices will never proliferate in the
mainstream market unless they are able to offer better
control of outside noise (e.g. in an office or in a car).
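
As a simple illustration of controlling outside noise, Section 5.2 noted that a constant background level can be measured and its effect subtracted. The following is a minimal sketch of that idea applied to frame power, assuming the noise is steady and uncorrelated with the speech so that their powers add. The function names and the sample values are hypothetical, and practical systems typically work on the spectrum rather than on a single power figure.

```python
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average of the squared sample values (signal power)."""
    return sum(s * s for s in samples) / len(samples)


def subtract_noise_level(
    signal: Sequence[float], noise_only: Sequence[float], frame_len: int
) -> List[float]:
    """Per-frame power of `signal` with the measured noise power removed."""
    noise_level = mean_power(noise_only)  # measured while nobody is speaking
    cleaned: List[float] = []
    # Step through the signal in non-overlapping frames of frame_len samples.
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame_power = mean_power(signal[start:start + frame_len])
        # Clamp at zero: a power estimate cannot be negative.
        cleaned.append(max(frame_power - noise_level, 0.0))
    return cleaned


# Hypothetical data: a steady hum, then a recording holding one frame of
# hum alone followed by one much louder frame.
hum = [0.1, -0.1] * 4
recording = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
cleaned = subtract_noise_level(recording, hum, frame_len=4)
```

After subtraction the hum-only frame drops to roughly zero power while the loud frame is barely changed. This only works when the noise level is constant; a variable background calls for the microphone measures described in Section 5.2.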
This paper has discussed the problems in
building accurate and robust speech recognition
systems. We categorize these problems as
recognition units (phonemes, syllables, diphones
etc.), variability (speaker, context, environment
etc.) and ambiguity (homophones, word boundaries
etc.). The basic problem is the paradox that speech
consists of a continuous stream of sound with no
obvious discontinuities at the boundaries between
the words, and yet speech is perceived as a sequence
of words. It is almost impossible to predict
accurately the rate of progress in any scientific
field. However, based on the progress made over
the past decade, it seems reasonable to make some
broad projections as to where speech recognition is
headed in the next decade.
 Digit Magazine