CHALLENGES IN AUTOMATIC SPEECH RECOGNITION



Abstract: Speech recognition is an important area of research because it
offers an improved form of human-machine interaction. This paper gives an
overview of the discovery and development of speech recognition systems,
the research toward accurate recognition, and the problems that need to be
overcome to build accurate speech recognition systems. The disciplines
involved in Automatic Speech Recognition, such as signal processing,
physics (acoustics), pattern recognition, communication and information
theory, linguistics, physiology, computer science and psychology, are
discussed briefly. The problems in Automatic Speech Recognition, namely
recognition units, variability and ambiguity, are discussed in detail by
classifying them into subgroups. The classifications of the presently
available Automatic Speech Recognition systems are discussed. The future
challenges of Automatic Speech Recognition are discussed briefly.

    Speech recognition continues to be a challenging field for researchers.
A successful speech recognizer would be free from the constraints of
speakers, vocabularies, ambiguities and environment. A great deal of effort
has been made in this direction, but complete accuracy is still far from
reach. The task is not an easy one due to the interdisciplinary nature of
the problem.

    1. INTRODUCTION.

    Speech recognition got its first jump-start at AT&T's Bell Labs in 1936,
when researchers developed the first electronic speech synthesizer. More
than three decades later, Threshold Technology introduced the first
commercial speech recognition product in the early 1970s, the VIP100
system. While the system performed some speech recognition, it could only
recognize a very limited number of discrete words from a specific user.

    The problem with these early efforts was that they focused primarily on
the end product of speech: the words people were attempting to generate. By
the mid-1980s, speech recognition programmers were building on Chomsky's
ideas of grammatical structure and using more powerful hardware to
implement statistical phoneme-chain recognition routines. It was not until
the dramatic increase in processing power of the 1980s and 1990s that
technology companies could seriously consider the possibility of speech
recognition. Further, many of the major companies collaborated to create
application-programming standards. This millennium has brought a wide array
of new speech recognition products and research areas.

    2. AUTOMATIC SPEECH RECOGNITION.

    For decades, people have been dreaming of an "intelligent machine" which
can master natural speech. Automatic Speech Recognition (ASR) is one
subsystem of this 'machine', the other subsystem being Speech Understanding
(SU). The goal of ASR is to transcribe natural speech, while the goal of SU
is to understand the meaning of the transcription.

    Automatic Speech Recognition means different things to different people.
At one end of the spectrum is the voice-operated alarm clock which ceases
ringing when the word 'stop' is shouted at it, and at the other end is the
automatic dictating machine which produces a typed manuscript in response
to the human voice, or the expert system which provides answers to spoken
questions. A practical speech recognizer falls somewhere between these two
extremes.

    3. DISCIPLINES INVOLVED IN ASR.

    Automatic recognition of speech has been a goal of research for more
than four decades. Its interdisciplinary nature adds to the difficulty of
performing research in this area. The tendency of most researchers is to
apply a monolithic approach to individual problems.


3.1. Signal Processing.

    The process of extracting relevant information from the speech signal in
an efficient and robust manner.

3.2. Physics (acoustics).

    The science of understanding the relationship between the physical
speech signal and physiological mechanisms (the human vocal tract
mechanism).

3.3. Pattern Recognition.

    The set of algorithms used to cluster data to create one or more
prototypical patterns and to match a pair of patterns.

3.4. Communication and Information Theory.

    The procedures for estimating parameters of statistical models; the
methods for detecting the presence of particular speech patterns; the set of
modern coding and decoding algorithms.

3.5. Linguistics.

    The relationship between sounds (phonology), words in a language
(syntax), meaning of spoken words (semantics), and sense derived from
meaning (pragmatics).

3.6. Physiology.

    Understanding the higher-order mechanisms within the human central
nervous system that account for speech production and perception in human
beings.

3.7. Computer Science.

    The study of efficient algorithms for implementing ASR in software or
hardware.

3.8. Psychology.

    The science of understanding the factors that enable the technology to
be used by human beings in practical tasks.

3.9. Problems of ASR.

    Significant progress in ASR has been achieved by increasing the
awareness of the problems of speech recognition and by applying various
techniques to attempt to solve these problems.

3.10. Recognition units.

    The first step in solving the recognition problem is to decide which
units to recognize. Possible candidates are words, syllables, diphones,
phonemes and distinctive features.

    4. THE WORDS.

    The word is a basic unit in speech recognition. The meaning of an
utterance can only be deduced after the words of which it is composed have
been recognized. The word is so basic that a whole class of speech
recognizers, discrete utterance or isolated word recognizers, has been
designed for the sole purpose of identifying spoken words. However, these
devices require each word to be spoken in isolation; this is not the way in
which sentences are normally produced.

    One of the problems of employing the word as the recognition unit is the
number of words in the language. This is of the order of 100,000. For
recognition to take place, some representation of each word needs to be
stored. This implies that a large, though not impossible, amount of storage
is required. Nevertheless, there are a number of domains of man-machine
interaction where the vocabulary can be restricted to a much smaller number
of words. In these situations the word is often employed as the recognition
unit.

    Another problem encountered in using words as the recognition units is
in determining where one word ends and the next begins. There is often no
acoustic evidence of word boundaries. In fact, in the pronunciation of
continuous speech, co-articulation effects take place across word
boundaries, altering the acoustic manifestation of each word and obscuring
the boundaries.

4.1. The syllables.

    Instead of using words as the recognition units, smaller units such as
the syllable may be considered. The syllable is attractive as it has a fixed
structure. It consists of an initial consonant or consonant cluster, a
medial vowel or diphthong, and a final consonant or consonant cluster,
CiVCf. The vowel is obligatory but the consonants are optional. The
intensity of every consonant is less than that of the vowels, so an
algorithm can be devised for segmenting the speech stream into syllables.
Problems arise, however, with strings of consonants, as it is often
difficult to decide whether a consonant is part of the final consonant
cluster of the last syllable or part of the initial
consonant cluster of the next syllable.

4.2. The demisyllables.

    A very significant reduction in the number of units can be achieved by
employing demisyllables instead of syllables. A demisyllable consists of
half a syllable: either from the beginning of the syllable to the middle of
the vowel, CiV, or from the middle of the vowel to the end of the syllable,
VCf. A syllable can be segmented into demisyllables by splitting it at the
point of maximum intensity.

4.3. The Diphones.

    Another possible recognition unit is the diphone. Diphones have been
found useful as the unit for speech synthesis, but the number required,
1000-2000, is similar to that of demisyllables. The problem of segmentation,
deciding where one ends and the next begins, is however much more difficult
with diphones than with demisyllables.

4.4. The phoneme.

    The other possible recognition unit is the phoneme. The advantage of the
phoneme is the small number required (approximately 40-60 phonemes).
However, phonemes have a number of contextual variations known as
allophones, and there are some 100-200 of these. Even so, the small numbers
involved make the phoneme, or phone, an attractive recognition unit.

    The problem with phoneme recognition units is segmentation.
Co-articulation effects modify the acoustic manifestation of each phoneme.
Except in certain cases where a voiced phoneme is followed by a voiceless
one, or vice versa, it is impossible to tell where one phoneme ends and the
next begins.

4.5. Variability.

    There are a great number of factors which cause variability in speech.
These include the speaker, the context, the speaking rate, the environment
(extraneous sound and vibration) and the transducer employed.

4.6. The speaker.

    The speech signal is very dependent on the physical characteristics of
the speaker. The size of the vocal tract increases during childhood, and
this gives rise to different formant frequencies for productions of the same
vowel at different ages. The vocal tracts of men are, on average, about 30%
longer than those of women. This again gives rise to different formant
frequencies. Age and sex of the speaker cause great variations in the
fundamental frequencies of speech sounds.

    People from different parts of the country and different social and
economic backgrounds speak with different dialects. This variability is even
greater with people speaking a second language. The competence with which it
is spoken depends on the motivation, intelligence, and perceptual and motor
skills of the speaker, and also on the age at which the language was
learned.

    Even the same speaker uttering the same words on different occasions
shows some variability. When a person first encounters a speech recognizer
he will be in an unfamiliar situation. However, it is on this occasion that
the machine will be trained to recognize his voice. At subsequent meetings
he will be more relaxed, so he will address the machine in a less formal
manner. This may cause the recognition accuracy to decline.

    5. THE CONTEXT.

    The production of each word still exhibits variability, even when a
familiar situation has been reached. Co-articulation effects cause each word
to be pronounced differently depending upon context. The articulators
anticipate the beginning of the next word whilst the end of the present word
is still being produced. Words are also pronounced differently depending on
their position in the sentence and their degree of stress.

5.1. The speaking rate.

    Another source of variability is speaking rate. The tempo of speech
varies widely depending upon the situation, the topic being discussed, and
the emotional state of the speaker. Unfortunately, the duration of all
sounds in fast speech is not reduced proportionally compared with their
duration in slow speech. In fast speech, pauses are eliminated and steady
sounds, such as vowels, are compressed, whilst the duration of some
consonants remains almost constant.

    The amplitude of speech signals depends on the amount of vocal effort
employed and the distance of the microphone from the mouth. The vocal effort
affects the shape of the glottal pulse, and thus the intensity and frequency
of the speech signal. The distance between the microphone and the mouth can
vary with a hand-held microphone, but can be kept approximately constant by
means of a microphone on a boom attached to a headset.

5.2. The environment.
    The background sound in many circumstances is an uncontrollable
variable. If this sound is constant and always present, such as the hum from
the cooling fan in a computer, the level can be measured and its effect
subtracted from the speech signals. If the background noise level is
variable, it is important that the signal should be made as high as possible
relative to the noise. This is usually achieved by holding the microphone
close to the mouth, and by using a directional, noise-canceling microphone.

5.3. The reverberation.

    The speech signal may be distorted by reverberation. As well as the
direct path from the mouth to the microphone, there will be other acoustic
paths due to reflections from objects such as walls and furniture. These
paths will be longer than the direct path, and so will add delayed and
distorted versions of the signal to the original. Working in an anechoic
chamber could eliminate reverberation, but this is not usually practical. It
should be noted that the introduction of extra items of equipment or the
presence of other bodies might distort the signal.

5.4. The transducer.

    The transducer used for converting the acoustic signal into an
electrical signal may introduce distortion. If the same microphone is always
used, this will not cause variability. If different microphones are used,
however, as will be the case when a speech recognizer is used via the
telephone system, the characteristics of the different microphones and their
associated transmission channels will introduce variability.

5.5. Ambiguity.

    A further problem for a speech recognizer is that of ambiguity. This
becomes important when the system is required to perform some action as a
result of the signals which it has received.

5.6. Homophones.

    There are a number of words which have different spellings and meanings
but which, nevertheless, sound alike. For example, consider the words 'to',
'too' and 'two'. In applications such as a speech-driven word processor,
homophones present problems. These problems cannot be resolved at the
acoustic or phonetic levels. Recourse must be had to higher levels of
linguistic analysis.

5.7. Overlapping classes.

    The first and second formant frequencies (F1 and F2) are able to
identify most vowels. However, if we plot a graph of F1 versus F2 for a
large number of vowels spoken by a variety of speakers, the points plotted
do not form separate areas. For example, the vowel /a/ spoken by one person
may have identical formant frequencies to a vowel /3/ spoken by another.

5.8. Syntactic ambiguity.

    Even if the phoneme sequence can be recognized and correctly segmented
into words, there may still be ambiguity of meaning until all the words are
grouped into appropriate syntactic units.

5.9. Word boundaries.

    Another problem of ambiguity concerns the location of word boundaries.
Occasionally a sequence of phonemes occurs which has one interpretation with
the word boundary inserted at one location, and another meaning with it
inserted at another location. This may involve shifting the boundary by a
single phoneme, such as /greiteip/, which may be interpreted as 'grey tape'
or 'great ape', or it may mean moving the word boundary by a whole syllable;
for example, /laithauskip3/ may mean 'light housekeeper' or 'lighthouse
keeper'.

    6. CLASSIFICATION OF ASR SYSTEMS.

6.1. Vocabulary Size Restrictions.

Small: 100-300 words
Medium: 1000 words
Large: 10k-50k words

6.2. Speaking Style Restrictions.

Isolated Word
Connected Word
Continuous Word

6.3. Speaker Dependence.

Speaker-dependent
Speaker-adaptive
Speaker-independent (Closed-speaker)
Speaker-independent

    7. RELEVANT ISSUES OF ASR DESIGN.

Environment       Type of noise; signal/noise ratio;
                  working conditions
Transducer        Microphone; telephone
Channel           Band amplitude; distortion; echo
Speakers          Speaker-dependence, speaker-independence;
                  sex; age; physical state
Speech Styles     Voice tone (quiet, normal, shout);
                  production (isolated words,
                  continuous speech, read or
                  spontaneous speech); speed (slow,
                  normal, fast)
Vocabulary        Characteristics of available training
                  data; specific or generic vocabulary
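
The signal/noise ratio listed under Environment can be estimated from two recordings, one containing speech plus background noise and one containing the background alone, in the spirit of the measure-and-subtract idea described for constant background sound. The following is only a minimal sketch under stated assumptions: the function names `rms` and `snr_db` are hypothetical, the noise is assumed to be stationary and uncorrelated with the speech, and samples are plain Python floats.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_plus_noise, noise_only):
    """Estimate the signal/noise ratio in decibels.

    Assumption: the noise is stationary and uncorrelated with the speech,
    so speech power is approximately total power minus noise power.
    """
    total_power = rms(speech_plus_noise) ** 2
    noise_power = rms(noise_only) ** 2
    if noise_power == 0.0:
        return math.inf  # no measurable background noise
    # Clamp so that a noisy estimate never yields a non-positive power.
    speech_power = max(total_power - noise_power, 1e-12)
    return 10.0 * math.log10(speech_power / noise_power)


if __name__ == "__main__":
    # Synthetic illustration: a fan-like hum and a louder segment with speech.
    hum = [0.1, -0.1] * 50
    noisy_speech = [0.5, -0.4, 0.6, -0.5] * 25
    print(f"estimated SNR: {snr_db(noisy_speech, hum):.1f} dB")
```

Holding the microphone closer to the mouth raises the speech power while leaving the noise power roughly unchanged, which is why it increases this ratio.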


    8. THE FUTURE CHALLENGES OF ASR.
    Looking ahead, the ultimate challenge for designers is to develop a
system that can match the ability of humans to recognize languages.
According to Microsoft's Whisper project on speech research, their goal is
to "develop a general purpose, speaker-independent, continuous speech
recognition engine that can recognize unrestricted text and is effective for
command and control, dictation, and conversational systems." While this
ambitious goal appears lofty, it may not be that far away.

    Further research will continue to improve statistical models for
analyzing speech, not only by improving mathematical algorithms that adapt
to the unique style of the speaker, but also by better handling of the
varying environments in which a speaker might use the application.
Researchers are expending a fair amount of resources to strengthen the
underlying statistics behind their language models. However, speech
recognition devices will never penetrate the mainstream market unless they
are able to offer better control of outside noise (e.g. in an office or in
a car).
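
To illustrate what "the underlying statistics behind their language models" can look like, here is a minimal sketch of a bigram language model used to choose between the homophones 'to', 'too' and 'two', which cannot be separated at the acoustic level. Everything in it is an assumption made for illustration: the tiny corpus is invented, the function names are hypothetical, and add-one smoothing is just one simple choice. It is not the method of any system named in this paper.

```python
from collections import Counter

# Tiny invented corpus; a real model would be trained on far more text.
CORPUS = [
    "i want to go home",
    "i have two cats",
    "that is too much",
    "she went to the shop",
    "two dogs ran to the park",
    "it is too late",
]


def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def pick_homophone(prev, candidates, nxt, unigrams, bigrams):
    """Choose the candidate that best fits between its two neighbours."""
    def score(word):
        return (bigram_prob(prev, word, unigrams, bigrams)
                * bigram_prob(word, nxt, unigrams, bigrams))
    return max(candidates, key=score)


if __name__ == "__main__":
    uni, bi = train_bigrams(CORPUS)
    for prev, nxt in [("have", "cats"), ("is", "late"), ("went", "the")]:
        choice = pick_homophone(prev, ["to", "too", "two"], nxt, uni, bi)
        print(prev, "[" + choice + "]", nxt)
```

The acoustic front end supplies the candidate set; the word-level statistics then rank the candidates using the neighbouring words.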

                9. CONCLUSION.
    This paper has discussed the problems in building accurate and robust
speech recognition systems. We categorize these problems as recognition
units (phonemes, syllables, diphones etc.), variability (speaker, context,
environment etc.) and ambiguity (homophones, word boundaries etc.). The
basic problem is the paradox that speech consists of a continuous stream of
sound with no obvious discontinuities at the boundaries between the words,
and yet speech is perceived as a sequence of words. It is almost impossible
to predict accurately the rate of progress in any scientific field. However,
based on the progress made over the past decade, it seems reasonable to make
some broad projections as to where speech recognition is headed in the next
decade.
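
The paradox of a continuous sound stream that is nevertheless heard as words can be made concrete with the 'grey tape' / 'great ape' example from the section on word boundaries. The sketch below enumerates every way of cutting an unbroken phoneme string into dictionary words. It is only an illustration: the four-entry lexicon, its simplified transcriptions and the function name `segmentations` are hypothetical.

```python
# Hypothetical mini-lexicon mapping simplified phoneme strings to words.
LEXICON = {
    "grei": "grey",
    "greit": "great",
    "teip": "tape",
    "eip": "ape",
}


def segmentations(phonemes, lexicon):
    """Return every way to split `phonemes` into a sequence of lexicon words."""
    if not phonemes:
        return [[]]  # one way to segment nothing: the empty word sequence
    results = []
    for end in range(1, len(phonemes) + 1):
        prefix = phonemes[:end]
        if prefix in lexicon:
            # Keep this word, then segment whatever follows it.
            for rest in segmentations(phonemes[end:], lexicon):
                results.append([lexicon[prefix]] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("greiteip", LEXICON):
        print(" ".join(words))
```

Both readings survive the lexical search, so the choice between them has to come from higher-level knowledge such as syntax or context.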

				