Automatic Speech Recognition Techniques
Today, speech recognition research is interdisciplinary, drawing upon work in fields as diverse as biology, computer science,
electrical engineering, linguistics, mathematics, physics, and psychology. Within these disciplines, pertinent work is being
done in the areas of acoustics, artificial intelligence, computer algorithms, information theory, linear algebra, linear system
theory, pattern recognition, phonetics, physiology, probability theory, signal processing, and syntactic theory.
Speech recognition systems are generally classified as discrete or continuous, and as speaker-dependent,
speaker-independent, or speaker-adaptive. Discrete systems maintain a separate acoustic model for each word, combination
of words, or phrase and are referred to as isolated (word) speech recognition (ISR) systems. Continuous speech recognition
(CSR) systems, on the other hand, respond to a user who pronounces words, phrases, or sentences in a connected series, so
that each one depends on its neighbors, as if linked together.
A speaker-dependent system requires that the user record an example of the word, sentence, or phrase prior to its being
recognized by the system; that is, the user "trains" the system. Some speaker-dependent systems require only that the user
record a subset of system vocabulary to make the entire vocabulary recognizable. A speaker-independent system does not
require any recording prior to system use. A speaker-independent system is developed to operate for any speaker of a
particular type (e.g., American English). A speaker-adaptive system is developed to adapt its operation to the characteristics
of new speakers.
ISR systems present a considerably easier task for machines than do CSR systems. Speaker-dependent systems are
simpler to construct and use and are more accurate than speaker-independent systems. As a result, the focus of early voice
recognition systems was primarily speaker-dependent isolated word systems that used limited vocabulary. At the time,
overcoming the restrictions in the state of technology required a greater focus on human-to-computer interaction. The
challenge was to identify how improved speech recognition technology could be used to support the enhancement of human
interaction with machines.
Most modern speech recognition uses probabilistic models to interpret a sequence of sounds. Hidden Markov models, in
particular, are used to recognize words. To increase word accuracy in speech recognition, language models are used to
capture the information that certain word combinations are more likely than others, thus improving detection based on
Automatic speech recognition performs poorly in noise, especially with crosstalk from other speakers. Humans are very
tolerant of noisy environments, but automated speech recognition degrades rapidly as noise increases. Signal corruption
from background speech in multiple-speaker environments is particularly troublesome. Biologically inspired neural networks
show promise for noise-tolerant spoken-language interfaces in such situations.
An important element in the creation of a speech recognition system is the size of the vocabulary. The vocabulary of a
speech recognition system affects the complexity, processing requirements, and accuracy of the system. It is
much easier to look up the definition of one of 20 words in a 20-word dictionary than one of hundreds of thousands of
words in a Webster’s dictionary. That is essentially what the speech recognition software is doing: accessing a dictionary of
phonemes and words. [A phoneme, although difficult to describe, is basically the smallest unit of phonetic speech that
distinguishes one word from another. Every word can be broken down into units of individual sounds that make up that word.
Each of these units is a phoneme.]
Another important factor in the complexity of a speech recognition system is the type of speech that
the recognition system uses: discrete or continuous. In a discrete speech system, the operator must pause between
words, which makes the speech recognition task much easier. This is the simplest form of recognition to perform, because the
end points of words are easier to find, and the pronunciation of a word tends not to affect others. Because each
word is therefore pronounced more consistently from one occurrence to the next, it is easier to recognize.
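Why pauses make end points easy to find can be illustrated with a minimal energy-based endpoint detector. This is only a sketch under stated assumptions: the function name, the frame length, the energy threshold, and the minimum pause length are all hypothetical choices made for illustration, and practical systems use more robust criteria.

```python
def find_word_endpoints(samples, frame_len=160, threshold=0.01, min_silence_frames=3):
    """Return (start_sample, end_sample) pairs for runs of high-energy frames.

    All parameter defaults are illustrative assumptions, not values from the text.
    """
    # Mean squared amplitude of each non-overlapping frame.
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)

    segments = []
    seg_start = None   # frame index where the current word began
    silence_run = 0    # consecutive low-energy frames seen inside a word
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if seg_start is None:
                seg_start = i
            silence_run = 0
        elif seg_start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The pause is long enough: the word ended at the first silent frame.
                end_frame = i - silence_run + 1
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start = None
                silence_run = 0
    if seg_start is not None:
        # The signal ended while a word was still open.
        end_frame = len(energies) - silence_run
        segments.append((seg_start * frame_len, end_frame * frame_len))
    return segments
```

With clearly separated words, each pause longer than the minimum closes one segment. In continuous speech there are no such pauses, so a detector of this kind would return one long segment and word boundaries would have to be found by other means.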
A continuous speech system operates on speech in which words are connected together, i.e., not separated by pauses.
Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points
of words. Another problem is "coarticulation." The production of each phoneme is affected by the production of surrounding
phonemes, and similarly the starts and ends of words are affected by the preceding and following words. The recognition of
continuous speech is also affected by the rate of speech (fast speech tends to be harder).
A dictionary specifies, for every word that may be used in the network, the legal sequence of acoustic models for its individual speech sounds.
Note that a dictionary may contain multiple pronunciations of the same word. There are usually two types of dictionaries, a
system dictionary and a user dictionary. The system dictionary contains a non-modifiable list of words that the development
software will recognize. It is sometimes possible to select a subset of the full system dictionary, which will be faster to load for
the speech recognition process. User dictionaries are usually created from scratch by the user, but can also be subsets of
the system dictionary.
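The dictionary arrangement described above can be sketched as a simple mapping from words to one or more phoneme sequences. Everything in this example is an assumption made for illustration: the words, the ARPAbet-style phoneme symbols, and the helper names `select_subset` and `lookup` are hypothetical and do not come from any particular recognition toolkit.

```python
# Illustrative system dictionary: each word maps to a list of pronunciations,
# and each pronunciation is a tuple of phoneme symbols (assumed, ARPAbet-style).
SYSTEM_DICTIONARY = {
    "the": [("DH", "AH"), ("DH", "IY")],          # two pronunciations
    "read": [("R", "IY", "D"), ("R", "EH", "D")],  # two pronunciations
    "yes": [("Y", "EH", "S")],
    "no": [("N", "OW")],
}


def select_subset(dictionary, words):
    """Return a smaller dictionary holding only the requested words."""
    return {w: dictionary[w] for w in words if w in dictionary}


def lookup(phonemes, system_dict, user_dict=None):
    """Return the sorted words that have a pronunciation equal to `phonemes`."""
    key = tuple(phonemes)
    merged = dict(system_dict)  # the system dictionary itself is left unmodified
    for word, prons in (user_dict or {}).items():
        merged[word] = list(merged.get(word, [])) + list(prons)
    return sorted(w for w, prons in merged.items() if key in prons)
```

A user dictionary here is just a second mapping merged in at lookup time, which keeps the system dictionary untouched while still letting the user add words or extra pronunciations.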
The sets of acoustic models can be "trained" on speech recorded from multiple users. These model sets take
into account such variations as pronunciation (dialect), accent, etc., for the individual speakers. It is important to train the
model sets in an environment similar to the one that will be used in the recognition process; for example, do not train a
model set using a headset microphone if the recognition environment is going to involve access via telephone.
Very simply, a speech recognition process involves processing raw acoustic data through a recognizer, which matches the
acoustic data with a set of acoustic models using a decoder to generate a recognition hypothesis.
The speech recognition process begins with the digital sampling of the verbalized input of the user. This input might be from
a source gathered offline and stored on a disk, or directly from a real-time source of sampled data such as a workstation’s
The next stage is acoustic signal processing, where the digitized verbal input is split into a series of discrete "observations."
The hope here is that these observations are a faithful representation of the verbalized input from the user. Most techniques
include spectral analysis; e.g., Linear Predictive Coding (LPC) analysis, Mel Frequency Cepstral Coefficients (MFCC),
cochlea modeling, and others.
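The splitting of the digitized input into discrete observations can be sketched as follows. This is a deliberately simplified illustration and not an implementation of LPC or MFCC analysis: it cuts the signal into overlapping Hamming-windowed frames and reduces each frame to a single log-energy value. The frame length, hop size, and function names are assumptions chosen for the example.

```python
import math


def frame_signal(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames and apply a Hamming window.

    frame_len and hop are illustrative assumptions (25 ms and 10 ms at 16 kHz).
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


def log_energy(frame, floor=1e-10):
    """Log of the frame energy, floored so that silence does not give log(0)."""
    return math.log(max(sum(x * x for x in frame), floor))


def observations(samples, frame_len=400, hop=160):
    """One observation (here a single log-energy value) per frame."""
    return [log_energy(f) for f in frame_signal(samples, frame_len, hop)]
```

A real front end would replace `log_energy` with a vector of spectral features per frame, but the framing step and the one-observation-per-frame structure would stay the same.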
An attempt is then made to match the discrete observations with a known set of acoustic models. Each model at its core
represents a phoneme. A sequence of models is combined into a word or phrase using a dictionary. The dictionary specifies the
pronunciation of each word as a sequence of phonemes. This step can be accomplished by the use of a number of different
processes such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks (NNs), expert systems,
as well as combinations of these techniques. HMM-based systems are currently the most commonly used and most
During the recognition phase, the existing, trained acoustic models are compared with the processed voice input (discrete
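observations). A decoder (e.g., one based on the Viterbi algorithm) is used to match the voice input with the most likely
acoustic models as a path is traced through the network. (Baum-Welch, often mentioned alongside Viterbi, is an algorithm for
training HMM parameters rather than for decoding.) The decoder transcribes the continuous speech input into a sequence of textual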
symbols which an application can directly process. The goal is to match up the symbols into recognizable groups by
comparing them with the acoustic speech models.
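The decoding step can be illustrated with a small Viterbi sketch over a discrete hidden Markov model. The model below is a toy assumption made for illustration only: the two phoneme-like states, the three quantized observation labels, and all probabilities are invented and do not describe any real acoustic model. The code also assumes that no probability in the model is zero, since it works with logarithms.

```python
import math

# Toy model (all names and numbers are illustrative assumptions).
STATES = ["AH", "S"]
START_P = {"AH": 0.6, "S": 0.4}
TRANS_P = {"AH": {"AH": 0.7, "S": 0.3}, "S": {"AH": 0.4, "S": 0.6}}
EMIT_P = {
    "AH": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "S": {"low": 0.1, "mid": 0.3, "high": 0.6},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for a discrete HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, math.exp(best[-1][last])
```

For the observation sequence `["low", "mid", "high"]`, this toy model yields the path `["AH", "AH", "S"]`. A practical decoder applies the same recursion to a much larger network built from the dictionary and language model, usually with pruning to keep the search tractable.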
The end product of this phase is the speech recognition process's best approximation of the verbal input, in a form that can be
used through an Application Programming Interface (API). It is through the API that an application takes the decoded verbal
input from the speech recognition process and performs an action in response to the original verbal input.