• Speech recognition is the process of converting a
speech signal to a sequence of words.
– Voice Dialing (e.g. Call Home)
– Call Routing (e.g. I would like to make a Collect Call)
– Data Entry (e.g. entering credit card number)
– Document Preparation (e.g. radiology report)
• Started w/ Alexander Graham Bell
• By discovering how to convert air pressure waves
(sound) into electrical impulses, he began the
process of uncovering the scientific/mathematical
basis of understanding speech.
• Discovered that a wire vibrated by a voice could
be made to vary its resistance and produce a
current when immersed in a conducting liquid.
• This led to the invention of the telephone.
• 1950s, Bell Laboratories: 1st speech recognizer
• 1970s, ARPA Speech Understanding Research:
Objective of automatic speech recognition is the
understanding of speech not words.
• 1980s: Speaker-independent recognition of small
vocabularies, large-vocabulary voice recognition.
• Present: Real-time, continuous speech systems
that augment command, security, and content
creation tasks w/ exceptionally high accuracy.
What is Sound?
• Sounds produced in speech are traveling
waves, which are oscillations of air pressure
• Sound is a form of energy
• Carried by air molecules that vibrate
• Vibrates the air like a slinky!
Hertz (Hz)
• SI unit of frequency
• Freq = 1/T, T = Period
• Means “one cycle per second,” which can
also mean “one vibration per second”
• The average human can hear frequencies in the
range of 20 Hz to 16,000 Hz
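The relation Freq = 1/T can be checked numerically. The Python sketch below is illustrative only: the function names and the 2 ms example period are assumptions, and the default limits are simply the 20 Hz and 16,000 Hz figures quoted in the bullet above.

```python
def frequency_from_period(period_s: float) -> float:
    """Return the frequency in hertz for a period in seconds (f = 1/T)."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_s


def is_audible(freq_hz: float, low_hz: float = 20.0, high_hz: float = 16_000.0) -> bool:
    """Check a frequency against the average hearing range quoted above."""
    return low_hz <= freq_hz <= high_hz


# A wave that repeats every 2 ms completes about 500 cycles per second.
f = frequency_from_period(0.002)
print(f, is_audible(f))
```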
Difference Between a Consonant and a Vowel
• Ferdinand de Saussure – vowels have a higher degree of
aperture in the oral cavity than consonants
• Leonard Bloomfield
– Vowels have “modifications of voice-sound
that involve no closure, friction, or contact of
the tongue or lips”
– Consonants are “the other sounds”
• Chomsky and Halle:
– Vowel: air stream does not meet any major
obstacle or constriction in its way from the
lungs out of the mouth, and the articulation of
the sound allows spontaneous voicing
– Consonant: articulation of a consonant always
involves some kind of blocking of the air
– Many contemporary linguists follow this view
What is voicing?
• Voiced means that tone is present (vibration of the vocal cords)
• All vowels are voiced
• Some consonants are voiced
• Difficult to determine difference between
consonants and vowels through voicing
• Ex: L’s and R’s can be detected as vowels through
resonant frequencies from vocal tract
• A formant is a peak in an acoustic frequency
spectrum which results from the resonant frequencies
of any acoustical system.
• Formants are the distinguishing or meaningful
frequency components of human speech and of singing.
• Formants are the characteristic partials that identify
vowels to the listener
• One of the most widely used forms of speech
synthesis is formant synthesis. At least three
formants are generally required to produce intelligible
speech and up to five formants to produce high quality
speech.
1st formant 150-850 Hz
2nd formant 500-2500 Hz
3rd formant 1500-3500 Hz
4th formant 2500-4800 Hz
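The ranges above overlap, so a single spectral peak cannot always be assigned to one formant from its frequency alone. A minimal Python sketch of that lookup follows; the names `FORMANT_RANGES_HZ` and `candidate_formants` are hypothetical, and the numbers are copied from the list above.

```python
from typing import List

# Formant ranges in Hz, copied from the list above.
FORMANT_RANGES_HZ = {
    1: (150, 850),
    2: (500, 2500),
    3: (1500, 3500),
    4: (2500, 4800),
}


def candidate_formants(peak_hz: float) -> List[int]:
    """Return the formant numbers whose listed range contains peak_hz."""
    return [n for n, (lo, hi) in FORMANT_RANGES_HZ.items() if lo <= peak_hz <= hi]


print(candidate_formants(700))   # inside both the 1st and 2nd ranges
print(candidate_formants(3000))  # inside both the 3rd and 4th ranges
```

Because of the overlap, a practical system would need more than this lookup (for example, the ordering of peaks within one spectrum) to label formants.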
Pythagoras and Sound
Pythagoras is credited with developing our
first basic understanding of the harmonic series.
It is said that one day when walking
through town, he noticed that the tone
created when a blacksmith struck his anvil
varied according to the weight of the
hammer. He expanded this idea to
experiments involving different lengths of
string held at equal tensions.
Pythagoras and Pitch
Pythagoras found that when you plucked a string at a certain
tension you got a tone at a particular pitch, but also that
plucking a string half the length of your original string would
cause it to vibrate twice as fast and produce a pitch one octave
higher.
Pythagoras and Intervals
After further experimentation, he found that dividing the
string into other proportions provided different musical
intervals:
Fundamental Tone (1 : 1)
Octave (2 : 1)
Fifth (3 : 2)
Fourth (4 : 3)
Major Third (5 : 4)
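The ratios above can be turned into concrete pitches. The Python sketch below reads each ratio as a frequency ratio relative to the fundamental, which is an assumption, as are the function names and the 440 Hz example fundamental. It also reports the fraction of the original string length that would sound each interval, assuming frequency is inversely proportional to length at fixed tension.

```python
from fractions import Fraction

# Frequency ratios relative to the fundamental, as listed above.
INTERVAL_RATIOS = {
    "fundamental": Fraction(1, 1),
    "octave": Fraction(2, 1),
    "fifth": Fraction(3, 2),
    "fourth": Fraction(4, 3),
    "major third": Fraction(5, 4),
}


def interval_frequency(fundamental_hz: float, interval: str) -> float:
    """Frequency of the named interval above the given fundamental."""
    return float(fundamental_hz * INTERVAL_RATIOS[interval])


def string_length_fraction(interval: str) -> Fraction:
    """Fraction of the original string length that sounds the interval."""
    return 1 / INTERVAL_RATIOS[interval]


for name in INTERVAL_RATIOS:
    print(name, interval_frequency(440.0, name), string_length_fraction(name))
```

For the octave, the length fraction comes out as 1/2, matching the half-length string described earlier.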
Pythagoras and Waves
Through these experiments, Pythagoras determined that the
pitch (or frequency) of a sound varies inversely with the
length of the vibrating string, provided that the string
remains at the same tension and has a uniform density and
thickness.
This relationship would later be the basis of the standing
wave equation we know from physics:
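A standard textbook form of that relation, for an ideal string fixed at both ends, is sketched below. The symbols are assumptions introduced here rather than notation from the slides: L is the string length, F the tension, \mu the mass per unit length, and n the harmonic number. F is used for tension so that it does not clash with the period T used earlier.

```latex
f_n = \frac{n}{2L}\sqrt{\frac{F}{\mu}}, \qquad n = 1, 2, 3, \dots
```

With F and \mu held fixed, f_1 is proportional to 1/L, so halving the string length doubles the frequency and raises the pitch by one octave.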
THE WAVE EQUATION
The wave equation is essential to speech recognition because
every sound can be modeled with the wave equation, so a better
understanding of the equation also gives a better understanding
of the nature of each individual sound. The recognition problem
can be simplified by first distinguishing only vowels from
consonants, and the wave equation helps provide the answer.
ONE DIMENSIONAL WAVE
• As with all partial differential equations,
suitable initial and/or boundary conditions
must be given to obtain solutions to the
equation for particular geometries and starting conditions.
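For reference, the usual one-dimensional form is written out below, with an example of boundary conditions. The notation is an assumption made here: u(x, t) is the displacement at position x and time t, c is the wave speed, and L is the length of a string fixed at both ends.

```latex
\frac{\partial^2 u}{\partial t^2} = c^2 \, \frac{\partial^2 u}{\partial x^2},
\qquad u(0, t) = u(L, t) = 0

u_n(x, t) = \sin\!\left(\frac{n \pi x}{L}\right) \cos\!\left(\frac{n \pi c\, t}{L}\right),
\qquad f_n = \frac{n c}{2L}, \qquad n = 1, 2, 3, \dots
```

Each u_n satisfies both the equation and the fixed-end conditions, and its frequency f_n is inversely proportional to the string length.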
Tuning Fork
• Metal, two-pronged device that produces a
tone when it vibrates
• Commonly used to tune instruments using
the central A note at 440 Hz.
• Tuning forks can also be used for hearing
tests
• Can tell if there is a problem with the nerves
themselves or with sound getting to the nerves
• Only twelve fixed note frequencies in each octave
• All notes are centered around the A4 note (440 Hz)
• The central A (A4) has a frequency of 440 Hz; a note n half
steps away has a frequency of 440 x 2^(n/12) Hz. To
calculate C5 from A4, we have:
• A — (1) → A♯ — (2) → B — (3) → C, so n = 3
• Notes contain the letter,
any sharp or flat
associated with it, as well
as an octave number.
• Any note can be
calculated because it is an
integer number of half steps
away from the central A.
• Octaves automatically
yield factors of two
times the original frequency
• General formula used
to calculate each note's frequency
• Used in the Musical
Instrument Digital
Interface (MIDI).
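The half-step formula can be sketched in a few lines of Python. The function names are hypothetical. The only fact added beyond the slides is the MIDI convention that note number 69 corresponds to A4, which lets the same formula be indexed by MIDI note number.

```python
A4_HZ = 440.0       # central A, from the text
A4_MIDI_NOTE = 69   # MIDI convention: note number 69 is A4


def note_frequency(half_steps_from_a4: int) -> float:
    """Frequency of the note n half steps above (or below, if negative) A4."""
    return A4_HZ * 2.0 ** (half_steps_from_a4 / 12)


def midi_note_frequency(midi_note: int) -> float:
    """Same formula, indexed by MIDI note number instead of half steps."""
    return note_frequency(midi_note - A4_MIDI_NOTE)


# C5 is three half steps above A4: A -> A# -> B -> C.
c5 = note_frequency(3)
print(round(c5, 2))          # 523.25
print(note_frequency(12))    # one octave up: 880.0
```

Twelve half steps give a factor of exactly 2, which is why octaves double the frequency.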