Landmark-Based Speech Recognition
Landmark-Based Speech
Recognition:
Spectrogram Reading,
Support Vector Machines,
Dynamic Bayesian Networks,
and Phonology
Mark Hasegawa-Johnson
jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 10. Phonology
• Foundations of distinctive feature theory
– Writing, phonemes, and the place-manner distinction
– A universal encoding of phonemes: Bell
– A universal binary encoding of phonemes: Jakobson
• Distinctive features and speech perception
– Information theory: Shannon
– Independence of feature errors: Miller
– Dependence of distinctive feature acoustics!!
• Distinctive features and speech articulation
– Rule-based modification of distinctive features: Halle
– Syllable boundaries and word boundaries
– Psychological processing, the lexicon, and surface phonology
– Spreading of autosegmental distinctive features: Goldsmith
• Unification of perceptual and articulatory accounts: Stevens
– The quantal theory of speech production
– Articulator-free features, articulator-bound features, and landmarks
– Language-specific features
– Redundant features
A Brief History of Writing
(Diamond, Guns, Germs, and Steel, 1999)
• Writing developed 5000 years ago independently in at least three places
– Mesopotamia (modern Iraq): recording commercial transactions
– Egypt (first dynasty, 3150 BCE): recording imperial conquest
– China (Shang dynasty): recording divinations
• Symbols were meaning-based (word glyphs)
• A word glyph can also be used to record similar-sounding nonsense
words or foreign words
– “Narmer,” first pharaoh of Egypt, 3150 BCE
– “Meiguo” = “America” in modern Chinese
• Phoenicians (3000 years ago) were the first to abandon word-glyphs in
favor of a purely sound-based writing system
– Phoenicians built trading empires based in Lebanon (Tyre and Sidon) and
Carthage; their empires were independent until Roman conquest
– “Phoenician” is the Greek name for them
– The 22 symbols of their alphabet were based on word glyphs (e.g.,
“alpha”=“ox”), but they only used the 22 that were necessary to write all
sounds of their language, and discarded the remainder
– “Phoneme” comes from the Greek phōnēma, “a sound uttered”; the resemblance to “Phoenician” is a handy mnemonic, not an etymology
A Few Phonemic & Syllabic Writing
Systems
• Semitic, related to Phoenician symbols:
– Hebrew (א, ב, ג, ד, ה – alef, bet, gimel, dalet, he)
– Greek (α, β, γ, δ, ε – alpha, beta, gamma, delta, epsilon)
– Cyrillic (а, б, в, г, д – a, be, ve, ghe, de)
– Latin (a, b, c, d, e, …)
– Arabic (ا, ب, ت, ج, ح – alef, beh, teh, jeem, hah)
• Nubian (modern Ethiopia; independently developed in
response to Egyptian writing)
• Indian: Hindi, Punjabi, Bengali, Gujarati, Grantha, …
• East Asian: Korean (hangul), Japanese (hiragana,
katakana), Tagalog (baybayin)
• American Indian: Cherokee (independently designed in
response to English)
Manner/Place Distinction: Hangul
(King Sejong of Joseon, Hunmin jeongeum, 1446)
• Origin of Hangul
– Attributed to: King Sejong of Joseon (Korea), 15th century
– Goal: Literacy for commoners who could not read Chinese
– Therefore: System should be as easy as possible to memorize
• Innovation: manner/place distinction
– Place of articulation encoded using symbols representing the shape
of the articulator
• 5 places of articulation: labial (/m/:ㅁ), alveolar (/n/:ㄴ), dental? (/s/:ㅅ),
velar (/g/:ㄱ), glottal (/ŋ/:ㅇ)
– Manner of articulation is changed from nasal/fricative to stop/affricate
by adding one stroke
• /m/→/b/, /n/→/d/, /s/→/ǰ/
– An obstruent is marked as unvoiced by adding a second stroke
• /b/→/p/, /d/→/t/, /ǰ/→/č/, /ŋ/→/h/
• Exception: /g/=base symbol for velar place, /k/ has just one extra stroke
Universal Encoding of Phonemes
Using a Manner/Place Encoding
(Bell, Visible Speech, 1867)
• Goal: an international phonetic alphabet
– If any language in the world distinguishes two sounds,
then Bell’s alphabet should distinguish them
• Proposed encoding:
– Base symbol = half-moon or “C” shape
– Place encoding: angle of the symbol
– Manner and voicing: extra strokes across the symbol
• (End result: system was scrapped because it was
too expensive to typeset using movable-type
printing. Modern IPA was developed instead).
Universal Binary Encoding of
Phonemes
(Jakobson, Fant, and Halle, 1952)
• Jakobson’s minimalist program:
– An 8-way place distinction is composed of exactly 3 binary distinctive
features
– Distinctive features are not arbitrary; they are based on physical properties
of the sound (its articulation or its acoustics)
• Contribution of the engineer (Fant):
– Distinctive features are based on spectral shape
• Jakobson’s consonantal distinctive features (in most cases, the first
feature is more frequently used):
– Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
– Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or
one narrow peak, like /č,š,ţ,ş,k,h/)
– Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy
frication, like /f,θ,h/)
– Voiced vs. Unvoiced (/b,v,δ,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
– Sonorant vs. Obstruent (unobstructed voicing, like /m,n,ng/, vs. obstructed
voicing, like /p,b,f,v,t,d,s,z,k,g/)
Binary Encoding of Phonemes
(Jakobson, Fant, and Halle, 1952)
• Jakobson’s minimalist program:
– All phoneme distinctions are binary. For example, if there are 10 places of
articulation in the world’s languages, there must be ceil(log2(10))=4 binary
distinctive features to encode place of articulation.
– Distinctive features are not arbitrary; they are based on physical properties
of the sound (its articulation or its acoustics)
• Apparent contribution of the engineer (Fant):
– Distinctive features are based on spectral shape
• Jakobson’s consonantal distinctive features:
– Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
– Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or one
narrow peak, like /č,š,ţ,ş,k,h/)
– Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy
frication, like /f,θ,h/)
– Voiced vs. Unvoiced (/b,v,δ,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
– Nasal vs. Non-nasal (unobstructed voicing, like /m,n,ng/, vs. obstructed
voicing, like /p,b,f,v,t,d,s,z,k,g/)
– … etcetera. There were 12 features in the original set.
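The bit-counting argument on this slide can be checked in a couple of lines. This is only a sketch; the function name `features_needed` is hypothetical and not from the lecture.

```python
import math

def features_needed(n_categories: int) -> int:
    """Smallest number of binary features that can distinguish n_categories."""
    return math.ceil(math.log2(n_categories))

print(features_needed(8))   # 3: an 8-way place distinction needs exactly 3 features
print(features_needed(10))  # 4: 10 places of articulation would need 4 features
```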
Speech Perception:
Distinctive Features as an
Encoding for a Communications
Channel
Information Theory
(Shannon, 1948)
• A Mathematical Definition of the word
“Information”
– “Entropy” = the degree to which the outcome of
some random process is impossible to predict
• H = − ∫p(x)log2(p(x))dx = −E[log2p(x)]
• Why it makes sense: the entropy of a coin thrown
twice is 2X the entropy of a coin thrown once
(p(x1,x2)=p(x1)p(x2), so H(x1,x2)=H(x1)+H(x2))
• One “bit” = the amount of entropy in one unbiased
coin toss
– “Information” = the amount by which your
uncertainty about x is reduced if you know y:
• I(x,y) = H(x)−H(x|y)
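A small numerical sketch of these definitions, assuming discrete distributions (sums in place of the integral above). The helper names `entropy` and `mutual_information` are illustrative, not from the lecture.

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(x,y) = H(x) - H(x|y) for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_x_given_y = 0.0
    for j, pyj in enumerate(py):
        if pyj > 0:
            # H(x|y) is the average over y of the entropy of p(x | y)
            h_x_given_y += pyj * entropy([row[j] / pyj for row in joint])
    return entropy(px) - h_x_given_y

print(entropy([0.5, 0.5]))   # 1.0 bit: one unbiased coin toss
print(entropy([0.25] * 4))   # 2.0 bits: two independent tosses

# y is a copy of x: knowing y removes all uncertainty about x.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# y is independent of x: knowing y tells us nothing about x.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```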
Information Theory
(Shannon, 1948)
• “Channel capacity” = the rate at which information
can be conveyed through a noisy channel, in
bits/second
– C_AB = max_A I(x,x*), where x = input to the channel per second,
A(x) = symbols transmitted over the channel, B(x) = recovered
symbols at the receiver, and x* = maximum posterior probability
estimate of x given A*(x).
• Shannon’s theorem:
– Given a decoding machine with enough memory, it is
always possible to achieve channel capacity using an
A(x) of the following form:
• Encode x using the smallest possible number of bits
• Add redundancy until bit rate is reduced to C_AB, then transmit.
Miller’s Linguistic Interpretation of
Information Theory
(Miller and Nicely, 1955)
• The “channel” for speech = acoustic channel
– Information can be conveyed over 32 critical bands of the human ear
– Rate at which information can be transmitted in each band depends
on SNR in band
• The minimal binary encoding of speech = distinctive
features
• The “encoding” is the speech production mechanism: slow
movements of the tongue and lips add redundancy matched
to the noise conditions of the channel
– When the channel is bad, speaker slows down and speaks clearly
– When channel is good, the main source of error is pronunciation
variability, i.e., talker leaves out phonemes that are only important for
redundancy
In the Perceptual Space, Distinctive
Feature Errors are Independent
(Miller and Nicely, 1955)
• Experimental Method:
– Subjects listen to nonsense syllables mixed with noise (white noise
or BPF)
– Subjects write the consonant they hear
• Results:
p(q*|q,SNR,BPF) ≈ ∏i p(fi* | fi,SNR,BPF)   (product over the features i = 1,…,6)
q* = consonant label heard by the listener
q = true consonant label
F*=[f1*,…,f6*] = perceived distinctive feature labels
F=[f1,…,f6] = true distinctive feature labels
[±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
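The factorization above can be sketched as follows. The per-feature accuracies in `P_CORRECT` are hypothetical placeholders, not Miller and Nicely's measurements, and the sketch further assumes that each feature is equally likely to be misheard in either direction.

```python
from math import prod

FEATURES = ["nasal", "voiced", "fricated", "strident", "lips", "blade"]

# Feature labels for two consonants that differ only in [voiced].
P_FEATS = {"nasal": False, "voiced": False, "fricated": False,
           "strident": False, "lips": True, "blade": False}   # /p/
B_FEATS = dict(P_FEATS, voiced=True)                          # /b/

# HYPOTHETICAL probability that each feature is heard correctly at some SNR.
P_CORRECT = {"nasal": 0.95, "voiced": 0.90, "fricated": 0.80,
             "strident": 0.80, "lips": 0.60, "blade": 0.60}

def confusion_probability(true_feats, heard_feats, p_correct):
    """p(q*|q) under the approximation that feature errors are independent."""
    return prod(
        p_correct[f] if heard_feats[f] == true_feats[f] else 1.0 - p_correct[f]
        for f in FEATURES
    )

print(confusion_probability(P_FEATS, P_FEATS, P_CORRECT))  # /p/ heard as /p/
print(confusion_probability(P_FEATS, B_FEATS, P_CORRECT))  # /p/ heard as /b/
```

Under this model the two probabilities differ only in the [voiced] factor, so their ratio is (1 − 0.90)/0.90 whatever the other features do.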
Consonant Confusions at -6dB SNR
P T K F TH S SH B D G V DH Z ZH M N
P 80 43 64 17 14 6 2 1 1 1 1 2
T 71 84 55 5 9 3 8 1 1 1
K 66 76 107 12 8 9 4 1 1
F 18 12 9 175 48 11 1 7 2 1 2 2
TH 19 17 16 104 64 32 7 5 4 5 6 4 5
S 8 5 4 23 39 107 45 4 2 3 1 1 3 2 1
SH 1 6 3 4 6 29 195 3 1
B 1 5 4 4 136 10 9 47 16 6 1 5 4
D 8 5 80 45 11 20 20 26 1
G 2 3 63 66 3 19 37 56 3
V 2 2 48 5 5 145 45 12 4
DH 6 31 6 17 86 58 21 5 6 4
Z 1 1 17 20 27 16 28 94 44 1
ZH 1 26 18 3 8 45 129 2
M 1 4 4 1 3 177 46
N 4 1 5 2 7 1 6 47 163
Distinctive Features: ±nasal, ±voiced, ±fricative, ±strident
In the Acoustic Space, Distinctive
Features are Not Independent
(Volaitis and Miller, 1992)
• [±voiced] for English stops is mainly cued by voice onset time (VOT).
• VOT is also a cue for place of articulation: velar > alveolar > labial
• p(place,voicing|VOT) ≠ p(place|VOT) p(voicing|VOT)
Speech Production:
Distinctive features explain the
related pronunciations of
morphologically related words
The English Plural
• Three standard plural endings:
– Cat → cats (/s/)
– Dog → dogs (/z/)
– Dish → dishes (/əz/)
• Observation: you can predict the correct plural by
observing the last phoneme of the word
• Algorithm #1: Create a lookup table
Use This Plural…    If Last Phoneme of the Word Is:
/s/                 /p,t,k,f,θ/
/z/                 /b,d,g,v,δ,m,n,ŋ,r,l,y,w/, any vowel
/əz/                /s,č,š,z,ǰ,ž/
The English Plural
(Chomsky and Halle, Sound Pattern of English, 1968)
• Algorithm #2: Create 3 rules
1. Use /əz/ if the last phoneme is [+strident]
– PLURAL → /əz/ | [+strident]α
2. Otherwise, use /s/ if last phoneme is [-voiced]
– PLURAL → /s/ | [-voiced]α
3. Otherwise, use /z/.
– PLURAL → /z/
• General form of phonological rules: α becomes β
in the context γαδ (α preceded by γ and followed by δ):
– α→β | γαδ
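The three ordered rules can be sketched in a few lines. Two assumptions here are not spelled out in the lecture: the [+strident] set is read off the /əz/ row of the lookup table, and the [-voiced] set is taken from the unvoiced consonants listed earlier.

```python
# Assumed feature classes (see lead-in); phonemes use the slides' symbols.
STRIDENT = {"s", "z", "č", "ǰ", "š", "ž"}
UNVOICED = {"p", "t", "k", "f", "θ", "s", "č", "š"}

def plural_suffix(last_phoneme: str) -> str:
    """Apply the three rules in order; the first rule that matches wins."""
    if last_phoneme in STRIDENT:   # rule 1: [+strident]
        return "əz"
    if last_phoneme in UNVOICED:   # rule 2: [-voiced]
        return "s"
    return "z"                     # rule 3: everything else

# Final phonemes of the example words: cat /t/, dog /g/, dish /š/.
for word, last in [("cat", "t"), ("dog", "g"), ("dish", "š")]:
    print(word, "->", plural_suffix(last))
```

Rule order matters: /s/ and /š/ are both strident and unvoiced, so rule 1 must be tried before rule 2 for them to receive /əz/.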
Stress Shift in English
• Every English word has one stressed syllable. Usually
but not always it is the antepenultimate syllable (3rd from
the end) if the word has 3 or more syllables:
– Professor -- (an exception to the antepenult rule)
– Establishment
• Some suffixes cause the stress to shift:
– Professorial
– Establishmentarian
• Some don’t
– Professorship
– Establishmentarianism (stress stays where it was before -ism was added)
• Prefixes never cause a stress shift
– Unprofessorial
– Antidisestablishmentarianism
Morphemes: Roots, Suffixes, and
Prefixes
• Suffixes and prefixes are examples of “bound
morphemes”
• “Morpheme” = a phoneme sequence with a meaning
– Root words (establish)
– Words that are part of compounds, e.g., rain|fall
– Suffixes and prefixes
• “Bound morpheme” = morpheme that never occurs on its own
– In English: most suffixes and prefixes
– In Chinese: perhaps some words that used to be independent,
but now only appear as bound morphemes, e.g., “hu” in
“hutong”?
Morpheme and Word Boundaries
(Chomsky and Halle, Sound Pattern of English, 1968)
• Chomsky & Halle’s model of speech production:
1. Morphemes of a sentence are sequenced
– the | anti | dis | establish | ment | arian | s | went | home
2. Boundaries between roots and their type-1 suffixes are erased
– the | anti | dis | establishmentarian | s | went | home
3. Type 1 rules, like the “antepenultimate stress rule,” are applied
– the | anti | dis | establishmentarian | s | went | home
4. Boundaries between roots and type-2 suffixes are erased.
Recent proposal: some function words like “the” also attached at
this point, creating a unit called a “prosodic word:”
– the-antidisestablishmentarians | went | home
5. Type 2 rules apply:
– theyantidisestablishmentarians | went | home
6. All remaining boundaries are erased, and speech is produced
– the-antidisestablishmentarians | went | home
Psychological Processing and the
Lexicon
• Experiment: play a word. Subject hits “W” if it’s a
meaningful word (“empower”), “N” if it’s a nonsense word
(“empriffle”).
• Measure the subject’s reaction time. Adjust for word
length.
• Result: subjects recognize two-morpheme words
(“dis+like”) just as fast as one-morpheme words
(“power”).
• Apparent conclusion: Chomsky & Halle’s rules are
usually “precompiled,” not applied in real time.
– If a listener has heard the word frequently enough, it is stored in
her mental lexicon as a whole word.
– If it’s not stored in her mental lexicon, she tries to figure it out by
morphological parsing (“zookeepership”).
– If morphological parsing fails, then she concludes it’s a nonsense
word.
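The three-step decision procedure above (whole-word lookup, then morphological parsing, then rejection) can be sketched as below. The lexicon and morpheme inventory are tiny hypothetical stand-ins, and the sketch works on spellings rather than phoneme strings.

```python
# HYPOTHETICAL stored whole words and known morphemes, for illustration only.
LEXICON = {"power", "empower", "dislike"}
MORPHEMES = {"zoo", "keep", "er", "ship", "dis", "like"}

def can_parse(word: str, morphemes) -> bool:
    """True if word can be split into a sequence of known (non-empty) morphemes."""
    if word == "":
        return True
    return any(
        word.startswith(m) and can_parse(word[len(m):], morphemes)
        for m in morphemes
    )

def lexical_decision(word: str) -> str:
    if word in LEXICON:               # frequent words are stored whole
        return "word (stored whole)"
    if can_parse(word, MORPHEMES):    # otherwise fall back to morphological parsing
        return "word (parsed)"
    return "nonsense"                 # parsing failed

for w in ["empower", "zookeepership", "empriffle"]:
    print(w, "->", lexical_decision(w))
```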
Inter-Word Phonology
• Many words change form in particular contexts
– “this ship” → “thish ship”
• Usually: Only a very small number of phonological rule types can
apply across word boundaries.
– Place assimilation: /s/ in “this” takes the palatal place of /sh/ in “ship”
– Manner assimilation: “in the”→“in ne” with a dental /n/
– In Chinese: tone sandhi (tones change because of tone context)
• These changes occur very frequently
• Perhaps: MOST of the changed forms are very uncommon, so they
are not stored in the mental lexicon, but SOME of the most common
and most severely reduced forms may actually get their own lexical
entry so the listener can respond more quickly:
– “did you” → “didja”
– “I don’t know” → “ǣəo”
– These may be comparable to “multiword” lexical entries in a speech
recognition lexicon
Autosegmental Phonology
(Goldsmith, 1975)
• Inter-word phonological rules all have a simple form: manner or
place assimilation
• Hypothesis: instructions to the speech articulators are arranged in
“autosegmental tiers,” i.e., on a kind of musical score with
asynchronous rows
• Assimilation = feature spreading
Before assimilation          After assimilation
/s/          /sh/            /sh/         /sh/
[-nasal]     [-nasal]        [-nasal]     [-nasal]
[+strident]  [+strident]     [+strident]  [+strident]
[+blade]     [+blade]        [+blade] (shared by both segments)
[+anterior]  [-anterior]     [-anterior] (spread to both segments)
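Feature spreading can be sketched by representing each segment as a bundle of feature values and copying one value leftward. This is a simplification: a true autosegmental representation would link both segments to a single shared feature, whereas this sketch just copies the value. The function name `spread` is hypothetical.

```python
def spread(segments, feature, source, target):
    """Return a copy of segments in which `target` takes `feature` from `source`."""
    result = [dict(seg) for seg in segments]
    result[target][feature] = result[source][feature]
    return result

s  = {"nasal": False, "strident": True, "blade": True, "anterior": True}   # /s/
sh = {"nasal": False, "strident": True, "blade": True, "anterior": False}  # /sh/

before = [s, sh]                                        # "this ship"
after = spread(before, "anterior", source=1, target=0)  # "thish ship"
print(after[0] == sh)  # True: the /s/ has become /sh/
```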
Quantal Theory:
Distinctive Features are not
“just perceptual” or “just
articulatory:” they arise from the
relationship between articulation
and perception
The Speech Chain
(Stevens, Speech Communication, 1999)
[Figure: the speech chain. Discrete inputs (words, phonemes, distinctive features) → speech production planning → muscle activation levels a(t) → speech production → continuous acoustic signal x(t) → auditory perception → auditory nerve signals y(t) → speech perception → discrete outputs (words, nonsense).]
Nonlinearities in the Speech Chain
• The mappings P:a(t)→x(t) (speech production) and
R:x(t)→y(t) (perception) are highly nonlinear
– We have very good models, going back to the 1940s
– Pick a particular vector a(t); we can usually estimate ∇P(a(t)), the
local gradient of P. (Likewise R).
• There are many sigmoidal nonlinearities in both P and R.
– Articulator position, a(t), can vary quite a bit without affecting the
acoustics, as long as a(t) stays within the stable region (a<a1* or
a>a2*)
– If a(t) crosses the transition region, acoustics change a lot!
[Figure: acoustics x(t) plotted against articulation a(t); a sigmoid with stable regions for a<a1* and a>a2* and a transition region between a1* and a2*.]
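A toy illustration of the stable and transition regions, assuming a logistic function as the articulation-to-acoustics map. The shape, the thresholds a1 = 1 and a2 = 2, and the slope are all arbitrary choices for illustration, not a model from the lecture.

```python
import math

def acoustics(a, a1=1.0, a2=2.0):
    """HYPOTHETICAL sigmoidal map from articulator position a to an acoustic value."""
    mid = 0.5 * (a1 + a2)
    slope = 12.0 / (a2 - a1)
    return 1.0 / (1.0 + math.exp(-slope * (a - mid)))

# The same articulatory movement (0.2 units) in two different places:
stable_change = acoustics(0.2) - acoustics(0.0)      # inside the stable region a < a1
transition_change = acoustics(1.6) - acoustics(1.4)  # across the transition region

print(f"stable region:     {stable_change:.2e}")
print(f"transition region: {transition_change:.2e}")
```

The movement inside the stable region changes the output by a negligible amount, while the same movement across the transition changes it by more than half of its full range.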
The Quantal Theory of Speech Production
(Stevens, 1989)
• The distinction between a>a2* and a<a1* is a robust
distinction
– In the case of P: robust to minor pronunciation variability
– In the case of R: robust also to lots of added noise
– Therefore:
• Put this distinction into Shannon’s communication alphabet in order to
maximize the mutual information I(y,a) between produced speech and
perceived speech.
The Quantal Theory of Speech Production
(Stevens, 1989)
• Hypothesis: Every binary distinctive feature, in every
language of the world, is a distinction (a>a2* vs. a<a1*)
near a sigmoidal nonlinearity of P, or a distinction (x>x2*
vs. x<x1*) near a sigmoidal nonlinearity of R.
• Different languages choose different nonlinearities to
focus on, but the number of useful sigmoids in P and R is
finite, thus the number of distinctive features in the world’s
languages is finite.
Examples
• Feature [anterior]: a nonlinearity of P
– “Alveolar ridge” is a sigmoid-shaped bump in hard palate
– Moving tongue tip back 1cm, over the alveolar ridge, increases front
cavity length by 2cm, causing big change in the front cavity resonance
frequency
– Result: /s/ → /š/, [+anterior] → [−anterior]
• Feature [sonorant]: a nonlinearity of P
– Opening the soft palate by just 2mm during /d/ allows sonorant voicing to
continue throughout closure
– Result: energy increases 20-30dB during closure
– /d/→/n/, [−sonorant] → [+sonorant]
• Feature [back]: a nonlinearity of R
– When |F2-F1|<3 Bark, both formants excite the same neurons, causing the
perception of a single broad formant peak
– When |F2-F1|>3 Bark, 2 distinct formant peaks are perceived
– |F2-F1|<3 Bark: [+back] vowels /a,o,u/
– |F3-F2|<3 Bark: [−low,−back] vowels /i,e/
– All formant peaks distinct: [+low,−back] vowels /æ,ɛ/
What About Unused Sigmoids?
• If a language doesn’t use a particular sigmoid, its
listeners can do one of two things:
1. Learn to ignore that distinction.
2. Use that sigmoid to “enhance” some other distinction.
1. Perceptual Magnet Effect
(Kuhl, 1992)
• Perceptual magnet effect shows that, from infancy, babies learn a
neural map that enhances linguistically useful sigmoids, and smooths
out the less useful sigmoids.
• Experiment: ask listeners to determine whether vowels are “same” or
“different.”
• Result: accuracy is best near the boundary between phonemes in
your own language. Location of boundary is native-language-dependent.
2. Enhancement Features
(Stevens and Keyser, Language, 1989)
• Example: [voiced] in English.
– Based on [voiced] distinction in Latin, which is a
distinction between stops with/without voicing during
closure.
– Long VOT enhances the perceptual sense that a stop
is devoiced; therefore, over some period of history,
[−voiced] stops became [+aspirated].
• Some languages distinguish [+voiced,+aspirated],
[+voiced,−aspirated], [−voiced,+aspirated], and
[−voiced,−aspirated] stops. Such languages could not use
[+aspirated] to enhance the perception of [−voiced].
– In modern English, the “enhancing” feature is so strong
that the “primary” feature (closure voicing) is often
dropped.
Articulator-Bound and Articulator-Free
Features
• Articulator-bound features can only be
implemented by one articulator
– [anterior] is bound to the tongue blade
– [front] is bound to the tongue body
– [voiced] is bound to the vocal folds
– [nasal] is bound to the soft palate
• Articulator-free features can be implemented by
the lips, tongue blade, or tongue body
– Key articulator-free features: [sonorant,continuant]
• [+sonorant,+continuant] = vowel or glide
• [+sonorant,-continuant] = nasal
• [-sonorant,-continuant] = stop
• [-sonorant,+continuant] = fricative
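The four combinations above amount to a two-bit lookup table, sketched here. The example consonants follow the feature values used elsewhere in these slides.

```python
# (sonorant, continuant) -> manner class, exactly as listed on the slide.
MANNER = {
    (True, True): "vowel or glide",
    (True, False): "nasal",
    (False, False): "stop",
    (False, True): "fricative",
}

def manner(sonorant: bool, continuant: bool) -> str:
    return MANNER[(sonorant, continuant)]

print(manner(sonorant=False, continuant=False))  # stop, e.g. /d/
print(manner(sonorant=True, continuant=False))   # nasal, e.g. /n/
print(manner(sonorant=False, continuant=True))   # fricative, e.g. /s/
```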
Landmarks
• “Primary articulator” of a consonant is the
articulator that implements [−sonorant] or
[−continuant] for that phoneme
– Lips, tongue blade, or tongue body
– “Implements [−continuant]” = articulator closes
completely
• At the moment when the primary articulator
closes, there is a BIG change in the acoustics
– [+sonorant]→[−sonorant]: 10dB at low freqs
– [+continuant]→[−continuant]: 10dB at high freqs
• This particular nonlinear change is called an
“acoustic landmark”
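As a rough sketch of the idea, a landmark candidate can be flagged wherever the energy in a frequency band jumps by about 10 dB between frames. This is a toy detector under stated assumptions: the frame energies below are made-up numbers, and the 10 dB threshold is simply borrowed from the figures on this slide. A practical detector would need smoothing and several frequency bands.

```python
def find_landmarks(energy_db, threshold_db=10.0, lag=1):
    """Return (frame index, change in dB) wherever band energy changes by
    at least threshold_db between frames that are `lag` frames apart."""
    landmarks = []
    for t in range(lag, len(energy_db)):
        change = energy_db[t] - energy_db[t - lag]
        if abs(change) >= threshold_db:
            landmarks.append((t, change))
    return landmarks

# HYPOTHETICAL low-frequency band energy (dB) per frame: vowel, closure, vowel.
low_band = [60, 61, 60, 45, 44, 45, 46, 62, 61]
print(find_landmarks(low_band))  # [(3, -15), (7, 16)]
```

The negative jump marks the closure and the positive jump marks the release.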
Summary
• Brief history:
– Writing: ~5000 years, Phonemes: ~3000 years
– Manner-place notation: ~500 years, Binary encoding: ~50 years
• Speech perception:
– Errors in the perception of different distinctive features are independent
– … even though the acoustic correlates of different distinctive features are
NOT independent.
• Speech production:
– Phonemic relationships among morphologically related words can be
explained using distinctive features
– Inter-word phonology is simple but universal: spreading of manner or
place features on autosegmental tiers
• Quantal theory:
– Distinctive features arise from sigmoids in the articulation-to-acoustics mapping P and the acoustics-to-perception mapping R
– Any language chooses a subset of sigmoids
– Other sigmoids are ignored, or else used for feature enhancement
– A very important sigmoid: primary articulator closure produces landmarks