Phonology and the art of Automatic Speech Recognition
Mark Hasegawa-Johnson
ECE Department and the Beckman Institute for Advanced Science and Technology
University of Illinois, Urbana-Champaign, Illinois, USA
Outline
• A Brief History of Ideas
• The Prosodic Hierarchy
• The Utterance
– End-of-Turn Detection
• The Intonational Phrase
– Prosody-Dependent Speech Recognition
• The Word
– Articulatory Phonology Models of Coarticulation
• The Syllable
– Landmark-Based Speech Recognition
– Audiovisual Speech Recognition
• Conclusions
A Brief History of Ideas: Global
Mechanics
Science: 1687 (Newton’s Principia)
Technology: 1825 (Stockton & Darlington Railway opens)
Human Benefits: 1850 (World per capita GDP, $800 in 2005 dollars, annual growth rate rises to 1%)
Electricity and Magnetism
Science: 1745 (van Musschenbroek invents Leyden jar)
Technology: 1876 (Bell invents telephone)
Human Benefits: 1950 (World per capita GDP, $2100, annual growth rate rises to 3%)
Spoken Communication
Science: 1867 (Bell proposes “Universal Alphabetic”)
Technology: 1978 (TI sells the “Speak and Spell”)
Human Benefits: 2045 (language-independent markets for capital and intellectual talent drive the world per capita GDP, $30000, to a growth rate above 4% annually)
A Brief History of Ideas: Local
The Prosodic Hierarchy
Based on ideas of Selkirk, 1981; Nespor and Vogel, 1986
End-of-Turn Detection
Reported research was performed by Kyle Gorman advised by Cole, Fleck, and Hasegawa-Johnson
Prosody-Dependent Speech Recognition
Based on ideas of Ostendorf, Byrne, Shriberg, Talkin, Waibel et al., 1996
Reported research was performed by Ken Chen, Sarah Borys, and Sung-Suk Kim, advised by Cole and Hasegawa-Johnson
Landmark-Based Speech Recognition
Based on ideas of Stevens, Manuel, Shattuck-Hufnagel, and Liu, 1992
Reported research was performed by Sarah Borys and Karen Livescu, advised by Niyogi, Glass, Espy-Wilson, and Hasegawa-Johnson
Audiovisual Speech Recognition
Based on the algorithms of Chu and Huang, 2001
Reported research was performed by Ming Liu, Kate Saenko, Partha Lal, Mark Hasegawa-Johnson, Karen Livescu, and Özgür Çetin
The Prosodic Hierarchy
C1: Utterance
C2: Intonational Phrase
C3: Intermediate Phrase
C4: Prosodic Word
C5: Foot
C6: Syllable
[Figure: prosodic tree over the example, syllabified as: Wan-ted Chief Jus-tice of the Ma-ssa-chu-setts Su-preme Court]
1. Layered Constituents: Ci can only dominate Ck for k > i
2. Headed Constituents: Each Ci dominates at least one Ck
3. Non-Recursive Layering: No Ci dominates a Ci
4. Exhaustive Layering: No Ci dominates a Ci+2
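The four layering constraints can be checked mechanically on a prosodic tree. The sketch below is illustrative only: the `Node` class and `violations` function are hypothetical names, "dominates" is read as "immediately dominates", and headedness is read in its usual strict form (each Ci has at least one Ci+1 child), which is an assumption about the intended k.

```python
from dataclasses import dataclass, field

SYLLABLE_LEVEL = 6  # C6 (Syllable) is the lowest level, so it needs no head


@dataclass
class Node:
    level: int                      # i in "Ci": 1 = Utterance ... 6 = Syllable
    label: str = ""
    children: list = field(default_factory=list)


def violations(node):
    """Return a list of layering violations found at or below `node`."""
    found = []
    child_levels = [c.level for c in node.children]
    for k in child_levels:
        if k < node.level:
            found.append(f"layering: C{node.level} dominates C{k}")
        elif k == node.level:
            found.append(f"recursion: C{node.level} dominates C{k}")
        elif k >= node.level + 2:
            found.append(f"exhaustivity: C{node.level} dominates C{k}")
    # Assumed reading of headedness: at least one child exactly one level down.
    if node.level < SYLLABLE_LEVEL and (node.level + 1) not in child_levels:
        found.append(f"headedness: C{node.level} has no C{node.level + 1}")
    for child in node.children:
        found.extend(violations(child))
    return found


# A well-formed foot ("Wan-ted") and a prosodic word that skips the foot level.
good_foot = Node(5, "wanted", [Node(6, "wan"), Node(6, "ted")])
bad_word = Node(4, "of", [Node(6, "of")])
```

Here `violations(good_foot)` is empty, while `violations(bad_word)` reports both an exhaustivity and a headedness violation, because the syllable attaches directly to the prosodic word.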
Prosody: The Units of Articulatory Planning and Perception
Processes bounded within the Utterance
turn-taking cues, e.g., pause, duration, pitch, lexical cues
Processes bounded within the Intonational Phrase
sequencing/stair-stepping of pitch accents
phrase-final pitch effects: declarative fall, question rise, …
Processes bounded within the Intermediate Phrase
phrasal stress / pitch accent
phrase tone
Processes bounded within the Prosodic Word
co-articulation
Processes bounded within the Foot
vowel reduction, lexical stress
Processes bounded within the Syllable
abrupt onset, syllabic nuclear peak, abrupt offset
The Utterance
End-of-Turn Detection ≠ Pause Detection
(Local, Kelly and Wells, 1986)
Prosodic Features on Utterance-Final Word Can be Automatically Detected
(Ferrer, Shriberg and Stolcke, 2002-3)
Final word longer than a typical production of “oven”
“Declarative fall:” pitch falls on utterance-final word
“fire” has increased duration suggesting a possible EOT, but… “it” is very short, and ends abruptly with glottal stop.
Prosodic Features for EOT Detection
(Gorman, Cole, Hasegawa-Johnson and Fleck, LSA 2007)
Pause Features:
– Silence
– Instant-response classifiers: pause duration is truncated at 80, 100, …, 300 ms; results are stable with pause duration truncated at 300 ms
Duration Features:
– Normalized last stressed vowel duration
– Last stressed rhyme duration
– Last rhyme duration
Pitch Features:
– Minimum or median F0, over the last word or the last N frames
– F0 slope of word (continuous), or at boundary (categorical)
Context Features:
– Speaker gender
– Number of words since turn beginning
– Length of previous pause
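A minimal sketch of how a few of these features might be assembled for one candidate end-of-turn. The function names, the dictionary keys, the 10 ms frame step, and the least-squares definition of F0 slope are all assumptions made for illustration; the source does not specify how the features were computed.

```python
import statistics


def truncate_pause(pause_ms, cap_ms=300.0):
    """Pause duration as seen by a classifier that must respond after cap_ms."""
    return min(pause_ms, cap_ms)


def f0_slope(f0_hz, frame_s=0.01):
    """Least-squares slope (Hz per second) of an F0 track with >= 2 frames."""
    n = len(f0_hz)
    t_mean = (n - 1) * frame_s / 2.0
    f_mean = sum(f0_hz) / n
    num = sum((i * frame_s - t_mean) * (f - f_mean) for i, f in enumerate(f0_hz))
    den = sum((i * frame_s - t_mean) ** 2 for i in range(n))
    return num / den


def eot_features(pause_ms, last_rhyme_ms, f0_last_word, words_in_turn):
    """Collect pause, duration, pitch, and context features in one record."""
    return {
        "pause_ms": truncate_pause(pause_ms),
        "last_rhyme_ms": last_rhyme_ms,
        "f0_min": min(f0_last_word),
        "f0_median": statistics.median(f0_last_word),
        "f0_slope": f0_slope(f0_last_word),
        "words_in_turn": words_in_turn,
    }


feats = eot_features(750.0, 180.0, [200.0, 190.0, 180.0, 170.0], 12)
```

With the cap at 300 ms, a 750 ms pause and a 300 ms pause produce the same pause feature, which is what lets the classifier answer without waiting for the pause to end.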
Prosodic Features for EOT Detection
(Gorman, Cole, Hasegawa-Johnson and Fleck, LSA 2007)
Classifier                  Error Rate (Percent)
Chance                      50
Pitch                       48.7
Duration                    69.6
Pause                       10.2
All Features + Context      6.4
Pause+Duration+Context      6.2
Pause+Pitch+Context         5.4
The Intonational Phrase
Intonational Phrases and Pitch Accents
• Tagged Transcription:
Wanted*B4 chief* justice* of the Massachusetts* supreme court*B4
– B4 denotes intonational-phrase-final word
– * denotes pitch accented word
• Data: Boston Radio Speech corpus
– 7 talkers; Professional radio announcers
– About 3.5 hours of speech prosodically transcribed (ToBI = “tones and break indices” notation)
– Largest prosodically transcribed database, but…
– Less than 1% of size of speech recognition training databases: e.g., too small to train triphones
Intonational Phrases and Pitch Accents: a Bayesian Network Model of Speech
(Chen, Hasegawa-Johnson et al., 2003)
[Figure: Bayesian network. Frame Level: X, Y. Segmental Level: Q, H. Word Level: W, P. Higher levels: S, M.]
X: acoustic-phonetic observations
Y: acoustic-prosodic observations
Q: phonemes
H: phone-level prosodic tags
W: words
P: word-level prosodic tags
S: syntax
M: message
Intonational Phrases and Pitch Accents: Prosody-Dependent Speech Recognition
[Figure: Bayesian network with nodes X,Y; Q,H; W,P; S; M]

$$\hat{W} = \arg\max \; p(O \mid Q, H)\, p(Q, H \mid W, P)\, p(W, P)$$

Here O denotes the observations (X, Y).

• Advantages:
– Natural extension of standard prosody-independent speech recognition
– Allows the convenient integration of useful linguistic knowledge at different levels
– Flexible
– Fast
• Disadvantage:
– Requires prosodically-transcribed training data
The Lexicon: p(Q,H|W,P)
• Tagged Transcription:
Wanted*B4 chief* justice* of the Massachusetts* supreme court*B4
• Lexicon:
– Each word has four entries
  • wanted, wanted*, wantedB4, wanted*B4
– IP boundary applies to phones in rhyme of final syllable
  • wantedB4: w aa n t axB4 dB4
– Accent applies to phones in lexically stressed syllable
  • wanted*: w* aa* n* t ax d
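The two tagging rules can be sketched as a small lexicon expander. The function names, the syllabified input format, and the vowel set are hypothetical; the order in which `*` and `B4` are concatenated on a phone that carries both is also an assumption.

```python
def tag_pronunciation(syllables, stressed, accented, ip_final, vowels):
    """Tag phones: '*' on the stressed syllable, 'B4' on the final rhyme.

    syllables: list of syllables, each a list of phone strings
    stressed:  index of the lexically stressed syllable
    Every syllable is assumed to contain at least one phone from `vowels`.
    """
    tagged = []
    last = len(syllables) - 1
    for s, syl in enumerate(syllables):
        # The rhyme starts at the nucleus (first vowel) and runs to syllable end.
        nucleus = next(i for i, ph in enumerate(syl) if ph in vowels)
        for i, ph in enumerate(syl):
            tag = ""
            if accented and s == stressed:
                tag += "*"
            if ip_final and s == last and i >= nucleus:
                tag += "B4"
            tagged.append(ph + tag)
    return tagged


def lexicon_entries(word, syllables, stressed, vowels):
    """Return the four prosody-dependent entries for one word."""
    entries = {}
    for accented in (False, True):
        for ip_final in (False, True):
            name = word + ("*" if accented else "") + ("B4" if ip_final else "")
            entries[name] = tag_pronunciation(
                syllables, stressed, accented, ip_final, vowels)
    return entries


entries = lexicon_entries(
    "wanted", [["w", "aa", "n"], ["t", "ax", "d"]], 0, {"aa", "ax"})
```

For "wanted" this reproduces the two pronunciations listed above: `entries["wantedB4"]` is `w aa n t axB4 dB4` and `entries["wanted*"]` is `w* aa* n* t ax d`.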
The Acoustic Model: p(X,Y|Q,H)
In order to train on such a small database, we propose a Factored Acoustic Model:

$$p(X, Y \mid Q, A, B) = \prod_i p(d_i \mid q_i, b_i) \prod_t p(x_t \mid q_i)\, p(y_t \mid q_i, a_i)$$

where the inner product runs over the frames t of phone i, and
– q_i ∈ {aa, ae, b, d, …} is the prosody-independent phone label
– a_i ∈ {Accented, Unaccented} is the pitch accent type
– b_i ∈ {Final, Nonfinal} is the intonational phrase position
– d_i = duration of phone q_i (explicitly modeled)
– x_t = standard speech recognition features (MFCC)
– y_t = nonlinearly transformed pitch features
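The factorization can be read as a sum of three kinds of log-probability terms, each conditioned on only part of the prosodic label. The sketch below assumes the three component models are supplied as callables returning log-probabilities and that duration is counted in frames; the function name and segment format are hypothetical.

```python
def factored_log_likelihood(segments, log_p_dur, log_p_x, log_p_y):
    """Sum log p(d|q,b) + sum_t [log p(x_t|q) + log p(y_t|q,a)] over phones.

    segments: list of dicts with keys
      'q' phone label, 'a' accent type, 'b' phrase position,
      'x' list of spectral frames, 'y' list of pitch frames (same length)
    """
    total = 0.0
    for seg in segments:
        q, a, b = seg["q"], seg["a"], seg["b"]
        duration = len(seg["x"])              # d_i, measured in frames
        total += log_p_dur(duration, q, b)    # duration sees phrase position only
        for x_t, y_t in zip(seg["x"], seg["y"]):
            total += log_p_x(x_t, q)          # spectrum ignores prosody
            total += log_p_y(y_t, q, a)       # pitch sees accent only
    return total


# Toy component models with constant scores, just to exercise the bookkeeping.
segments = [
    {"q": "aa", "a": "Accented", "b": "Nonfinal", "x": [0, 0], "y": [0, 0]},
    {"q": "d", "a": "Unaccented", "b": "Final", "x": [0, 0, 0], "y": [0, 0, 0]},
]
score = factored_log_likelihood(
    segments,
    log_p_dur=lambda d, q, b: -1.0,
    log_p_x=lambda x, q: -2.0,
    log_p_y=lambda y, q, a: -3.0,
)
```

Because the spectral model p(x_t | q_i) never sees the accent or boundary tag, its parameters are shared across all four prosodic variants of a phone, which is the point of factoring when training data is scarce.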
Acoustic-Prosodic Observations: yt = ANN(lnf0(t-5),…,lnf0(t+5))
Explicit Duration HMM: Phrase-Final vs. Non-Final Duration Histograms
/a/ phrase-medial and phrase-final
/c/ phrase-medial and phrase-final
A Factored Language Model
[Figure: two graphical models. Unfactored: nodes (pi-1, wi-1) and (pi, wi). Factored: separate nodes pi-1, wi-1, pi, wi, plus syntax nodes si-1, si]
Prosodically tagged words: cats* climb trees*%
1. Unfactored: Prosody and word string jointly modeled:
   p( trees*% | cats* climb )
2. Factored:
   – Prosody depends on syntax: p( w*% | N V N, w* w )
   – Syntax depends on words: p( N V N | cats climb trees )
Result: Syntactic Mediation of Prosody Reduces Perplexity and WER
[Figure: factored model graph with nodes pi-1, wi-1, pi, wi, si-1, si]
Factored Model:
– Reduces Perplexity by 35%
– Reduces WER by 4%
Syntactic Tags:
– For pitch accent: POS sufficient
– For IP boundary: Parse information useful if available
[Bar chart: Accent Prediction Error and Boundary Prediction Error (vertical axis 0 to 45) for three predictors: Chance, POS, POS + Phrase]
Results: Word Error Rate (Boston Radio Speech Corpus)
[Bar chart: Word Error Rate (vertical axis 20 to 25) for four systems: Baseline, PD Acoustic, PD Language, PD Both]
The Word
Coarticulation: the Gestural Phonology Model
(Browman and Goldstein, 1992)
“everybody” → “erwodi”
[Figure: gestural score plotted against time on three tiers (Lips, Tongue, Glottis), with gestures labeled by the phones /eh/, /v/, /r/, /iy/, /b/, /ah/, /d/]
Simplified Gestural Phonology: Coarticulation = Lazy Articulators
(Livescu and Glass, 2004)
Pronunciation Model: Dynamic Bayesian Network (DBN) with Partially Asynchronous Articulators
(Livescu and Glass, 2004)
• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i;j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i
Articulatory Feature DBN Experiments
• Background:
– [Livescu and Glass, 2004]:
• AF model predicts mapping from canonical pronunciations to human transcriptions better than HMM
– [Saenko and Livescu, 2006]:
• AF model recognizes visible speech better than HMM
• To be presented today:
– AF model as part of a landmark-based speech recognizer: [Hasegawa-Johnson, …, Livescu, et al., WS04]
– AF model for audiovisual speech recognition: [Livescu, …, Hasegawa-Johnson, et al., WS06]
The Syllable: Landmark-Based Speech Recognition
What are Landmarks?
• Instants of rapid spectral change (dX/dt).
• Instants of high spectral entropy (H(X(t) | X(t-1))).
• Instants of high mutual information between phoneme and signal (I(q; X(t-d),…,X(t+d))).
Where do these things happen?
• Consonant closures (fricative, stop, nasal)
• Consonant releases (fricative, stop, nasal)
• Syllable nuclei
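The first definition (rapid spectral change) is the easiest to illustrate. The sketch below is a toy detector, not the SVM-based detectors reported later: it approximates dX/dt by the Euclidean distance between consecutive spectral frames and marks local maxima above a threshold. The function names and the threshold are hypothetical.

```python
import math


def spectral_change(frames):
    """Per-frame rate of change: distance from the previous spectral frame."""
    rate = [0.0]  # no previous frame for t = 0
    for prev, cur in zip(frames, frames[1:]):
        rate.append(math.dist(prev, cur))
    return rate


def pick_landmarks(rate, threshold):
    """Indices of local maxima of `rate` that reach `threshold`."""
    landmarks = []
    for t in range(1, len(rate) - 1):
        if rate[t] >= threshold and rate[t] > rate[t - 1] and rate[t] >= rate[t + 1]:
            landmarks.append(t)
    return landmarks


# Five "closure" frames followed by five "release" frames: one abrupt change.
frames = [[0.0, 0.0, 0.0]] * 5 + [[1.0, 1.0, 1.0]] * 5
rate = spectral_change(frames)
landmarks = pick_landmarks(rate, threshold=1.0)
```

On this toy input the only landmark is frame 5, the first frame after the abrupt change, which mimics a consonant release.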
Landmark-Based Speech Recognition
MAP transcription: … backed up …
[Figure: Syllable Structure (ONSET, NUCLEUS, CODA) for each syllable of the hypothesis]
Search Space: … buck up … big dope … backed up … bagged up … big doowop …
Stop Detection using Support Vector Machines
(Niyogi, Ramesh, and Burges, 1999, 2002)
False Acceptance vs. False Rejection Errors per 10ms frame, Four Types of Stop Detectors
(1) Delta-Energy (“Deriv”): Equal Error Rate = 0.2%
(2) HMM (*): False Rejection Error=0.3%
(3) Linear SVM: Equal Error Rate = 0.15%
(4) RBF SVM: Equal Error Rate = 0.13%
Two Types of SVMs: Landmark Detectors, p(landmark(t)), and Landmark Classifiers, p(place-features(t) | landmark(t))
2000-dimensional acoustic feature vector → SVM → Discriminant yi(t) → Sigmoid or Histogram → Posterior probability of distinctive feature, p(di(t)=1 | yi(t))
Acoustic Feature Vector: Local Cepstrogram, Formants, Auditory Modeling Features Covering +/-70ms
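The last stage of this pipeline maps a real-valued SVM discriminant to a posterior probability. The sketch below shows both options in a generic form: a logistic sigmoid with two parameters and a histogram estimate from labeled held-out scores. The parameter names `a` and `b`, the bin edges, and the fallback value of 0.5 for an empty bin are assumptions, not details from the source.

```python
import math
from bisect import bisect_right


def sigmoid_posterior(y, a=1.0, b=0.0):
    """p(d=1 | y) = 1 / (1 + exp(-(a*y + b))), computed without overflow."""
    z = a * y + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def histogram_posterior(y, edges, scores, labels):
    """Fraction of positive labels among held-out scores in the same bin as y.

    edges:  sorted bin boundaries
    scores: held-out SVM discriminant values
    labels: 1 if the distinctive feature was present, else 0
    """
    k = bisect_right(edges, y)
    in_bin = [l for s, l in zip(scores, labels) if bisect_right(edges, s) == k]
    return sum(in_bin) / len(in_bin) if in_bin else 0.5


held_out_scores = [-1.0, -0.5, 0.2, 0.7]
held_out_labels = [0, 0, 1, 1]
p_sig = sigmoid_posterior(0.0)
p_hist = histogram_posterior(0.5, [0.0], held_out_scores, held_out_labels)
```

The sigmoid assumes a monotonic relation between discriminant and posterior; the histogram makes no such assumption but needs enough held-out data per bin.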
SVM Training: Accuracy, Per Frame, in Percent (Chance=50%)
Train:               NTIMIT            NTIMIT            Switchboard
Test:                NTIMIT            Switchboard       Switchboard
Kernel:              Linear   RBF      Linear   RBF      Linear   RBF
speech onset         95.1     96.2     71.4     62.2     81.6     81.6
speech offset        79.6     88.5     65.3     78.6     68.4     83.7
consonant onset      94.5     95.5     70.3     72.7     95.8     97.7
consonant offset     91.7     93.7     80.3     86.2     92.8     96.8
continuant onset     89.4     94.1     69.1     81.9     86.2     92.0
continuant offset    90.8     94.9     69.3     68.8     89.6     94.3
sonorant onset       95.6     97.2     85.2     86.5     96.3     96.3
sonorant offset      95.3     96.4     75.6     75.2     95.2     96.4
syllabic onset       90.7     95.2     69.5     78.9     87.9     92.6
syllabic offset      90.1     88.9     54.4     60.8     88.2     89.7
SVM/HMM Hybrid
(Borys and Hasegawa-Johnson, ICSLP 2005)
• 10 landmark-detection SVMs
• 23 landmark-classification SVMs
• Acoustic features: MFCC+d+dd, formant freqs+amps
• HMM baseline speech recognizer: 3 states per phone, constrained only by a phoneme bigram
• Raw real-valued SVM discriminant output fed to HMM, modeled there using mixture Gaussian PDFs, as in the “tandem” NN/HMM hybrid (Ellis et al., 2000)

Features        Phone Error Rate
MFCCs+d+dd      63.9%
SVM Tandem      62.7%
DBN-SVM Model of Pronunciation Variability
(Hasegawa-Johnson, Baker, …, Livescu et al., WS04, ICASSP 2005)
[Figure: DBN-SVM model with layers Word (LIKE, A), Canonical Form, Surface Form, Manner, Place, and SVM Outputs. Tongue settings shown include front, mid, open, closed, and semi-closed; manner and place labels shown include Palatal, Glide, and Front Vowel]
SVM Outputs: p( gPGR(x) | palatal glide release ), p( gGR(x) | glide release )
x: Multi-Frame Observation including Spectrum, Formants, & Auditory Model
SVM/DBN Hybrid: Design Decisions
• SVM Applied: When should SVM supply place feats to the DBN?
  – Landmarks: Only at SVM-detected landmarks
  – Frames: In every frame
• Use Place?: In frame-based hybrid, how should place features be used?
  – Always: use place features for segmentation and recognition
  – Recognition: use place features only for recognition
  – Selective: use only the high-accuracy place features; ignore others
• Probs: How should SVM information be passed to the DBN?
  – Posterior: SVM output converted to a posterior probability
  – Likelihood: Posterior is normalized to estimate a pseudo-likelihood
• DBN Training: How should DBN be trained?
  – Manual: Using manual landmark transcriptions (ICSI Switchboard)
  – SVMs: Using SVM-detected landmarks
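The "Likelihood" option can be illustrated with the usual hybrid-system normalization, in which a posterior is divided by the class prior to give a quantity proportional to the likelihood. Whether this is exactly the normalization used in the reported system is an assumption; the function name and the probability floor are hypothetical.

```python
import math


def pseudo_log_likelihood(posterior, prior, floor=1e-10):
    """log[ p(class | x) / p(class) ], proportional to log p(x | class).

    By Bayes' rule p(x | class) = p(class | x) p(x) / p(class); the p(x)
    term is the same for every class, so it can be dropped when comparing.
    """
    clipped = min(max(posterior, floor), 1.0)  # avoid log(0)
    return math.log(clipped) - math.log(prior)


# A feature with prior 0.1 that the SVM now believes with probability 0.9.
score = pseudo_log_likelihood(0.9, 0.1)
```

Dividing by the prior keeps rare features from being penalized twice, once by the SVM posterior and once by the DBN's own prior over articulatory states.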
Landmark-Based Speech Recognizer used to Rescore the 2003 SRI Decipher System
(Hasegawa-Johnson, Baker, …, Livescu, et al., WS04, ICASSP 2005)
For each word hypothesis generated by the SRI Decipher Speech Recognizer:
– SVM probabilities computed during word hypothesis, input to DBN
– DBN computes a score S P(word | evidence)
– Final edge score is a weighted interpolation of first-pass speech recognizer scores, together with the DBN score
SVM Applied   Use Place?    Probs        DBN Training   3-speaker WER (# errors)
Baseline: SRI Decipher Recognizer as of Fall 2003       27.7 (550)
Landmarks     –             Likelihood   Manual         27.6 (549)
Landmarks     –             Posterior    Manual         27.3 (543)
Landmarks     –             Posterior    SVM            27.3 (543)
Frames        Always        Posterior    SVM            >100%
Frames        Recognition   Posterior    SVM            27.3 (542)
Frames        Selective     Posterior    SVM            27.2 (541)
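The rescoring step is a weighted interpolation of scores per word hypothesis. The sketch below assumes log-domain scores and hand-set weights; the function names, the two first-pass score types, and all numeric values are hypothetical, since the source does not list the weights or score types.

```python
def rescore_edge(first_pass_scores, dbn_score, weights, dbn_weight):
    """Weighted sum of first-pass scores plus a weighted DBN score."""
    base = sum(w * s for w, s in zip(weights, first_pass_scores))
    return base + dbn_weight * dbn_score


def best_edge(edges, weights, dbn_weight):
    """Pick the word hypothesis with the highest interpolated score.

    edges: list of (word, first_pass_scores, dbn_score) tuples
    """
    return max(
        edges,
        key=lambda e: rescore_edge(e[1], e[2], weights, dbn_weight),
    )[0]


# Two competing hypotheses with (acoustic, language-model) log scores.
edges = [
    ("backed", [-10.0, -2.0], -4.0),
    ("bagged", [-10.0, -1.0], -8.0),
]
winner = best_edge(edges, weights=[1.0, 0.5], dbn_weight=0.25)
```

In this toy case the first-pass scores alone slightly prefer "bagged", and the DBN score tips the decision to "backed", which is the kind of change a rescoring pass is meant to produce.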
The Syllable: Audiovisual Speech
Why Use Visual Information?
Visuals are unaffected by noise
Human listeners use visuals in quiet:
– McGurk and MacDonald, 1976: listeners are unable to hear a /b/ if they see lips that stay open
Human listeners use visuals in noise:
– Sumby and Pollack, 1954: visible talker improves intelligibility in noise
– Callan et al., 1997: pre-motor cortex activates when listening to speech at low SNR
Two Audiovisual Corpora
AVICAR
– Collected at University of Illinois
– 100 talkers (largest free AVSR database?)
– Read digits, digit strings, letters, sentences
– Naturalistic & Variable Lighting: Moving Car
– Naturalistic & Variable Noise: Wind, Cars, …
CUAVE
– Collected at Clemson University
– 35 talkers
– Read digits & digit strings
– Controlled Lighting: Studio w/Green Screen
– Controlled Noise: Electrically Added
AVICAR Recording Hardware
(Lee, Hasegawa-Johnson et al., 2004)
4 Cameras, Glare Shields, Adjustable Mounting. Best Place = Dashboard.
8 Mics, Pre-amps, Wooden Baffle. Best Place = Sunvisor.
System is not permanently installed; mounting requires 10 minutes.
Video: Face & Lip Tracking
Visual-Only Recognition Results
Isolated Digits
– Video Features: Normalized DCT of the lip image
– Standard HMM speech recognizer; number of states per digit depends on number of phonemes
– Speaker-independent training, speaker-adapted recognition
Recognition results (WER, percent)
– AVICAR: about 80% WER
– CUAVE: about 60% WER
– For comparison, other results reported for isolated digit recognition using controlled lighting, e.g., Chu and Huang 2002: about 60% WER
CUAVE Experiments
(Livescu, …, Lal, Hasegawa-Johnson et al., WS06)
• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup
  – Training on clean data, number of Gaussians tuned on clean dev set
  – Audio/video weights tuned on noise-specific dev sets
  – Uniform (“zero-gram”) language model
  – Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for initial baselines, audio observations
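Adding noise at a chosen SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The sketch below is one common way to do this; the function name is hypothetical, the signals are assumed to be equal-length lists of samples, and the source does not describe how its noisy data was actually generated.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech + g * noise, with g chosen to give the requested SNR."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 log10(p_speech / (g^2 p_noise))  =>  solve for g
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
babble = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, babble, snr_db=6.0)

# Recover the achieved SNR from the residual to confirm the scaling.
residual = [m - s for m, s in zip(noisy, speech)]
achieved_db = 10.0 * math.log10(
    (sum(s * s for s in speech) / len(speech))
    / (sum(r * r for r in residual) / len(residual))
)
```

Power is measured over the whole utterance here; measuring it only over speech-active frames would give a different, usually lower, effective SNR.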
Audio-only DBN Speech Recognizer
[Figure: DBN with variables subWordStateAudio, stateTransitionAudio, phoneStateAudio, obsAudio]
Video-only DBN Speech Recognizer
[Figure: DBN with variables subWordStateVideo, stateTransitionVideo, phoneStateVideo, obsVideo]
Audiovisual DBN with No Asynchrony
[Figure: DBN with shared variables subWordState, stateTransition, phoneState, and two observation variables, obsAudio and obsVideo]
Articulatory Asynchrony
For example, tongue touches the teeth before acoustic speech onset in the word “three;” lips are already round in anticipation of the /r/.
Coupled HMM Model of A/V Asynchrony
(based on Chu and Huang, 2002)
[Figure: coupled DBN with an audio chain (subWordStateAudio, stateTransitionAudio, phoneStateAudio, obsAudio) and a video chain (subWordStateVideo, stateTransitionVideo, phoneStateVideo, obsVideo)]
Asynchrony in Gestural Phonology
(Browman and Goldstein, 1992)
“three”
[Figure: gestural score for “three” plotted against time.
Lips: Round, Spread.
Tongue: Dental Critical, Retroflex Narrow, Palatal Narrow.
Glottis: Unvoiced, Voiced.]
CHMM Model of Lip-Tongue-Glottis Asynchrony
[Figure: coupled DBN with three chains, Glottis (subWordStateGlottis, stateTransitionGlottis, phoneStateGlottis), Tongue (subWordStateTongue, stateTransitionTongue, phoneStateTongue), and Lips (subWordStateLips, stateTransitionLips, phoneStateLips), generating obsAudio and obsVideo]
Results, Q1: Should we use video? Answer: Yes. Fusion WER < Single-stream WER
(Novelty: None. Many authors have reported this.)
[Bar chart: WER (vertical axis 0 to 90) at CLEAN and at SNR 12dB, 10dB, 6dB, 4dB, -4dB, for Audio, Video, and Audiovisual recognizers]
Results, Q2: Should the streams be asynchronous? Answer: Yes. Async WER < Sync WER (30% relative at mid SNRs)
(Novelty: First phone-based AVSR with inter-phone asynchrony.)
[Bar chart: WER (vertical axis 0 to 70) at CLEAN and at SNR 12dB, 10dB, 6dB, 4dB, -4dB, for four models: No Asynchrony, 1 State Async, 2 States Async, Unlimited Async]
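The four conditions compared here differ only in how far the audio and video state sequences may drift apart. A minimal sketch of that constraint, assuming both streams step through the same left-to-right sub-word state indices within a word (the function name and indexing are hypothetical):

```python
def allowed_joint_states(n_states, max_async=None):
    """Joint (audio, video) states whose indices differ by at most max_async.

    max_async = 0     -> fully synchronous streams
    max_async = 1, 2  -> bounded asynchrony
    max_async = None  -> unlimited asynchrony within the word
    """
    return [
        (a, v)
        for a in range(n_states)
        for v in range(n_states)
        if max_async is None or abs(a - v) <= max_async
    ]


sync = allowed_joint_states(3, max_async=0)
one_state = allowed_joint_states(3, max_async=1)
unlimited = allowed_joint_states(3)
```

For a three-state word the joint state space grows from 3 states (synchronous) to 7 (one state of asynchrony) to 9 (unlimited), so the asynchrony limit trades modeling freedom against search cost.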
Results, Q3: Should we model Articulatory or A/V Asynchrony? Answer: It doesn’t matter. Articulatory feature WER = Phoneme-viseme WER.
(Novelty: First articulatory feature model for AVSR.)
[Bar chart: WER (vertical axis 0 to 80) at Clean and at SNR 12dB, 10dB, 6dB, 4dB, -4dB, for Phone-viseme and Articulatory features models]
Results, Q4: Are A/V and Articulatory Models Identical? Answer: No. Best WER: A/V and Articulatory Recognizers Vote
[Bar chart: Word Error Rate on Development Test Data, Averaged across SNRs (vertical axis 17 to 23), for five systems: Voting, Best Three w/ Artic; Voting, Best Three w/o Artic; A/V CHMM, 2 States Async; Articulatory Features CHMM; A/V CHMM, 1 State Async]
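The source does not say how the recognizers vote. Because decoding was constrained to fixed-length 10-word utterances, one simple possibility is position-by-position majority voting, sketched below. The function name and the tie-breaking rule (prefer the earliest-listed system) are assumptions.

```python
from collections import Counter


def vote(hypotheses):
    """Position-wise majority vote over equal-length word sequences.

    hypotheses: list of word lists, one per recognizer, ordered so that
    ties are broken in favor of the earliest-listed system.
    """
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("all hypotheses must have the same number of words")
    result = []
    for words in zip(*hypotheses):
        counts = Counter(words)
        top = max(counts.values())
        # First word, in system order, that reaches the top count.
        result.append(next(w for w in words if counts[w] == top))
    return result


combined = vote([
    ["one", "two", "three"],   # e.g., A/V CHMM, 2 states async
    ["one", "six", "three"],   # e.g., A/V CHMM, 1 state async
    ["nine", "two", "five"],   # e.g., articulatory-feature CHMM
])
```

Voting helps only when the systems make different errors, which is why a non-identical articulatory model can improve the combination even though its own WER matches the phone-viseme model.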
Conclusions
• Prosodic cues (duration, pause, pitch) can reliably detect a talker’s end-of-turn
• Simultaneous recognition of words and their prosody can reduce WER
• SVMs can be trained to extract the dynamic spectral features present in the onset, nucleus, and coda of a syllable
• DBN models of articulatory phonology
– can predict unusual pronunciation variants, and
– can model the inter-articulator asynchrony that causes apparent “audio-visual” asynchrony