Phonology and the art of Automatic Speech Recognition

Shared by: Juan Agui
Posted: 4/28/2009
Mark Hasegawa-Johnson
ECE Department and the Beckman Institute for Advanced Science and Technology
University of Illinois
Urbana-Champaign, Illinois, USA

Outline
• A Brief History of Ideas
• The Prosodic Hierarchy
• The Utterance
  – End-of-Turn Detection
• The Intonational Phrase
  – Prosody-Dependent Speech Recognition
• The Word
  – Articulatory Phonology Models of Coarticulation
• The Syllable
  – Landmark-Based Speech Recognition
  – Audiovisual Speech Recognition
• Conclusions

A Brief History of Ideas: Global
• Mechanics
  – Science: 1687 (Newton’s Principia)
  – Technology: 1825 (Stockton & Darlington Railroad opens)
  – Human Benefits: 1850 (World per capita GDP, $800 in 2005 dollars, annual growth rate rises to 1%)
• Electricity and Magnetism
  – Science: 1745 (van Musschenbroek invents Leyden jar)
  – Technology: 1876 (Bell invents telephone)
  – Human Benefits: 1950 (World per capita GDP, $2100, annual growth rate rises to 3%)
• Spoken Communication
  – Science: 1867 (Bell proposes “Universal Alphabetic”)
  – Technology: 1978 (TI sells the “Speak and Spell”)
  – Human Benefits: 2045 (language-independent markets for capital and intellectual talent drive the world per capita GDP, $30000, to a growth rate above 4% annually)

A Brief History of Ideas: Local
• The Prosodic Hierarchy
  – Based on ideas of Selkirk, 1981; Nespor and Vogel, 1986
• End-of-Turn Detection
  – Reported research was performed by Kyle Gorman, advised by Cole, Fleck, and Hasegawa-Johnson
• Prosody-Dependent Speech Recognition
  – Based on ideas of Ostendorf, Byrne, Shriberg, Talkin, Waibel et al., 1996
  – Reported research was performed by Ken Chen, Sarah Borys, and Sung-Suk Kim, advised by Cole and Hasegawa-Johnson
• Landmark-Based Speech Recognition
  – Based on ideas of Stevens, Manuel, Shattuck-Hufnagel, and Liu, 1992
  – Reported research was performed by Sarah Borys and Karen Livescu, advised by Niyogi, Glass, Espy-Wilson, and Hasegawa-Johnson
• Audiovisual Speech Recognition
  – Based on the algorithms of Chu and Huang, 2001
  – Reported research was performed by Ming Liu, Kate Saenko, Partha Lal, Mark Hasegawa-Johnson, Karen Livescu, Özgür Çetin

The Prosodic Hierarchy
C1: Utterance
C2: Intonational Phrase
C3: Intermediate Phrase
C4: Prosodic Word
C5: Foot
C6: Syllable
Example, divided into syllables: Wan ted Chief Jus tice of the Ma ssa chu setts Su preme Court
1. Layered Constituents: Ci can only dominate Ck for k > i
2. Headed Constituents: Each Ci dominates at least one Ck
3. Non-Recursive Layering: No Ci dominates a Ci
4. Exhaustive Layering: No Ci dominates a Ci+2

Prosody: The Units of Articulatory Planning and Perception
• Processes bounded within the Utterance
  – turn-taking cues, e.g., pause, duration, pitch, lexical cues
• Processes bounded within the Intonational Phrase
  – sequencing/stair-stepping of pitch accents
  – phrase-final pitch effects: declarative fall, question rise, …
• Processes bounded within the Intermediate Phrase
  – phrasal stress/pitch accent
  – phrase tone
• Processes bounded within the Prosodic Word
  – co-articulation
• Processes bounded within the Foot
  – vowel reduction, lexical stress
• Processes bounded within the Syllable
  – abrupt onset, syllabic nuclear peak, abrupt offset

The Utterance

End-of-Turn Detection ≠ Pause Detection (Local, Kelly and Wells, 1986)

Prosodic Features on Utterance-Final Word Can be Automatically Detected (Ferrer, Shriberg and Stolcke, 2002-3)
[Figure annotations, first example:]
• Final word longer than a typical production of “oven”
• “Declarative fall”: pitch falls on utterance-final word
[Figure annotations, second example:]
• “fire” has increased duration, suggesting a possible EOT, but…
• “it” is very short, and ends abruptly with a glottal stop.
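The two cues annotated in the examples above (a lengthened final word and a falling pitch on it) can be made concrete with a small sketch. This is an illustration only, not the feature extraction used in the cited work: the function names, the 10 ms frame step, and the use of a plain least-squares line for the pitch slope are all assumptions.

```python
def f0_slope(f0_hz, frame_s=0.01):
    """Least-squares slope (Hz per second) of an F0 track.

    f0_hz holds one value per frame; unvoiced frames are None and are skipped.
    frame_s is an assumed frame step (10 ms), not a value from the slides.
    """
    pts = [(i * frame_s, f) for i, f in enumerate(f0_hz) if f is not None]
    if len(pts) < 2:
        return 0.0  # not enough voiced frames to estimate a slope
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    return num / den


def eot_cues(word_dur_s, typical_dur_s, f0_hz, frame_s=0.01):
    """Summarize duration and pitch cues on a candidate turn-final word.

    typical_dur_s is a hypothetical reference duration for the same word.
    The returned flags are cues only; they do not decide end-of-turn.
    """
    ratio = word_dur_s / typical_dur_s
    slope = f0_slope(f0_hz, frame_s)
    return {
        "duration_ratio": ratio,
        "f0_slope": slope,
        "lengthened": ratio > 1.0,
        "falling": slope < 0.0,
    }
```

As the “fire … it” example shows, a lengthened word alone only suggests a possible end of turn, so these values are better treated as classifier inputs than as a decision rule.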
Prosodic Features for EOT Detection (Gorman, Cole, Hasegawa-Johnson and Fleck, LSA 2007)
• Pause Features:
  – Silence
  – Instant-response classifiers: pause duration truncated at 80, 100, …, 300 ms; results are stable with pause duration truncated at 300 ms
• Duration Features:
  – Normalized last stressed vowel duration
  – Last stressed rhyme duration
  – Last rhyme duration
• Pitch Features:
  – Minimum or median, last word or last N frames
  – F0 slope of word (continuous), or at boundary (categorical)
• Context Features:
  – Speaker gender
  – Number of words since turn beginning
  – Length of previous pause

Prosodic Features for EOT Detection (Gorman, Cole, Hasegawa-Johnson and Fleck, LSA 2007)

Classifier                  Error Rate (Percent)
Chance                      50
Pitch                       48.7
Duration                    69.6
Pause                       10.2
All Features + Context      6.4
Pause+Duration+Context      6.2
Pause+Pitch+Context         5.4

The Intonational Phrase

Intonational Phrases and Pitch Accents
• Tagged Transcription: Wanted*B4 chief* justice* of the Massachusetts* supreme court*B4
  – B4 denotes intonational-phrase-final word
  – * denotes pitch accented word
• Data: Boston Radio Speech corpus
  – 7 talkers; professional radio announcers
  – About 3.5 hours of speech prosodically transcribed (ToBI = “tones and break indices” notation)
  – Largest prosodically transcribed database, but…
  – Less than 1% of size of speech recognition training databases: e.g., too small to train triphones

Intonational Phrases and Pitch Accents: a Bayesian Network Model of Speech (Chen, Hasegawa-Johnson et al., 2003)
[Figure: Bayesian network with a frame level (X, Y), a segmental level (Q, H), and a word level (W, P, S, M)]
X: acoustic-phonetic observations
Y: acoustic-prosodic observations
Q: phonemes
H: phone-level prosodic tags
W: words
P: word-level prosodic tags
S: syntax
M: message

Intonational Phrases and Pitch Accents: Prosody-Dependent Speech Recognition
[Figure: network with nodes X,Y; Q,H; W,P; S; M]
$[\hat{W}] = \arg\max \; p(O \mid Q, H) \, p(Q, H \mid W, P) \, p(W, P)$
• Advantages:
  – Natural extension of standard prosody-independent speech recognition
  – Allows the convenient integration of useful linguistic knowledge at different levels
  – Flexible
  – Fast
• Disadvantage:
  – Requires prosodically-transcribed training data

The Lexicon: p(Q,H|W,P)
• Tagged Transcription: Wanted*B4 chief* justice* of the Massachusetts* supreme court*B4
• Lexicon:
  – Each word has four entries
    • wanted, wanted*, wantedB4, wanted*B4
  – IP boundary applies to phones in rhyme of final syllable
    • wantedB4: w aa n t axB4 dB4
  – Accent applies to phones in lexically stressed syllable
    • wanted*: w* aa* n* t ax d

The Acoustic Model: p(X,Y|Q,H)
In order to train on such a small database, we propose a Factored Acoustic Model:
$p(X, Y \mid Q, A, B) = \prod_i p(d_i \mid q_i, b_i) \prod_t p(x_t \mid q_i) \, p(y_t \mid q_i, a_i)$
  – prosody-independent phone label $q_i \in \{aa, ae, b, d, \ldots\}$
  – pitch accent type $a_i \in \{\text{Accented}, \text{Unaccented}\}$
  – intonational phrase position $b_i \in \{\text{Final}, \text{Nonfinal}\}$
  – $d_i$ = duration of phone $q_i$ (explicitly modeled)
  – $x_t$ = standard speech recognition features (MFCC)
  – $y_t$ = nonlinearly transformed pitch features

Acoustic-Prosodic Observations:
$y_t = \mathrm{ANN}(\ln f_0(t-5), \ldots, \ln f_0(t+5))$

Explicit Duration HMM: Phrase-Final vs. Non-Final Duration Histograms
[Figure: duration histograms for /a/, phrase-medial and phrase-final, and for /c/, phrase-medial and phrase-final]

A Factored Language Model
[Figure: unfactored model with joint nodes (p_{i-1}, w_{i-1}) and (p_i, w_i), versus factored model with separate nodes p_{i-1}, w_{i-1}, p_i, w_i]
Prosodically tagged words: cats* climb trees*%
1. Unfactored: prosody and word string jointly modeled: p( trees*% | cats* climb )
2. Factored:
  • Prosody depends on syntax: p( w*% | N V N, w* w )
  • Syntax depends on words: p( N V N | cats climb trees )

Result: Syntactic Mediation of Prosody Reduces Perplexity and WER
[Figure: factored model with nodes p_{i-1}, w_{i-1}, s_{i-1}, p_i, w_i, s_i]
Factored Model:
• Reduces perplexity by 35%
• Reduces WER by 4%
Syntactic Tags:
• For pitch accent: POS sufficient
• For IP boundary: parse information useful if available

Syntactic Factors: POS, Syntactic Phrase Boundary Depth
[Figure: bar chart of accent prediction error and boundary prediction error (axis 0 to 45) for Chance, POS, and POS + Phrase]

Results: Word Error Rate (Boston Radio Speech Corpus)
[Figure: bar chart of word error rate (axis 20 to 25) for Baseline, PD Acoustic, PD Language, and PD Both]

The Word

Coarticulation: the Gestural Phonology Model (Browman and Goldstein, 1992)
“everybody” → “erwodi”
[Figure: gestural score over time with Lips, Tongue, and Glottis tiers for the phones of “everybody”]

Simplified Gestural Phonology: Coarticulation = Lazy Articulators (Livescu and Glass, 2004)

Pronunciation Model: Dynamic Bayesian Network (DBN) with Partially Asynchronous Articulators (Livescu and Glass, 2004)
• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i;j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i

Articulatory Feature DBN Experiments
• Background:
  – [Livescu and Glass, 2004]: AF model predicts mapping from canonical pronunciations to human transcriptions better than HMM
  – [Saenko and Livescu, 2006]: AF model recognizes visible speech better than HMM
• To be presented today:
  – AF model as part of a landmark-based speech recognizer [Hasegawa-Johnson, …, Livescu, et al., WS04]
  – AF model for audiovisual speech recognition [Livescu, …, Hasegawa-Johnson, et al., WS06]

The Syllable: Landmark-Based Speech Recognition

What are Landmarks?
• Instants of rapid spectral change (dX/dt).
• Instants of high spectral entropy (H(X(t)|X(t-1))).
• Instants of high mutual information between phoneme and signal (I(q;X(t-d),…,X(t+d))).
Where do these things happen?
• Consonant closures (fricative, stop, nasal)
• Consonant releases (fricative, stop, nasal)
• Syllable nuclei

Landmark-Based Speech Recognition
MAP transcription: … backed up …
[Figure: syllable structure tiers labeled ONSET, NUCLEUS, CODA]
Search Space:
  … buck up …
  … big dope …
  … backed up …
  … bagged up …
  … big doowop …

Stop Detection using Support Vector Machines (Niyogi, Ramesh, and Burges, 1999, 2002)
False Acceptance vs. False Rejection Errors per 10 ms frame, Four Types of Stop Detectors
(1) Delta-Energy (“Deriv”): Equal Error Rate = 0.2%
(2) HMM (*): False Rejection Error = 0.3%
(3) Linear SVM: Equal Error Rate = 0.15%
(4) RBF SVM: Equal Error Rate = 0.13%

Two Types of SVMs: Landmark Detectors (p(landmark(t))), Landmark Classifiers (p(place-features(t)|landmark(t)))
[Figure: processing chain: 2000-dimensional acoustic feature vector → SVM → discriminant y_i(t) → sigmoid or histogram → posterior probability of distinctive feature p(d_i(t)=1 | y_i(t))]
Acoustic Feature Vector: Local Cepstrogram, Formants, Auditory Modeling Features Covering +/-70 ms

SVM Training: Accuracy, Per Frame, in Percent (Chance=50%)

Train                NTIMIT   NTIMIT   NTIMIT        NTIMIT        Switchboard   Switchboard
Test                 NTIMIT   NTIMIT   Switchboard   Switchboard   Switchboard   Switchboard
Kernel               Linear   RBF      Linear        RBF           Linear        RBF
speech onset         95.1     96.2     71.4          62.2          81.6          81.6
speech offset        79.6     88.5     65.3          78.6          68.4          83.7
consonant onset      94.5     95.5     70.3          72.7          95.8          97.7
consonant offset     91.7     93.7     80.3          86.2          92.8          96.8
continuant onset     89.4     94.1     69.1          81.9          86.2          92.0
continuant offset    90.8     94.9     69.3          68.8          89.6          94.3
sonorant onset       95.6     97.2     85.2          86.5          96.3          96.3
sonorant offset      95.3     96.4     75.6          75.2          95.2          96.4
syllabic onset       90.7     95.2     69.5          78.9          87.9          92.6
syllabic offset      90.1     88.9     54.4          60.8          88.2          89.7

SVM/HMM Hybrid (Borys and Hasegawa-Johnson, ICSLP 2005)
• 10 landmark-detection SVMs
• 23 landmark-classification SVMs
• Acoustic features: MFCC+d+dd, formant freqs+amps
• HMM baseline speech recognizer: 3 states per phone, constrained only by a phoneme bigram
• Raw real-valued SVM discriminant output fed to HMM, modeled there using mixture Gaussian PDFs, as in the “tandem” NN/HMM hybrid (Ellis et al., 2000)

Features         Phone Error Rate
MFCCs+d+dd       63.9%
SVM Tandem       62.7%

DBN-SVM Model of Pronunciation Variability (Hasegawa-Johnson, Baker, …, Livescu et al., WS04, ICASSP 2005)
[Figure: DBN for the word LIKE, with layers for the canonical form and surface form (tongue settings such as front, open, closed, semi-closed), manner and place (e.g., palatal glide, front vowel), and SVM outputs p( gPGR(x) | palatal glide release ) and p( gGR(x) | glide release ), where x is a multi-frame observation including spectrum, formants, and auditory model]

SVM/DBN Hybrid: Design Decisions
• SVM Applied: When should SVM supply place features to the DBN?
  – Landmarks: Only at SVM-detected landmarks
  – Frames: In every frame
• Use Place?: In the frame-based hybrid, how should place features be used?
  – Always: use place features for segmentation and recognition
  – Recognition: use place features only for recognition
  – Selective: use only the high-accuracy place features; ignore others
• Probs: How should SVM information be passed to the DBN?
  – Posterior: SVM output converted to a posterior probability
  – Likelihood: Posterior is normalized to estimate a pseudo-likelihood
• DBN Training: How should the DBN be trained?
  – Manual: Using manual landmark transcriptions (ICSI Switchboard)
  – SVMs: Using SVM-detected landmarks

Landmark-Based Speech Recognizer used to Rescore the 2003 SRI Decipher System (Hasegawa-Johnson, Baker, …, Livescu, et al., WS04, ICASSP 2005)
For each word hypothesis generated by the SRI Decipher Speech Recognizer:
  – SVM probabilities computed during word hypothesis, input to DBN
  – DBN computes a score S based on P(word | evidence)
  – Final edge score is a weighted interpolation of first-pass speech recognizer scores, together with the DBN score

SVM Applied   Use Place?    Probs        DBN Training   3-speaker WER (# errors)
Baseline: SRI Decipher Recognizer as of Fall 2003       27.7 (550)
Landmarks                   Likelihood   Manual         27.6 (549)
Landmarks                   Posterior    Manual         27.3 (543)
Landmarks                   Posterior    SVM            27.3 (543)
Frames        Always        Posterior    SVM            >100%
Frames        Recognition   Posterior    SVM            27.3 (542)
Frames        Selective     Posterior    SVM            27.2 (541)

The Syllable: Audiovisual Speech

Why Use Visual Information?
• Visuals are unaffected by noise
• Human listeners use visuals in quiet:
  – McGurk and MacDonald, 1976: listeners unable to hear a /b/ if they see lips that stay open
• Human listeners use visuals in noise:
  – Sumby and Pollack, 1954: visible talker improves intelligibility in noise
  – Callan et al., 1997: pre-motor cortex activates when listening to speech at low SNR

Two Audiovisual Corpora
• AVICAR
  – Collected at University of Illinois
  – 100 talkers (largest free AVSR database?)
  – Read digits, digit strings, letters, sentences
  – Naturalistic & Variable Lighting: Moving Car
  – Naturalistic & Variable Noise: Wind, Cars, …
• CUAVE
  – Collected at Clemson University
  – 35 talkers
  – Read digits & digit strings
  – Controlled Lighting: Studio w/Green Screen
  – Controlled Noise: Electrically Added

AVICAR Recording Hardware (Lee, Hasegawa-Johnson et al., 2004)
• 4 Cameras, Glare Shields, Adjustable Mounting. Best Place = Dashboard.
• 8 Mics, Pre-amps, Wooden Baffle. Best Place = Sunvisor.
• System is not permanently installed; mounting requires 10 minutes.
Video: Face & Lip Tracking

Visual-Only Recognition Results
• Isolated Digits
• Video Features: Normalized DCT of the lip image
• Standard HMM speech recognizer, number of states per digit depends on number of phonemes
• Speaker-independent training, speaker-adapted recognition
• Recognition results (WER, percent)
  – AVICAR: about 80% WER
  – CUAVE: about 60% WER
• For comparison, other results reported for isolated digit recognition using controlled lighting, e.g., Chu and Huang 2002: about 60% WER

CUAVE Experiments (Livescu, …, Lal, Hasegawa-Johnson et al., WS06)
• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup
  – Training on clean data, number of Gaussians tuned on clean dev set
  – Audio/video weights tuned on noise-specific dev sets
  – Uniform (“zero-gram”) language model
  – Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for initial baselines, audio observations

Audio-only DBN Speech Recognizer
[Figure: DBN with variables subWordStateAudio, stateTransitionAudio, phoneStateAudio, obsAudio]

Video-only DBN Speech Recognizer
[Figure: DBN with variables subWordStateVideo, stateTransitionVideo, phoneStateVideo, obsVideo]

Audiovisual DBN with No Asynchrony
[Figure: DBN with variables subWordState, stateTransition, phoneState, and observations obsAudio and obsVideo]

Articulatory Asynchrony
For example, the tongue touches the teeth before acoustic speech onset in the word “three”; the lips are already round in anticipation of the /r/.
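One common way to represent this kind of asynchrony is to give each stream its own state index and bound how far the two indices may drift apart. The sketch below enumerates the joint states and left-to-right moves permitted under such a bound. It is an assumption-laden illustration, not the WS06 implementation: the function names, the equal-length left-to-right state chains, and the `max_async` parameter are hypothetical.

```python
from itertools import product


def allowed_joint_states(n_states, max_async):
    """List (audio_state, video_state) pairs whose indices differ by at most max_async."""
    return [
        (a, v)
        for a, v in product(range(n_states), repeat=2)
        if abs(a - v) <= max_async
    ]


def allowed_transitions(state, n_states, max_async):
    """Left-to-right moves from a joint state.

    Each stream either stays in its state or advances by one; a move is kept
    only if both indices remain valid and within the asynchrony bound.
    """
    a, v = state
    moves = []
    for da, dv in product((0, 1), repeat=2):
        na, nv = a + da, v + dv
        if na < n_states and nv < n_states and abs(na - nv) <= max_async:
            moves.append((na, nv))
    return moves
```

With `max_async=0` the two streams are forced to move in lockstep, which corresponds to a fully synchronous model; larger values let one stream lead the other by that many states.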
Coupled HMM Model of A/V Asynchrony (based on Chu and Huang, 2002)
[Figure: two coupled chains, one with subWordStateAudio, stateTransitionAudio, phoneStateAudio, obsAudio and one with subWordStateVideo, stateTransitionVideo, phoneStateVideo, obsVideo]

Asynchrony in Gestural Phonology (Browman and Goldstein, 1992)
[Figure: gestural score over time for “three”. Lips: Round, Spread. Tongue: Dental Critical, Retroflex Narrow, Palatal Narrow. Glottis: Unvoiced, Voiced.]

CHMM Model of Lip-Tongue-Glottis Asynchrony
[Figure: three coupled chains (Glottis, Tongue, Lips), each with subWordState, stateTransition, and phoneState variables, sharing the observations obsAudio and obsVideo]

Results, Question 1: Should we use video?
Answer: Yes. Fusion WER < Single-stream WER.
(Novelty: None. Many authors have reported this.)
[Figure: WER (axis 0 to 90) for Audio, Video, and Audiovisual at CLEAN, SNR 12dB, SNR 10dB, SNR 6dB, SNR 4dB, SNR -4dB]

Results, Q2: Should the streams be asynchronous?
Answer: Yes. Async WER < Sync WER (30% relative @ mid SNRs).
(Novelty: first phone-based AVSR with inter-phone asynchrony.)
[Figure: WER (axis 0 to 70) for No Asynchrony, 1 State Async, 2 States Async, and Unlimited Async at CLEAN, SNR 12dB, SNR 10dB, SNR 6dB, SNR 4dB, SNR -4dB]

Results, Q3: Should we model Articulatory or A/V Asynchrony?
Answer: It doesn’t matter. Articulatory feature WER = Phoneme-viseme WER.
(Novelty: First articulatory feature model for AVSR.)
[Figure: WER (axis 0 to 80) for Phone-viseme and Articulatory features at Clean, SNR 12dB, SNR 10dB, SNR 6dB, SNR 4dB, SNR -4dB]

Results, Q4: Are A/V and Articulatory Models Identical?
Answer: No.
Best WER: A/V and Articulatory Recognizers Vote
[Figure: word error rate on development test data, averaged across SNRs (axis 17 to 23), for: Voting, Best Three w/ Artic; Voting, Best Three w/o Artic; A/V CHMM, 2 States Async; Articulatory Features CHMM; A/V CHMM, 1 State Async]

Conclusions
• Prosodic cues (duration, pause, pitch) can reliably detect a talker’s end-of-turn
• Simultaneous recognition of words and their prosody can reduce WER
• SVMs can be trained to extract the dynamic spectral features present in the onset, nucleus, and coda of a syllable
• DBN models of articulatory phonology
  – can predict unusual pronunciation variants, and
  – can model the inter-articulator asynchrony that causes apparent “audio-visual” asynchrony
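The slides do not say how the recognizers' votes were combined. As a rough illustration of the idea behind the voting result, the sketch below takes a position-wise plurality vote over word sequences of equal length, an assumption suggested by the CUAVE setup, where decoding was constrained to 10-word utterances. The function name and the tie-breaking rule (earliest-listed system wins) are hypothetical.

```python
from collections import Counter


def vote(hypotheses):
    """Position-wise plurality vote over equal-length word sequences.

    hypotheses is a list of word lists, one per recognizer. Ties are broken
    in favor of the recognizer listed first (an arbitrary choice made here).
    """
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("all hypotheses must have the same number of words")
    result = []
    for words in zip(*hypotheses):
        counts = Counter(words)
        best = max(counts.values())
        # Scan in system order so the earliest system wins any tie.
        result.append(next(w for w in words if counts[w] == best))
    return result
```

For example, `vote([["one", "two"], ["one", "too"], ["nine", "two"]])` picks the majority word at each position. Hypotheses of differing lengths would need an alignment step first, which this sketch does not attempt.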
