Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson
jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA

Lecture 3: Spectral Dynamics and the Production of Consonants
• International Phonetic Alphabet
• Events in the Closure of a Nasal Consonant
– Formant transitions: a perturbation model
– Nasalized vowel
– Nasal murmur

• Events in the Release of a Stop Consonant
– Pre-voicing (voiced stops in carefully read English)
– Transient (stops and affricates)
– Frication (stops, affricates, and fricatives)
– Aspiration (aspirated stops and /h/)
– Formant Transitions (any consonant-vowel transition)

• Formant Tracking
– Does it help Speech Recognition?
– Methods for Vowels, and for Aspiration & Nasals

• Reminder – lab 1 due Monday!

International Phonetic Alphabet: Purpose and Brief History
• Purpose of the alphabet: to provide a universal notation for the sounds of the world’s languages
– “Universal” = If any language on Earth distinguishes two phonemes, IPA must also distinguish them
– “Distinguish” = Meaning of a word changes when the phoneme changes, e.g. “cat” vs. “bat.”

• Very Brief History:
– 1867: Alexander Melville Bell publishes a distinctive-feature-based phonetic notation in “Visible Speech: The Science of Universal Alphabetics.” His notation is rejected as being too expensive to print
– 1886: International Phonetic Association founded in Paris by phoneticians from across Europe
– 1991: Unicode provides a standard method for including IPA notation in computer documents

International Phonetic Alphabet: Vowels
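[Figure: IPA vowel chart annotated with approximate Pinyin and ARPABET equivalents. Pinyin labels on the chart: i, u (as in “xu”), u (as in “zhu”), o, e, a (as in “zhang”), a (as in “ma”). ARPABET labels on the chart: IY, UX, UW, UH, OW, EY, EH, AE, AX, AH, AO, AA.]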

IPA: Regular Consonants
[Figure: IPA chart of regular consonants, with tongue-blade and tongue-body articulations marked and ARPABET labels NG, DX, Q, HH/HV, R, Y.]

ARPABET: F/V (labiodental), TH/DH (dental), S/Z (alveolar), SH/ZH (postalveolar or palatal)
Pinyin: s (alveolar), x (postalveolar), sh/r (retroflex)

Affricates and Doubly-Articulated Consonants
[Figure: IPA chart of affricates and doubly-articulated consonants, with ARPABET labels WH and W.]

Affricates in English and Chinese:

Place           Pinyin   ARPABET   IPA
Alveolar        c/z      -         ts/dz
Post-alveolar   q/j      CH/JH     tʃ/dʒ
Retroflex       ch/zh    -         ʈʂ/ɖʐ
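The affricate correspondences listed above can be kept in a small lookup table, which also makes it easy to check which Unicode code points an IPA transcription needs. The following is a minimal sketch, not part of the original slides; the names `AFFRICATES`, `arpabet_to_ipa`, and `code_points` are hypothetical, and the retroflex pair is written with the standard IPA symbols ʈʂ/ɖʐ.

```python
# Affricate rows from the slide: place -> (Pinyin, ARPABET, IPA).
# None marks a cell for which the slide gives no ARPABET symbol.
AFFRICATES = {
    "alveolar": (("c", "z"), None, ("ts", "dz")),
    "post-alveolar": (("q", "j"), ("CH", "JH"), ("tʃ", "dʒ")),
    "retroflex": (("ch", "zh"), None, ("ʈʂ", "ɖʐ")),
}


def arpabet_to_ipa(symbol):
    """Return the IPA string for an ARPABET affricate, or None if it is not in the table."""
    for _pinyin, arpabet, ipa in AFFRICATES.values():
        if arpabet and symbol in arpabet:
            return ipa[arpabet.index(symbol)]
    return None


def code_points(ipa):
    """List the Unicode code points of an IPA string."""
    return [f"U+{ord(ch):04X}" for ch in ipa]


if __name__ == "__main__":
    for sym in ("CH", "JH"):
        ipa = arpabet_to_ipa(sym)
        print(sym, ipa, code_points(ipa))
```

Storing the IPA side as ordinary Unicode strings means the table can be pasted into any Unicode-aware document or source file without a special font encoding.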

Non-Pulmonic Consonants

Events in the Closure of a Syllable-Final Nasal Consonant

Events in the Closure of a Nasal Consonant

[Figure labels: Formant Transitions, Vowel Nasalization, Nasal Murmur]

Formant Transitions: A Perturbation Theory Model

“the mom”

Formant Transitions: Labial Consonants
“the bug”

“the supper”

Formant Transitions: Alveolar Consonants
“the tug”

“the shoe”

Formant Transitions: Post-alveolar Consonants
“the zsazsa”

Formant Transitions: Velar Consonants

“the gut”

“sing a song”

Formant Transitions: A Perceptual Study

The study: (1) Synthesize speech with different formant patterns, (2) record subject responses. Delattre, Liberman and Cooper, J. Acoust. Soc. Am. 1955.

Perception of Formant Transitions: Conclusions

Vowel Nasalization

Additive Terms in the Log Spectrum

Transfer Function of a Nasalized Vowel
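The summary of this lecture describes vowel nasalization as a sum of transfer functions. One way to see the consequence is the sketch below, which adds two all-pole transfer functions (one standing in for the oral path, one for the nasal path) and shows that the sum has a zero between the two pole frequencies. This is a simplified illustration, not the model on the slide: it assumes a single resonance per path, equal path weights, and hypothetical resonance values; the helper names are also hypothetical.

```python
import numpy as np


def resonance_poly(f_hz, bw_hz, fs):
    """Denominator 1 - 2 r cos(theta) z^-1 + r^2 z^-2 of a two-pole resonance."""
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * f_hz / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])


def parallel_sum(a_oral, a_nasal):
    """Combine 1/A_oral + 1/A_nasal into (A_oral + A_nasal) / (A_oral * A_nasal)."""
    n = max(len(a_oral), len(a_nasal))
    padded_oral = np.concatenate((a_oral, np.zeros(n - len(a_oral))))
    padded_nasal = np.concatenate((a_nasal, np.zeros(n - len(a_nasal))))
    numerator = padded_oral + padded_nasal
    denominator = np.convolve(a_oral, a_nasal)
    return numerator, denominator


def zero_frequencies_hz(numerator, fs):
    """Frequencies (Hz) of the numerator roots in the upper half of the z-plane."""
    roots = np.roots(numerator)
    roots = roots[np.imag(roots) > 0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))


if __name__ == "__main__":
    fs = 8000.0
    # Hypothetical values: oral resonance at 700 Hz, nasal resonance at 300 Hz.
    num, den = parallel_sum(resonance_poly(700.0, 100.0, fs),
                            resonance_poly(300.0, 150.0, fs))
    print("zero frequencies (Hz):", zero_frequencies_hz(num, fs))
```

The zero appears only because the two paths are added; a cascade (product) of the same two resonances would have no zeros at all.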

Nasal Murmur
“the mug”
“the nut”
“sing a song”

Observations:
– Low-frequency resonance (about 300 Hz) is always present
– Low-frequency resonance has wide bandwidth (about 150 Hz)
– Energy of the low-frequency resonance is very constant
– Most high-frequency resonances are cancelled by zeros
– Different places of articulation have different high-frequency spectra
– High-frequency spectrum is talker-dependent and variable
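To get a feel for the first two observations, the sketch below evaluates the magnitude response of a single two-pole digital resonator centred at 300 Hz with a 150 Hz bandwidth. It is an illustration only: the resonator form, the 16 kHz sampling rate, and the function name `resonator_response` are assumptions, and the zeros and high-frequency resonances of a real nasal murmur are not modelled.

```python
import numpy as np


def resonator_response(f_center, bandwidth, fs, freqs):
    """Magnitude response of a two-pole digital resonator at the given frequencies (Hz)."""
    T = 1.0 / fs
    r = np.exp(-np.pi * bandwidth * T)      # pole radius is set by the bandwidth
    theta = 2.0 * np.pi * f_center * T      # pole angle is set by the centre frequency
    b1 = 2.0 * r * np.cos(theta)
    b2 = -r * r
    gain = 1.0 - b1 - b2                    # normalise to unity gain at 0 Hz
    z_inv = np.exp(-2j * np.pi * freqs * T)
    return np.abs(gain / (1.0 - b1 * z_inv - b2 * z_inv ** 2))


if __name__ == "__main__":
    fs = 16000.0                            # assumed sampling rate
    freqs = np.arange(0.0, 4000.0, 5.0)
    mag = resonator_response(300.0, 150.0, fs, freqs)
    print("peak at about", freqs[np.argmax(mag)], "Hz")
```

Because the bandwidth is half the centre frequency, the peak is broad and sits slightly below the nominal 300 Hz, which is consistent with a low-frequency resonance that dominates the murmur without being sharply tuned.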

Resonances of a Nasal Consonant

Reference: Fujimura, JASA 1962

Anti-Resonances of a Nasal Consonant

Events in the Release of a Stop (Plosive) Consonant

Events in the Release of a Stop

“Burst” = transient + frication (the part of the spectrogram whose transfer function has poles only at the front cavity resonance frequencies, not at the back cavity resonances).

Events in the Release of a Stop
[Figure: release of an unaspirated stop (/b/) compared with an aspirated stop (/t/); labeled events: Transient, Frication, Aspiration, Voicing.]

Pre-voicing during Closure
To make a voiced stop in most European languages: the tongue root is relaxed, allowing the cavity behind the closure to expand, so that the vocal folds can continue vibrating for a little while after oral closure. The result is a low-frequency “voice bar” that may continue well into closure. In English, closure voicing is typical of read speech, but not of casual speech.
“the bug”

Transient: The Release of Pressure

Transfer Function During Transient and Frication: Poles
Turbulence striking an obstacle makes noise

Front cavity resonance frequency: F_R = c/(4 L_f), where c is the speed of sound and L_f is the length of the front cavity
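As a quick numerical check of the quarter-wavelength formula, the sketch below evaluates it for a few front-cavity lengths. The speed of sound (35400 cm/s) and the lengths are assumed values chosen only to show the trend; the function name is hypothetical.

```python
SPEED_OF_SOUND_CM_PER_S = 35400.0  # assumed value for warm, moist air in the vocal tract


def front_cavity_resonance_hz(front_cavity_length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """Quarter-wavelength resonance c / (4 * L_f) of the cavity in front of the constriction."""
    if front_cavity_length_cm <= 0:
        raise ValueError("front cavity length must be positive")
    return c / (4.0 * front_cavity_length_cm)


if __name__ == "__main__":
    # Hypothetical front-cavity lengths, used only for illustration.
    for length_cm in (1.0, 2.0, 4.0):
        resonance = front_cavity_resonance_hz(length_cm)
        print(f"L_f = {length_cm:.1f} cm -> F_R = {resonance:.0f} Hz")
```

Halving the front-cavity length doubles the resonance frequency, which is why a constriction far forward in the mouth produces a burst with energy at higher frequencies than a constriction further back.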

Transfer Function During Frication: An Important Zero
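The summary of this lecture characterizes the frication transfer function as a zero at f=0 over the front cavity resonances. A minimal sketch of that shape, assuming a single front-cavity resonance and a first-order zero at 0 Hz, is given below. The resonance frequency, bandwidth, sampling rate, and the name `frication_response` are all assumptions made for illustration.

```python
import numpy as np


def frication_response(front_cavity_hz, bandwidth_hz, fs, freqs):
    """Magnitude of (1 - z^-1) / (1 - 2 r cos(theta) z^-1 + r^2 z^-2) at freqs (Hz)."""
    T = 1.0 / fs
    r = np.exp(-np.pi * bandwidth_hz * T)
    theta = 2.0 * np.pi * front_cavity_hz * T
    z_inv = np.exp(-2j * np.pi * freqs * T)
    numerator = 1.0 - z_inv                                   # zero at 0 Hz
    denominator = 1.0 - 2.0 * r * np.cos(theta) * z_inv + (r * z_inv) ** 2
    return np.abs(numerator / denominator)


if __name__ == "__main__":
    fs = 16000.0                                              # assumed sampling rate
    freqs = np.arange(0.0, 8000.0, 10.0)
    # Hypothetical front-cavity resonance and bandwidth.
    mag = frication_response(4425.0, 400.0, fs, freqs)
    print("gain at 0 Hz:", mag[0])
    print("peak at about", freqs[np.argmax(mag)], "Hz")
```

The zero removes low-frequency energy, so the spectrum is dominated by the front-cavity peak; this matches the high-frequency concentration of energy seen in burst and frication spectra.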

Transfer Function During Aspiration

Are Formant Frequencies Useful for Speech Recognition?
• Kopec and Bush (1992): WER(formants alone) > WER(cepstrum alone) > WER(formants and cepstrum together)
• How should we track formants?
– In vowels: Autoregressive (AR) modeling (also known as LPC)
– In aspiration, nasals: Autoregressive Moving Average (ARMA) modeling. Problem: no closed-form solution
– In aspiration, nasals: Exponentially Weighted Autoregressive (EWAR; Zheng and Hasegawa-Johnson, ICASSP 2004)

Formant Tracking for Vowels: Autoregressive Model (LPC)
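A common way to estimate formants from an autoregressive model is to fit LPC coefficients by the autocorrelation method and read formant frequencies and bandwidths off the roots of the prediction polynomial. The sketch below shows that procedure on a synthetic all-pole signal. It is one standard formulation, not necessarily the one used on the slide; the function names and the test resonances (500 Hz and 1500 Hz) are hypothetical, and real speech frames would normally be pre-emphasized and windowed first.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns [1, a1, ..., ap] of A(z) = 1 + sum a_k z^-k."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: sum_k a_k r(|i - k|) = -r(i), for i = 1..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, fs):
    """Map upper-half-plane roots of A(z) to sorted (frequencies, bandwidths) in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]


def two_pole(f_hz, bw_hz, fs):
    """Denominator polynomial of a single resonance with the given frequency and bandwidth."""
    r = np.exp(-np.pi * bw_hz / fs)
    return np.array([1.0, -2.0 * r * np.cos(2.0 * np.pi * f_hz / fs), r * r])


def all_pole_impulse_response(a, n):
    """First n samples of the impulse response of 1/A(z)."""
    x = np.zeros(n)
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0
        for k in range(1, min(i, len(a) - 1) + 1):
            acc -= a[k] * x[i - k]
        x[i] = acc
    return x


if __name__ == "__main__":
    fs = 8000.0
    a_true = np.convolve(two_pole(500.0, 80.0, fs), two_pole(1500.0, 80.0, fs))
    x = all_pole_impulse_response(a_true, 400)
    freqs, bws = formants_from_lpc(lpc_coefficients(x, 4), fs)
    print("formant frequencies (Hz):", freqs)
    print("bandwidths (Hz):", bws)
```

The all-pole fit works well when the signal really is all-pole, as in this synthetic example. During aspiration and nasal murmur the spectrum also contains zeros, which is the motivation for the ARMA and EWAR models discussed next.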

Formant Tracking for Aspiration: “Auto-Regressive Moving Average” Model (ARMA)

Formant Tracking for Aspiration: “Exponentially Weighted Auto-Regressive” Model (EWAR)
(Zheng and Hasegawa-Johnson, ICSLP 2004)

Solving the EWAR Model

Results: Stop Classification, MFCC alone vs. MFCC+formants

Summary
• International Phonetic Alphabet:
– Useful on any computer with Unicode
– International encoding for all sounds of the world’s languages

• Events in a nasal closure:
– Formant transitions (perturbation model)
– Vowel nasalization (sum of TFs)
– Nasal murmur (impedance match at juncture)

• Events in release of a stop:
– Pre-voicing in English voiced stops (read speech)
– Transient (dp/dt ~ dA/dt)
– Frication ((zero at f=0)/(front cavity resonances))
– Aspiration ((zero at f=0)/(same poles as the vowel))

• Formant tracking
– In a vowel: use LPC
– In aspiration, frication, or nasal murmur: ARMA is theoretically optimum, but computationally expensive
– In aspiration, frication, or nasal murmur: EWAR can be a good approximation to ARMA