Towards improved proper name recognition
Bert Réveil and Jean-Pierre Martens
DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{breveil,martens}@elis.ugent.be
Topic description
------------------------------------------------------------------------
Automatic proper name recognition is a key component of multiple speech-based applications (e.g. voice-driven navigation systems). This recognition is challenged by the
mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:
• Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters can't cope with archaic spelling and foreign name parts, manual transcriptions are too costly (e.g. Ugchelsegrensweg, Haînautlaan)
• Multiple plausible name pronunciations: within or across languages (e.g. Roger)
• Cross-lingual pronunciation variation: foreign names, foreign application users
[Figure: the request "Please guide me towards 'A&u.stIn" is sent to the GPS recognition system (HMMs and lexicon), whose lexicon lists Austin as 'O.stIn.]
In order to improve the phonemic transcriptions and capture the pronunciation variation we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a
better modeling of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.
Experimental set-up
------------------------------------------------------------------------
Database: Autonomata Spoken Name Corpus (ASNC)
• 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
• Every speaker reads 181 names with either Dutch, English, French, Moroccan or Turkish origin
• Non-overlapping train and test set (disjoint names, speakers)
• Human expert transcriptions
  - TY: typical Dutch transcription (one for each name from TeleAtlas)
  - AV: auditory verified Dutch transcription (one for each name utterance)
This work: only Dutch native utterances + non-native utterances of Dutch names
Table 1: Number of utterances for all (speaker,name) pairs in train and test set
(spkr,name)  Set      DU    EN    FR    MO    TU
(DU,*)       train  9960  1909   966  1245   943
             test   4440   851   414   555   437
(*,DU)       train  9960  3000  1680  3360  1560
             test   4440  1800   720  1440   840
Speech recognizer: state-of-the-art VoCon 3200 from Nuance
• Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)
Acoustic and lexical modeling strategies
------------------------------------------------------------------------
The modeling approaches are firstly conceived for the primary targeted users, also called the native (NAT) users (in our case Dutch natives). W.r.t. these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).
Strategy 1: Incorporating NN1 language knowledge
• Acoustic modeling: two model sets
  - AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
  - AC-MULTI: Dutch (20%) and NN1 training data (English, French and German)
• Lexical modeling
  - G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)
    Foreign transcriptions are nativized in combination with AC-MONO
  - Data-driven selection of one extra G2P converter per name origin
Strategy 2: Creating pronunciation variants (lexical modeling)
  - Computed per (speaker, name) combination
  - Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters
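The lexical side of Strategy 1 can be pictured as a lexicon that holds, for each name, the Dutch G2P transcription plus one extra transcription from the G2P converter selected for that name's origin, nativized when the Dutch-only AC-MONO models are used. The sketch below is only an illustration under stated assumptions: the function names, the `NATIVIZATION_MAP` contents and the converter callables are hypothetical, and neither the actual nativization mapping nor the data-driven selection procedure is given on this poster.

```python
# Hypothetical foreign-to-Dutch phoneme substitutions (illustrative only;
# the poster does not list the actual nativization mapping).
NATIVIZATION_MAP = {"T": "s", "D": "z"}


def nativize(phonemes, mapping=NATIVIZATION_MAP):
    """Replace phonemes outside the Dutch inventory by Dutch stand-ins."""
    return [mapping.get(p, p) for p in phonemes]


def build_lexicon(names, dutch_g2p, extra_g2p_by_origin, nativize_extra=True):
    """Map each name to its Dutch transcription plus one origin-specific extra.

    names: iterable of (name, origin) pairs, e.g. ("Smith", "EN")
    dutch_g2p: callable name -> list of phonemes (the NAT converter)
    extra_g2p_by_origin: dict origin -> callable, i.e. the single extra
        converter already selected for that origin (selection not shown)
    nativize_extra: True for a Dutch-only acoustic model, False for "plain"
    """
    lexicon = {}
    for name, origin in names:
        variants = [dutch_g2p(name)]
        extra = extra_g2p_by_origin.get(origin)
        if extra is not None:
            transcription = extra(name)
            if nativize_extra:
                transcription = nativize(transcription)
            if transcription not in variants:  # avoid duplicate entries
                variants.append(transcription)
        lexicon[name] = variants
    return lexicon
```

Calling `build_lexicon` with `nativize_extra=False` keeps the foreign phonemes untouched, which corresponds to the "plain" lexicons evaluated with AC-MULTI.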
Construction of phoneme-to-phoneme converters
------------------------------------------------------------------------
P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a 4-step training procedure:
• Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription
  Orthography:            ~ D i r k () V a n () D e n () ~ B o ~ ssch e
  Initial transcription:  „ d I r K _ f A n _ d E n _ „ b O . s $
  Target transcription:   „ d i r k _ v A n _ d $ m _ ~ b O . s $
• Transformation retrieval
• Generation of training examples: describe linguistic context
  - Previous and next phonemes and graphemes
  - Lexical context (Part Of Speech)
  - Prosodic context (stressed syllable or not)
  - Morphological context (word prefix/suffix)
  - External features: e.g. name type, name source, speaker tongue
• Rule induction
  - Learn decision tree per input (pattern): stochastic rules in leaf nodes
  - Rule formalism: if context → leaf node then [input pattern] → [output pattern] with probability Pfir
  - In generation mode: rules applied to initial G2P transcription of unseen name, yielding variants with probabilities
[Figure: training flow. Inputs: orthography, initial transcription, target transcription, high level features. Stages: alignment process (letter-to-sound), alignment process (sound-to-sound), transformation learning, learn morphological classes, example generation, stochastic rule induction. Output: variants with probabilities.]
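The transformation retrieval and example generation steps can be illustrated on the aligned "Dirk Van Den Bossche" rows. The sketch below is a simplification under stated assumptions: it only looks at the sound-to-sound alignment, it uses the previous and next phoneme as the sole context, and it replaces the decision-tree induction by plain relative frequencies per input phoneme. All function names and the `#` padding symbol are hypothetical.

```python
from collections import Counter, defaultdict

# Aligned rows from the "Dirk Van Den Bossche" example; the stress and gap
# marker columns are dropped here to keep the sketch ASCII-only.
INITIAL = "d I r K _ f A n _ d E n _ b O . s $".split()
TARGET = "d i r k _ v A n _ d $ m _ b O . s $".split()


def retrieve_transformations(initial, target):
    """List (position, input, output) where the aligned rows disagree."""
    if len(initial) != len(target):
        raise ValueError("rows must be aligned to equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(initial, target)) if a != b]


def training_examples(initial, target, pad="#"):
    """One example per position: (previous, input, next) context -> output."""
    if len(initial) != len(target):
        raise ValueError("rows must be aligned to equal length")
    padded = [pad] + list(initial) + [pad]
    return [((padded[i], padded[i + 1], padded[i + 2]), out)
            for i, out in enumerate(target)]


def output_probabilities(examples):
    """Relative frequency of each output per input phoneme (context ignored).

    This stands in for the stochastic rules in the decision-tree leaves.
    """
    counts = defaultdict(Counter)
    for (_prev, inp, _next), out in examples:
        counts[inp][out] += 1
    return {inp: {out: n / sum(c.values()) for out, n in c.items()}
            for inp, c in counts.items()}
```

On this single name, `retrieve_transformations(INITIAL, TARGET)` picks out the five positions where the target differs from the initial transcription (for instance `f` becoming `v`). A real converter would pool examples over the whole training collection and condition the probabilities on the richer context features listed above.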
Experimental assessment
------------------------------------------------------------------------
Incorporating NN1 language knowledge
• Including extra G2P transcriptions (acoustic model = AC-MONO)
  - Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names
  - Degradation for (DU,DU): reduced by selecting only one extra G2P
• Decoding with multilingual acoustic model
  - NAT speakers: loss for NAT names, boost for English names only
    Dutch sounds not as well modeled as before
    English better known than French?
    English and Dutch sound inventories differ more than French and Dutch?
  - Foreign speakers: boost for both NN1 name origins
    mother tongue sounds better modeled
• Plain multilingual G2P transcriptions bring no improvement
Table 2: Name Error Rate (%) for systems with G2P lexicons
(spkr,name)  System                                  DU    EN    FR    MO    TU
(DU,*)       AC-MONO + DUN G2P                      6.5  38.5  21.3  14.6  28.4
             AC-MONO + 4G2P (nativized)             7.2  22.7   9.9   9.5  17.2
             AC-MONO + G2P-selection (nativized)    6.5  20.8   7.2   9.0  18.1
             AC-MULTI + G2P-selection (nativized)   8.5  14.9   7.2   8.3  16.2
             AC-MULTI + G2P-selection (plain)       8.5  14.0   7.7   8.6  18.1
(*,DU)       AC-MONO + DUN G2P                      6.5  25.1  33.2  26.9  40.8
             AC-MONO + 4G2P (nativized)             7.2  22.8  32.2  27.0  40.6
             AC-MONO + G2P-selection (nativized)    6.5  22.8  31.1  25.3  38.5
             AC-MULTI + G2P-selection (nativized)   8.5  17.6  22.6  25.2  38.6
             AC-MULTI + G2P-selection (plain)       8.5  18.2  22.6  25.8  40.4
Creating pronunciation variants
• Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
• Alternative P2Ps for (DU,NN1) and (NN1,DU) cells
  - create additional P2P that starts from NN1 G2P transcriptions
  - combine most probable variants generated by both P2P converters
• P2P variants lead to significant improvements for all (speaker, name) cells
  - 10 .. 25% relative for NAT + foreign names, 5 .. 17% for foreign speakers
Table 3: Name Error Rate (%) for systems with P2P transcription variants
(spkr,name)  System                                  DU    EN    FR    MO    TU
(DU,*)       AC-MULTI + G2P-selection (nativized)   8.5  14.9   7.2   8.3  16.2
             + 4 P2P variants (baseline)            7.7  13.2   6.3   7.0  11.9
             + 4 P2P variants (alternative)         7.7  12.2   6.3   7.0  11.9
(*,DU)       AC-MULTI + G2P-selection (nativized)   8.5  17.6  22.6  25.2  38.6
             + 4 P2P variants (baseline)            7.7  17.2  19.9  24.0  35.2
             + 4 P2P variants (alternative)         7.7  16.4  18.8  24.0  35.2
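The relative improvements quoted for the P2P variants can be re-derived from the Table 3 figures. The sketch below assumes that "relative" means the Name Error Rate reduction with respect to the AC-MULTI + G2P-selection (nativized) row, and that the alternative P2P variants are the system being compared; the dictionary layout and function names are hypothetical.

```python
# Name Error Rates (%) copied from Table 3. "reference" is the
# AC-MULTI + G2P-selection (nativized) row, "alternative" the
# "+ 4 P2P variants (alternative)" row of the same (spkr,name) block.
TABLE3 = {
    "(DU,*)": {
        "reference": {"DU": 8.5, "EN": 14.9, "FR": 7.2, "MO": 8.3, "TU": 16.2},
        "alternative": {"DU": 7.7, "EN": 12.2, "FR": 6.3, "MO": 7.0, "TU": 11.9},
    },
    "(*,DU)": {
        "reference": {"DU": 8.5, "EN": 17.6, "FR": 22.6, "MO": 25.2, "TU": 38.6},
        "alternative": {"DU": 7.7, "EN": 16.4, "FR": 18.8, "MO": 24.0, "TU": 35.2},
    },
}


def relative_reduction(before, after):
    """Relative error rate reduction in percent."""
    return 100.0 * (before - after) / before


def reductions(block):
    """Relative reduction per column for one (spkr,name) block of Table 3."""
    ref = TABLE3[block]["reference"]
    alt = TABLE3[block]["alternative"]
    return {col: round(relative_reduction(ref[col], alt[col]), 1) for col in ref}


if __name__ == "__main__":
    for block in TABLE3:
        print(block, reductions(block))
```

Under these assumptions the non-Dutch columns of the (DU,*) block and of the (*,DU) block come out roughly in line with the two ranges stated above.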
Acknowledgments
------------------------------------------------------------------------
The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.
References
------------------------------------------------------------------------
[1] B. Réveil, J.-P. Martens and B. D'hoore, How speaker tongue and name source language affect the automatic recognition of spoken names, in Proc. InterSpeech 2009, Brighton, UK
[2] H. van den Heuvel, B. Réveil and J.-P. Martens, Pronunciation-based ASR for names, in Proc. InterSpeech 2009, Brighton, UK
[3] B. Réveil, J.-P. Martens and H. van den Heuvel, Improving proper name recognition by adding automatically learned pronunciation variants to the lexicon, in Proc. LREC 2010, Valletta, Malta