Towards improved proper name recognition
Bert Réveil and Jean-Pierre Martens
DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{breveil,martens}@elis.ugent.be


Topic description
-----------------


Automatic proper name recognition is a key component of multiple speech-based applications (e.g. voice-driven navigation systems). This recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:

• Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters can't cope with archaic spelling and foreign name parts; manual transcriptions are too costly (e.g. Ugchelsegrensweg, Haînautlaan)
• Multiple plausible name pronunciations: within or across languages (e.g. Roger)
• Cross-lingual pronunciation variation: foreign names, foreign application users

[Figure: a user says "Please guide me towards 'A&u.stIn" to a GPS recognition system (HMMs, Lexicon); the lexicon entry shown is Austin 'O.stIn.]

In order to improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a better modeling of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.


Experimental set-up
-------------------

Database: Autonomata Spoken Name Corpus (ASNC)
• 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
• Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin
• Non-overlapping train and test sets (disjoint names and speakers)
• Human expert transcriptions
    - TY: typical Dutch transcription (one for each name from TeleAtlas)
    - AV: auditorily verified Dutch transcription (one for each name utterance)
This work: only Dutch native utterances + non-native utterances of Dutch names

Table 1: Number of utterances for all (speaker,name) pairs in train and test set
(spkr,name)  Set     DU     EN     FR     MO     TU
(DU,*)       train   9960   1909   966    1245   943
             test    4440   851    414    555    437
(*,DU)       train   9960   3000   1680   3360   1560
             test    4440   1800   720    1440   840

Speech recognizer: state-of-the-art VoCon 3200 from Nuance
• Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)

Acoustic and lexical modeling strategies
----------------------------------------

The modeling approaches are first conceived for the primary targeted users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).

Strategy 1: Incorporating NN1 language knowledge
• Acoustic modeling: two model sets
    - AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
    - AC-MULTI: Dutch (20%) and NN1 training data (English, French and German)
• Lexical modeling
    - G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)
      Foreign transcriptions are nativized in combination with AC-MONO
    - Data-driven selection of one extra G2P converter per name origin

Strategy 2: Creating pronunciation variants (lexical modeling)
    - Computed per (speaker, name) combination
    - Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters
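
The lexical side of Strategy 1 amounts to choosing which G2P transcriptions end up in the lexicon entry of each name. The sketch below is one possible reading of the three lexicon types compared on this poster (Dutch G2P only, all four G2Ps, Dutch plus one selected extra G2P). The function name, the mode names, the language codes and above all the contents of the selection table are assumptions made for illustration; the poster only states that the selection is data-driven.

```python
from typing import Dict, List

# HYPOTHETICAL selection table: extra G2P converter per name origin.
# The poster says this choice is data-driven; these values are placeholders.
EXTRA_G2P_FOR_ORIGIN: Dict[str, str] = {"EN": "EN", "FR": "FR", "MO": "FR", "TU": "EN"}


def lexicon_entry(transcriptions: Dict[str, str], origin: str, mode: str) -> List[str]:
    """Return the pronunciations listed in the lexicon for one name.

    transcriptions maps a G2P language code to that converter's output.
    mode is "native" (Dutch G2P only), "all" (every available G2P) or
    "selection" (Dutch plus one extra G2P chosen by name origin).
    """
    if mode == "native":
        chosen = ["DU"]
    elif mode == "all":
        chosen = ["DU"] + sorted(k for k in transcriptions if k != "DU")
    elif mode == "selection":
        extra = EXTRA_G2P_FOR_ORIGIN.get(origin)
        chosen = ["DU"] + ([extra] if extra else [])
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Two converters may produce the same transcription: keep it only once.
    entry: List[str] = []
    for lang in chosen:
        t = transcriptions.get(lang)
        if t is not None and t not in entry:
            entry.append(t)
    return entry


# Placeholder strings stand in for real G2P output.
g2p_out = {"DU": "t_du", "EN": "t_en", "FR": "t_fr", "GE": "t_ge"}
print(lexicon_entry(g2p_out, "EN", "selection"))
```

Nativization of the foreign transcriptions (needed when decoding with AC-MONO) is not covered here; it would be an extra mapping step applied to the non-Dutch entries before they are stored.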


Construction of phoneme-to-phoneme converters
---------------------------------------------


P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a 4-step training procedure:

• Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription

      Orthography:  ~  D  i  r  k  ()  V  a  n  ()  D  e  n  ()  ~  B  o  ~  ssch  e
      Initial:      '  d  I  r  K  _   f  A  n  _   d  E  n  _   '  b  O  .  s     $
      Target:       '  d  i  r  k  _   v  A  n  _   d  $  m  _   ~  b  O  .  s     $

• Transformation retrieval
• Generation of training examples: describe linguistic context
      - Previous and next phonemes and graphemes
      - Lexical context (Part Of Speech)
      - Prosodic context (stressed syllable or not)
      - Morphological context (word prefix/suffix)
      - External features: e.g. name type, name source, speaker tongue
• Rule induction
      - Learn decision tree per input (pattern): stochastic rules in leaf nodes
      - Rule formalism: if context → leaf node then [input pattern] → [output pattern] with probability Pfir

[Figure: P2P training flow chart with the blocks Orthography, Initial transcription, Target transcription, High level features, Alignment process (letter-to-sound), Alignment process (sound-to-sound), Transformation learning, Learn morphological classes, Example generation, Stochastic rule induction.]
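
Once the initial and target transcriptions are aligned symbol by symbol, transformation retrieval can be pictured as collecting the positions where the two rows disagree. The sketch below does this for the aligned "Dirk Van Den Bossche" rows shown above, with the primary-stress mark written as ' and ~ as the empty symbol. It only handles one-to-one symbol pairs; how the actual system groups symbols into longer input and output patterns is not specified on the poster, and the function name is an assumption.

```python
from typing import List, Tuple


def retrieve_transformations(initial: List[str], target: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, initial symbol, target symbol) wherever aligned rows differ."""
    if len(initial) != len(target):
        raise ValueError("aligned rows must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(initial, target)) if a != b]


# Aligned rows from the poster example (one symbol per alignment slot).
initial = "' d I r K _ f A n _ d E n _ ' b O . s $".split()
target = "' d i r k _ v A n _ d $ m _ ~ b O . s $".split()

for position, source, result in retrieve_transformations(initial, target):
    print(f"slot {position}: [{source}] -> [{result}]")
```

Each retrieved pair, together with the context features listed above, would then form one training example for the decision tree of the corresponding input pattern.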

In generation mode, the learned rules are applied to the initial G2P transcription of an unseen name, producing pronunciation variants with probabilities.
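
Generation mode can be illustrated with a small sketch, under strong simplifying assumptions: the decision trees are replaced by a plain lookup table that ignores context, each rule rewrites a single phoneme, and variants are enumerated exhaustively (which is only feasible for short inputs). The rule probabilities below are invented for illustration and are not taken from the poster.

```python
from itertools import product
from math import prod
from typing import Dict, List, Tuple

Rule = List[Tuple[str, float]]  # possible outputs for one input phoneme, with probabilities
EMPTY = "~"                     # output symbol meaning "delete this phoneme"


def generate_variants(initial: List[str], rules: Dict[str, Rule], n_best: int = 4):
    """Return the n_best most probable variants as (phoneme tuple, probability) pairs."""
    # A phoneme without a rule is copied unchanged with probability 1.
    options = [rules.get(p, [(p, 1.0)]) for p in initial]
    variants: Dict[Tuple[str, ...], float] = {}
    for combo in product(*options):
        phones = tuple(out for out, _ in combo if out != EMPTY)
        # Variant probability = product of the probabilities of the applied rules;
        # different rule combinations yielding the same variant are summed.
        variants[phones] = variants.get(phones, 0.0) + prod(pr for _, pr in combo)
    ranked = sorted(variants.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n_best]


# HYPOTHETICAL context-free rules with made-up probabilities.
toy_rules = {
    "f": [("f", 0.4), ("v", 0.6)],
    "n": [("n", 0.9), ("m", 0.1)],
}
for phones, probability in generate_variants(["f", "A", "n"], toy_rules):
    print(" ".join(phones), round(probability, 2))
```

Keeping only the top few variants mirrors the "4 P2P variants" systems evaluated in the experimental assessment.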

Experimental assessment
-----------------------


Incorporating NN1 language knowledge
• Including extra G2P transcriptions (acoustic model = AC-MONO)
    - Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names
    - Degradation for (DU,DU): reduced by selecting only one extra G2P
• Decoding with multilingual acoustic model
    - NAT speakers: loss for NAT names, boost for English names only
        - Dutch sounds not as well modeled as before
        - English better known than French?
        - English and Dutch sound inventories differ more than French and Dutch?
    - Foreign speakers: boost for both NN1 name origins
        - mother tongue sounds better modeled
• Plain multilingual G2P transcriptions bring no improvement

Table 2: Name Error Rate (%) for systems with G2P lexicons
(spkr,name)  System                                  DU    EN    FR    MO    TU
(DU,*)       AC-MONO + DUN G2P                       6.5   38.5  21.3  14.6  28.4
             AC-MONO + 4G2P (nativized)              7.2   22.7  9.9   9.5   17.2
             AC-MONO + G2P-selection (nativized)     6.5   20.8  7.2   9.0   18.1
             AC-MULTI + G2P-selection (nativized)    8.5   14.9  7.2   8.3   16.2
             AC-MULTI + G2P-selection (plain)        8.5   14.0  7.7   8.6   18.1
(*,DU)       AC-MONO + DUN G2P                       6.5   25.1  33.2  26.9  40.8
             AC-MONO + 4G2P (nativized)              7.2   22.8  32.2  27.0  40.6
             AC-MONO + G2P-selection (nativized)     6.5   22.8  31.1  25.3  38.5
             AC-MULTI + G2P-selection (nativized)    8.5   17.6  22.6  25.2  38.6
             AC-MULTI + G2P-selection (plain)        8.5   18.2  22.6  25.8  40.4
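
The poster does not spell out how the Name Error Rate is computed. A minimal sketch, assuming it is the percentage of name utterances for which the recognizer returns a name other than the one that was spoken, could look as follows. The function name and the toy data are illustrative only.

```python
from typing import Sequence


def name_error_rate(spoken: Sequence[str], recognized: Sequence[str]) -> float:
    """Percentage of utterances whose recognized name differs from the spoken name."""
    if len(spoken) != len(recognized):
        raise ValueError("need one recognition result per utterance")
    if not spoken:
        raise ValueError("cannot compute an error rate over zero utterances")
    errors = sum(ref != hyp for ref, hyp in zip(spoken, recognized))
    return 100.0 * errors / len(spoken)


# Toy example: one of four utterances is misrecognized.
print(name_error_rate(["Austin", "Roger", "Dirk", "Gent"],
                      ["Austin", "Rogier", "Dirk", "Gent"]))
```

In the tables, such a rate would be computed separately for every (speaker tongue, name origin) cell.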
Creating pronunciation variants
          • Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
          • Alternative P2Ps for (DU,NN1) and (NN1,DU) cells
             - create additional P2P that starts from NN1 G2P transcriptions
             - combine most probable variants generated by both P2P converters
          • P2P variants lead to significant improvements for all (speaker, name) cells
             - 10 to 25% relative for NAT + foreign names, 5 to 17% for foreign speakers

Table 3: Name Error Rate (%) for systems with P2P transcription variants
(spkr,name)   System                                    DU     EN     FR     MO     TU
(DU,*)        AC-MULTI + G2P-selection (nativized)      8.5   14.9    7.2    8.3   16.2
              + 4 P2P variants (baseline)               7.7   13.2    6.3    7.0   11.9
              + 4 P2P variants (alternative)            7.7   12.2    6.3    7.0   11.9
(*,DU)        AC-MULTI + G2P-selection (nativized)      8.5   17.6   22.6   25.2   38.6
              + 4 P2P variants (baseline)               7.7   17.2   19.9   24.0   35.2
              + 4 P2P variants (alternative)            7.7   16.4   18.8   24.0   35.2
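
The relative improvements can be recomputed from the Name Error Rates in Table 3. The sketch below is illustrative only: it assumes that "relative" means the reduction from the nativized G2P-selection row to the row with 4 alternative P2P variants, divided by the former. The poster does not state which rows were compared, and the names `NER`, `relative_reduction` and `reductions` are hypothetical.

```python
# Name Error Rates (%) copied from Table 3; columns are DU, EN, FR, MO, TU.
LANGS = ["DU", "EN", "FR", "MO", "TU"]
NER = {
    "(DU,*)": {
        "nativized": [8.5, 14.9, 7.2, 8.3, 16.2],
        "alternative": [7.7, 12.2, 6.3, 7.0, 11.9],
    },
    "(*,DU)": {
        "nativized": [8.5, 17.6, 22.6, 25.2, 38.6],
        "alternative": [7.7, 16.4, 18.8, 24.0, 35.2],
    },
}


def relative_reduction(before, after):
    """Relative error reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


def reductions(cell):
    """Per-language relative reduction for one (spkr,name) row group.

    Assumption: the reference system is the nativized G2P-selection row.
    """
    rows = NER[cell]
    return {
        lang: round(relative_reduction(b, a), 1)
        for lang, b, a in zip(LANGS, rows["nativized"], rows["alternative"])
    }


if __name__ == "__main__":
    for cell in NER:
        print(cell, reductions(cell))
```

Running the script prints one dictionary per row group, which can be set against the ranges quoted in the bullets above.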




Acknowledgments
--------------------------------------------------------------------------------

The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN
program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht,
Nuance and TeleAtlas.

References
--------------------------------------------------------------------------------

[1] B. Réveil, J.-P. Martens and B. D'hoore, How speaker tongue and name source language affect the automatic
recognition of spoken names, in Proc. InterSpeech 2009, Brighton, UK
[2] H. van den Heuvel, B. Réveil and J.-P. Martens, Pronunciation-based ASR for names, in Proc. InterSpeech
2009, Brighton, UK
[3] B. Réveil, J.-P. Martens and H. van den Heuvel, Improving proper name recognition by adding automatically
learned pronunciation variants to the lexicon, in Proc. LREC 2010, Valletta, Malta

				