					Mandarin-English Information (MEI):
 Investigating Translingual Speech Retrieval

                  Johns Hopkins University
          Center for Language and Speech Processing
                   Summer Workshop 2000
 Douglas W. Oard, Gina-Anne Levow, and Jianqiang Wang
          Institute for Advanced Computer Studies
            University of Maryland, College Park

  October 12, 2000        Queens College
             The MEI Team
• Senior Members
  Helen Meng          Chinese University of Hong Kong
  Erika Grams         Advanced Analytic Tools
  Sanjeev Khudanpur   Johns Hopkins University
  Gina-Anne Levow     University of Maryland
  Douglas Oard        University of Maryland
  Patrick Schone      US Department of Defense
  Hsin-Min Wang       Academia Sinica, Taiwan

• Students
  Berlin Chen         National Taiwan University
  Wai-Kit Lo          Chinese University of Hong Kong
  Karen Tang          Princeton University
  Jianqiang Wang      University of Maryland
                  Outline
• Motivation
• MEI system design
  – Query construction
  – Audio indexing
  – Evaluation
• Enhancements
  – Phonetic cognate matching
• Conclusion
Emerging Audio Search Applications
 • Internet audio
   – Compaq SpeechBot (7,819 hours on 10/9/00)
      • http://speechbot.research.compaq.com
 • Historical audio
   – National Gallery of the Spoken Word
      • http://www.ngsw.org
 • Personal audio
   – Informedia Experience-on-Demand
      • http://www.informedia.cs.cmu.edu/eod/
           Global Internet Audio
• Almost 2000 Internet-accessible radio and television stations
• [Pie chart: English vs. other languages; the two segments are labeled 529 and 1367]
• Source: www.real.com, Feb 2000
          Global Internet User Population
[Two pie charts: share of Internet users by language, 2000 and 2005. Legend: English, Chinese, Spanish, Japanese, German, French, Scandinavian, Italian, Dutch, Korean, Portuguese, Other]
Source: Global Reach
                  The Big Picture
• English Text Exemplar --> English-to-Mandarin Translation --> Retrieval Engine
• Mandarin Audio News Broadcasts --> Mandarin Audio Indexing --> Retrieval Engine
• Retrieval Engine --> Ranked list of Mandarin audio documents
Cross-Language SDR Challenges
• Query processing (translation)
  – Tokenization
  – Translation lexicon coverage
  – Selection among alternate translations

• Document processing (recognition)
  – Tokenization
  – Recognition lexicon coverage
  – Selection among alternate recognition hypotheses
      The Answer: A Preview
• Perform word-based speech recognition
  – Lexicon constraints greatly improve accuracy
• Perform phrase-based query translation
  – Minimizes translation ambiguity
• Convert both to character bigrams and match
  – Elegantly handles ambiguous term granularity
• Add evidence from proper name matching
  – Using syllable bigrams
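The bigram matching step above can be sketched as follows. This is a minimal illustration, not the MEI implementation: the function names `overlapping_bigrams` and `bigram_overlap` and the sample strings are assumptions made for the example.

```python
def overlapping_bigrams(units):
    """Return overlapping bigrams over a sequence of units.

    `units` may be a string of Chinese characters or a list of
    syllables; each bigram is returned as a tuple of two units.
    """
    units = list(units)
    return [(units[i], units[i + 1]) for i in range(len(units) - 1)]


def bigram_overlap(query_units, doc_units):
    """Count the distinct bigrams shared by a query and a document."""
    return len(set(overlapping_bigrams(query_units)) &
               set(overlapping_bigrams(doc_units)))


# Both sides are split the same way, so no word segmentation is
# needed and disagreements about word boundaries do not matter.
query = "克林顿"            # "Clinton" (hypothetical translated query term)
document = "美国总统克林顿"  # "US President Clinton" (hypothetical ASR output)

print(overlapping_bigrams(query))
print(bigram_overlap(query, document))
```

The same helper works on syllable lists, for example `overlapping_bigrams(["ne", "tan", "ya", "hu"])`, which is the form used for proper name matching.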
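System Architecture (diagram)
• Query path: English Exemplar (“President Bill Clinton and…”) --> Named Entity Tagging --> Term Selection --> Term Translation (with Bilingual Term List) --> Query Construction --> Mandarin IR System
• Document path: Mandarin Audio --> Speech Recognition --> Document Construction (with Story Boundaries) --> Mandarin IR System
• Output: Ranked List --> Evaluation against Relevance Judgments (000100010000010100) --> Mean Uninterpolated Average Precision
• Component sources labeled in the diagram: LDC, BBN, U Mass, Dragon, Cornell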
           Bilingual Term List
• Combination of
  – LDC English-Chinese bilingual term list
  – Chinese-English Translation Assistance File (CETA)
    [inverted]
         Total English Terms        199,444
         Total Translation Pairs    395,216

         Phrasal Terms              81,127
         Phrasal Translation Pairs 105,750
             Term           # translations
             human                       7
             right(s)                   30
             human rights                1
                  Query Term Selection
• Tagged named entities (BBN Identifinder)
            • Person:            “partners of Goldman, Sachs, & Co.”
            • Organization:      “UN Security Council”



• Dictionary-based “phrases”
    – translatable multi-word units
            • “Wall Street,” “best interests,” “guiding principles,” “human rights”
    – automatic tagging: greedy, left-to-right, max match

•      2
            filtering
    – Compared to an English background model
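The "greedy, left-to-right, max match" tagging of dictionary phrases can be sketched as below. This is an illustrative sketch only: the function name `greedy_max_match`, the `max_len` limit, and the tiny phrase set are assumptions, not details taken from the MEI system.

```python
def greedy_max_match(tokens, phrase_set, max_len=4):
    """Tag multi-word units greedily, left to right, longest match first."""
    result = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrase_set:
                result.append(candidate)
                i += n
                matched = True
                break
        if not matched:
            result.append(tokens[i])  # no phrase starts here: keep the word
            i += 1
    return result


# Hypothetical phrase dictionary built from translatable multi-word units.
phrases = {"human rights", "wall street", "guiding principles"}
tokens = "debate over human rights on wall street".split()
print(greedy_max_match(tokens, phrases))
```

Because the match is greedy, an early long match can block a better segmentation later in the sentence; that trade-off is inherent in the technique named on the slide.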
          Query Term Translation
• Named entities
  – if absent from dictionary, translate individual terms
     • e.g. “Security Council” vs. “First Bank of Siam”


• Numeric Expressions
  – special processing for digits
     • e.g. “12:30 pm, June 15, 1969”


• Remaining terms
  – Consult bilingual term list, lemmatize if necessary
     • e.g. “televised” translates as “television”
             Query Construction
• Unbalanced queries
  – Use all plausible translations for each term


• Balanced queries
  – Pseudo-term weight: average of translations’ weights


• Structured queries
  – Recompute pseudo-term weight from translations’ term
    frequency and document frequency
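The difference between balanced and structured queries can be made concrete with a toy weighting function. The sketch below is an assumption-laden illustration: `tf_idf` is a simple stand-in rather than the weighting used by the retrieval engine, and the names `postings`, `doc_tf`, `t_rare`, and `t_common` are hypothetical.

```python
import math


def tf_idf(tf, df, num_docs):
    """Toy tf-idf weight; stands in for the retrieval engine's real formula."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(num_docs / df)


def balanced_weight(translations, doc_tf, postings, num_docs):
    """Balanced query: average the weights of the individual translations."""
    weights = [
        tf_idf(doc_tf.get(t, 0), len(postings.get(t, ())), num_docs)
        for t in translations
    ]
    return sum(weights) / len(weights)


def structured_weight(translations, doc_tf, postings, num_docs):
    """Structured query: pool tf and df over all translations, weight once."""
    tf = sum(doc_tf.get(t, 0) for t in translations)
    docs_with_any = set().union(*(postings.get(t, set()) for t in translations))
    return tf_idf(tf, len(docs_with_any), num_docs)


# Toy collection of 10 documents; "t_rare" and "t_common" are two
# hypothetical translations of one English query term.
postings = {"t_rare": {1, 2}, "t_common": {2, 3, 4, 5, 6, 7, 8, 9}}
doc_tf = {"t_rare": 2}  # term frequencies in the document being scored
translations = ["t_rare", "t_common"]

print(balanced_weight(translations, doc_tf, postings, 10))
print(structured_weight(translations, doc_tf, postings, 10))
```

In this toy case the structured weight is lower, because pooling document frequencies makes the pseudo-term look common whenever any one of its translations is common.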
       Characteristics of Mandarin
• 400 syllables achieve full phonological coverage

• 6,800 GB characters achieve full textual coverage
  – Each pronounced as a single syllable

• Segmentation ambiguity
  – No space between words
  – Disagreement over “correct” segmentation
Alternative Scales for Terms
• Words
• Characters
• Syllables
        Which Scale for Terms?

Term: …Iraq...
              Evaluation Collection
                            Development Collection: TDT-2     Evaluation Collection: TDT-3
Time span                   Jan 98 to Jun 98                  Oct 98 to Dec 98
English text topic          17 topics, variable number        56 topics, variable number
exemplars (Associated       of exemplars                      of exemplars
Press, New York Times)
Mandarin audio broadcast    2265 manually segmented stories   3371 manually segmented stories
news (Voice of America)     (Mar 98 to Jun 98)

   Exhaustive relevance assessment based on event overlap
 President Bill Clinton and Chinese President Jiang Zemin engaged
in a spirited, televised debate Saturday over human rights and the
Tiananmen Square crackdown, and announced a string of agreements
on arms control, energy and environmental matters. There were no
announced breakthroughs on American human rights concerns,
including Tibet, but both leaders accentuated the positive …

Query by Example: English Newswire Exemplars (above) --> Mandarin Audio Stories (below)


美国总统克林顿的助手赞扬中国官员允许电视现场直播克林顿和江泽民在首脑会晤后举行
的联合记者招待会。。特别是一九八九镇压民主运动的决定。他表示镇压天安门民主运动
是错误的,他还批评了中国对西藏精神领袖达 国家安全事务助理伯格表示,这次直播让中国
人第一次在种公开的论坛上听到围绕敏感的人权问题的讨论。在记者招待会上 …
             Abstract Task Model
• Inputs: American English Text Exemplar, Mandarin Chinese Broadcast News
• Process: Cross-Language Speech Retrieval
• Output: Ranked List of News Stories
Evaluation of Ranked Lists
        Relevance Judgments

VOA 0427.22               Relevant
VOA 0521.14                   Not
VOA 0604.39                   Not
VOA 0419.12               Relevant
VOA 0513.17               Relevant
VOA 0527.13                   Not
    …
[Graph: Recall-Precision Graph; x-axis: Recall (0.0 to 1.0), y-axis: Interpolated Precision (0.0 to 1.0)]
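The deck's summary measure, mean uninterpolated average precision, can be computed from a ranked list of relevance judgments as sketched below. The function names are assumptions, and the example uses only the six judgments shown on the slide; the real list continues, so the number printed here is illustrative rather than a reported result.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Uninterpolated average precision for one ranked list.

    ranked_relevance: booleans, True where the document at that rank
        is relevant.
    total_relevant: number of relevant documents in the collection;
        defaults to the number of relevant documents in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def mean_average_precision(ranked_lists):
    """Average the per-list average precision over exemplars or topics."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Relevant, Not, Not, Relevant, Relevant, Not (the six judgments shown).
ranked = [True, False, False, True, True, False]
print(round(average_precision(ranked), 3))
```

Relevant documents that never appear in the ranked list contribute zero precision, which is why `total_relevant` should be passed when the list is truncated.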
Variation Across Exemplars
[Graph: recall-precision curves for individual exemplars; x-axis: Recall (0.0 to 1.0), y-axis: Interpolated Precision (0.0 to 1.0)]
Average Across Exemplars
[Graph: x-axis: Recall (0.0 to 1.0), y-axis: Mean Interpolated Precision (0.0 to 1.0); labeled value: 0.353]
Variation Across Topics
[Bar chart: Mean Uninterpolated Average Precision (0.0 to 1.0) by Topic]
       Translation Granularity
• Phrases beat words
• Three sources
  – Translation lexicon
  – Named entities
  – Numeric expressions
[Bar chart: Mean Average Precision (0 to 0.6) by Strategy: Words, Phrases, Phrases + NE/NUMEX]
   Condition: TDT-2, 12 exemplars, word-based retrieval
   Query Translation Strategies
• Structured queries
   – Treat translation alternatives as synonyms
   – Inquery #syn() operator
• Balanced translation
   – Distribute probability mass over translation alternatives
   – Inquery #sum() operator
[Bar chart: Mean Average Precision (0 to 0.6) by Strategy: Structured Queries, Balanced Translation]
   TDT-2, phrase-based translation, word-based retrieval
            Retrieval Granularity
• Character bigrams are best
• Words give mixed results
• Syllable bigrams do poorly
[Bar chart: Mean Average Precision (0 to 0.6) for Words, Char, and Syllable indexing, grouped by Exemplars: Xinhua Mandarin Newswire, New York Times Manual Translation]
   TDT-2
               Improving Translation
• Two types of errors
   – Insertion (wrong sense)
   – Deletion (missing)
• Both hurt
[Bar chart: Mean Average Precision (0 to 0.6) by Strategy: Automatic, Insertions Corrected, Deletions Corrected, Both Corrected]
   TDT-2, topic 1, exemplar 1, word-based translation, word-based retrieval
Untranslatable Terms
 Terms # (by token) # (by type)
  total   87,004      12,402
  OOV      3,028       1,122

   Term         Occurrences
   suharto      97
   netanyahu    88
   starr        62
   arafat       50
   bjp          45
   vajpayee     44
   estrada      44
   ….
   hsu          19
   zemin        7
Key Idea: Phonetic Cognate Matching
• Kosovo: /ke1-suo3-wo4/, /ke1-suo3-fo2/, /ke1-suo3-fu1/, /ke1-suo3-fu2/
• English Query Exemplar: “…Kosovo…”
• Mandarin Audio Document: “…/ke-suo-fo/…”
• Sound alike --> match in phonetic space?
 Cross-Language Phonetic Matching
• Example: christopher
• Learned English pronunciation (based on PRONLEX lexicon)
   – Transformation error-driven learning
   – 82% phoneme / 45% word accuracy
   – English phones: /kk rr ih ss tt aa ff er/
• Syllable nuclei insertion
   – Hand-crafted transformation rules
   – Augmented English phones: /kk ax rr ih ss ax tt aa ff er/
• Cross-language phone mapping (based on list of transliterations)
   – Transformation error-driven learning
   – Finite state, trained on 4,800 pairs
   – Mandarin “phones”: /k e l i s i t uo f u/
• Phone lattice generation and N-best hypothesis generation
   – Syllable bigram language model, reused from an ASR system
   – One-best syllables: /ji li si te fu/
• Compare to /ke li si tuo fu/ (most common transliteration)
    Multiscale Query Construction
                  English Exemplar Terms
Israeli <Ph>Prime Minister</Ph> <NE>Benjamin Netanyahu</NE>

           Mandarin Words and Syllable Bigrams
                                       ne-tan tan-ya ya-hu


         Mandarin Character and Syllable Bigrams
                                       ne-tan tan-ya ya-hu


                 Mandarin Syllable Bigrams
    yi-se se-lie shou-xiang ben-jie jie-ming ne-tan tan-ya ya-hu
 Cross-Language Phonetic Matching
• Small improvement
   – Not statistically significant
• Character bigrams are best
   – Form a unified index
      • Character and syllable bigrams
   – Translate words if possible
      • Then form character bigrams
   – Otherwise translate syllables
      • Then form syllable bigrams
[Bar chart: Mean Average Precision (0 to 0.6), no CLPM vs. CLPM, by Indexing Terms: Word, Char, Syllable]
   TDT-2, phrase-based translation
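The fallback described above (translate words where possible and form character bigrams, otherwise fall back to syllables and form syllable bigrams) might look like the sketch below. The `lexicon` and `transliterations` tables are hypothetical stand-ins for the bilingual term list and the phonetic matching output, and the function names are assumptions.

```python
def bigrams(units, sep=""):
    """Join each pair of adjacent units into one index term."""
    units = list(units)
    return [sep.join(pair) for pair in zip(units, units[1:])]


def unified_index_terms(query_terms, lexicon, transliterations):
    """Build query terms for an index holding both bigram types.

    Terms found in `lexicon` become character bigrams; terms that are
    only in `transliterations` become hyphenated syllable bigrams.
    Bigrams are formed inside each term, never across terms.
    """
    terms = []
    for term in query_terms:
        if term in lexicon:
            terms.extend(bigrams(lexicon[term]))
        elif term in transliterations:
            terms.extend(bigrams(transliterations[term], sep="-"))
    return terms


# Hypothetical lookup tables for the slide's running example.
lexicon = {"Israeli": "以色列", "Prime Minister": "首相"}
transliterations = {"Netanyahu": ["ne", "tan", "ya", "hu"]}

query_terms = ["Israeli", "Prime Minister", "Netanyahu"]
print(unified_index_terms(query_terms, lexicon, transliterations))
```

A translation that is a single character yields no bigram in this sketch; a fuller version would need a rule for such terms.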
                  Summary Results
[Bar chart: Mean Average Precision (axis 0.2 to 0.8) on TDT2 and TDT3 for: Words; Character Bigrams; Character 1+2+3-grams; Words + CLPM; Character Bigrams + CLPM; Syllable Bigrams + CLPM; Syllable Lattice]
   Phrase-based translation
                  Main Results
[Bar chart: Mean Average Precision (axis 0.4 to 0.6) on TDT2 and TDT3 for: Words; Character Bigrams; Character Bigrams + CLPM]
                   Conclusion
• Match term extent to purpose
  –   Words for speech recognition
  –   Phrases for translation
  –   Character bigrams for word-based retrieval
  –   Syllable bigrams for phonetic cognate matching

• Phonetic cognate matching appears promising
  – New architecture based on syllable indexing

• Some variation across collections
  – TDT Mandarin audio collections are quite small
                  Future Work
• Phonetic cognate matching
  – More training data
  – Learn character and syllable weights (unified index)
  – Separate character and syllable indices + merging

• Blind relevance feedback
  – Pre-translation exemplar expansion (enrich vocabulary)
  – Post-translation query expansion (translation errors)
  – Post-recognition document expansion (ASR errors)

• Use of acoustic scores in the syllable lattice
   For More Information


http://www.clsp.jhu.edu/ws00/groups/mei/
Backup Slides
            Syllable Lattice Construction
[Diagram: VOA Audio and Dragon Syllables --> Forced Alignment (Baseline Acoustic Models) --> Speaker Adaptation --> Speaker-Adapted Acoustic Models --> Syllable Recognition (1998 Xinhua Language Models) --> MEI Syllables]



       •   Forced alignment with Dragon’s output for each document
       •   Blind speaker adaptation with Dragon’s syllables
       •   Syllable trigram language model
       •   Syllable accuracy: 90.0%
 Four-Stage Backoff Translation
• Stemming backoff
  – Porter Stemmer
• Stages, tried in order
  – Look up term directly in bilingual term list
  – Look up stemmed term
  – Look up term in stemmed list
  – Look up stemmed term in stemmed list
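The four lookups can be sketched as a simple cascade that stops at the first stage returning a translation. This is a sketch under stated assumptions: `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, and `build_stemmed_list`, `backoff_translate`, and the two-entry term list are hypothetical.

```python
def toy_stem(word):
    """Crude suffix stripper; a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ion", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_stemmed_list(term_list, stem=toy_stem):
    """Index the bilingual term list by the stem of each English entry."""
    stemmed = {}
    for source, translations in term_list.items():
        stemmed.setdefault(stem(source), []).extend(translations)
    return stemmed


def backoff_translate(term, term_list, stemmed_list, stem=toy_stem):
    """Return (stage number, translations) from the first stage that hits."""
    stemmed_term = stem(term)
    stages = [
        term_list.get(term),             # 1: term in term list
        term_list.get(stemmed_term),     # 2: stemmed term in term list
        stemmed_list.get(term),          # 3: term in stemmed list
        stemmed_list.get(stemmed_term),  # 4: stemmed term in stemmed list
    ]
    for number, translations in enumerate(stages, start=1):
        if translations:
            return number, translations
    return None, []


# Hypothetical two-entry term list.
term_list = {"television": ["电视"], "debate": ["辩论"]}
stemmed_list = build_stemmed_list(term_list)

print(backoff_translate("television", term_list, stemmed_list))
print(backoff_translate("debates", term_list, stemmed_list))
print(backoff_translate("televised", term_list, stemmed_list))
```

With this toy stemmer, "televised" is only found at the fourth stage, which mirrors the earlier example of "televised" translating as "television".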
         Manual Term Selection
• Uses far fewer terms
  – Avg. 4 per query with the term number constraint
  – Avg. 7 per query without the constraint
• Can’t outperform χ² term selection
  – Mean average precision:
     •   manual with constraint: 0.2388
     •   manual without constraint: 0.2686
     •   automatic χ²: 0.2939
     •   difference is significant only between the two manual selections
• Suggests selecting terms from other sources
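For readers unfamiliar with χ² term selection, one common formulation scores each exemplar term with a 2x2 contingency test against background counts. The slides do not give the exact procedure used in MEI, so everything below is an assumption: the table layout, the 3.84 threshold (the 5% critical value for one degree of freedom), the function names, and the toy counts.

```python
from collections import Counter

CRITICAL_VALUE = 3.84  # chi-squared, 1 degree of freedom, p = 0.05


def chi_squared(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_terms(exemplar_tokens, background_counts, threshold=CRITICAL_VALUE):
    """Keep terms that are significantly over-represented in the exemplar."""
    counts = Counter(exemplar_tokens)
    doc_total = sum(counts.values())
    bg_total = sum(background_counts.values())
    selected = []
    for term, a in counts.items():
        c = background_counts.get(term, 0)
        b, d = doc_total - a, bg_total - c
        # Compare relative frequencies without dividing.
        over_represented = a * bg_total > c * doc_total
        if over_represented and chi_squared(a, b, c, d) > threshold:
            selected.append(term)
    return selected


# Toy counts: "the" is equally frequent in both, "tibet" is not.
exemplar = ["the"] * 5 + ["tibet"] * 5 + ["other"] * 90
background = {"the": 5000, "tibet": 10, "other": 90000, "rest": 4990}
print(select_terms(exemplar, background))
```

Function words drop out because their relative frequency matches the background, while topical terms such as names stand out.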
      Supporting the Search Process
[Diagram: Source Selection --> Query Formulation --> Query --> Search (IR System) --> Ranked List --> Selection --> Document --> Examination --> Document --> Delivery. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Stage labels: Predict, Nominate, Choose]
				