					Graph-based Bilingual Phrase
  Sense Disambiguation for
Statistical Machine Translation
       Mamoru Komachi
      mamoru-k@is.naist.jp
         2008-06-04
                   Background
   Success of supervised ML methods depends
    on annotated corpus
       Hard to maintain
   Weakly supervised method requires only
    small amount of tagged data
       Can reduce amount of human effort
              Remaining problems
   WSD is crucial to weakly supervised method
       Cf. semantic drift
   Parallel corpora (and dictionaries) may help
    disambiguate word senses
       WSD models in SMT systems gain much
        attention (Carpuat and Wu, 2007)
                     WSD and SMT
   Improving Statistical Machine Translation
    using Word Sense Disambiguation (Carpuat
    and Wu, EMNLP 2007)
       SMT is known to suffer from inaccurate lexical
        choice (based on senseval style sense inventory)
            Domain adaptation problem
       Input is typically a word
       Limited contextual features
    Phrase-based WSD models for SMT
   Sense annotations are derived from phrase
    alignment learned during SMT training
        WSD senses are from the SMT phrasal
         translation lexicon “phrase table”
   Not only words but also phrases are to be
    disambiguated
   Supervised WSD (an ensemble of naïve Bayes,
    maximum entropy, boosting, and Kernel PCA-based models)
Phrase table is highly ambiguous
   Phrase table constructed from NTCIR-7 J-E
    parallel corpus
       0.5~1GB (in gzip format)
       2.53 candidates per phrase (3.24 candidates
        per phrase for phrases shorter than 5 words)
            Includes function words as well
   Example: "Plant" has 120 translations in the phrase table!
       Plant ||| 工場
       Plant ||| 植物
       Plant ||| 設備
       Plant ||| 発電 プラント
       Plant ||| 工場 内
       Plant ||| 制御 対象
       Plant ||| 動植物
       Plant ||| 供給 プラント
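
The ambiguity figures above (candidates per phrase) can be reproduced in spirit with a few lines of Python. This is a minimal sketch, assuming a `source ||| target` line layout like the entries shown; the sample lines, `load_candidates`, and `average_candidates` are illustrative names and data, not part of the original system, and any fields after the target side are simply ignored.

```python
from collections import defaultdict

# Toy entries in the "source ||| target" layout shown above (illustrative data).
PHRASE_TABLE_LINES = [
    "Plant ||| 工場",
    "Plant ||| 植物",
    "Plant ||| 設備",
    "Plant ||| 工場 内",
    "Mill ||| 工場",
]

def load_candidates(lines):
    """Map each source phrase to the set of its target-side candidates."""
    table = defaultdict(set)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 2:
            continue  # skip malformed lines
        source, target = fields[0], fields[1]
        table[source].add(target)
    return table

def average_candidates(table):
    """Average number of translation candidates per source phrase."""
    return sum(len(targets) for targets in table.values()) / len(table)

candidates = load_candidates(PHRASE_TABLE_LINES)
print(len(candidates["Plant"]))        # candidates for "Plant" in the toy data
print(average_candidates(candidates))  # average over all source phrases
```

Every distinct target for a source phrase is a "sense" the disambiguation model has to choose among, which is why a phrase with many candidates is hard.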
                     Motivation
   Propose a novel graph-based approach to
    phrase sense disambiguation
       Can exploit bilingual contextual patterns
   Evaluate phrase sense disambiguation on
    SMT framework
        Monolingual bootstrapping
   Pioneered by (Yarowsky, 1995)
       Learn decision lists from a small set of seed
        instances (input: instance I, output: classifier)
       ‘One sense per discourse’ constraint
                          Bootstrapping
           Iteratively conduct pattern induction and instance
            extraction starting from seed instances
           Can expand a small set of seed instances

           Example (#: slot):
              Instance            Query log (corpus)                  Contextual pattern
              vaio                Compare vaio laptop                 Compare # laptop
              Toshiba satellite   Compare toshiba satellite laptop    Compare # laptop
              HP xb3000           Compare HP xb3000 laptop            Compare # laptop
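
One round of the bootstrapping loop on the query-log example can be sketched as follows. This is a simplified illustration, assuming plain string matching and a single `#` slot per pattern; `induce_patterns` and `extract_instances` are hypothetical helper names, and a real system would also score and filter patterns and instances.

```python
QUERY_LOG = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]

def induce_patterns(instances, corpus):
    """Pattern induction: replace a known instance with the slot marker '#'."""
    patterns = set()
    for sentence in corpus:
        for instance in instances:
            if instance in sentence:
                patterns.add(sentence.replace(instance, "#", 1))
    return patterns

def extract_instances(patterns, corpus):
    """Instance extraction: whatever fills the slot of a matching pattern."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split("#", 1)
        for sentence in corpus:
            long_enough = len(sentence) > len(prefix) + len(suffix)
            if long_enough and sentence.startswith(prefix) and sentence.endswith(suffix):
                found.add(sentence[len(prefix):len(sentence) - len(suffix)])
    return found

seeds = {"vaio"}
patterns = induce_patterns(seeds, QUERY_LOG)
instances = extract_instances(patterns, QUERY_LOG)
print(patterns)
print(instances)
```

Starting from the single seed `vaio`, the induced pattern `Compare # laptop` pulls in the other two product names; repeating the two steps is what lets a small seed set grow.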
             Bilingual bootstrapping
   Word Translation Disambiguation Using
    Bilingual Bootstrapping (Li and Li, ACL-2002)

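    [Figure: English words (…, Mill, Plant, Vegetable, …) and their Japanese translation candidates (…, 工場, 植物, …); a corpus on each language side feeds a shared WSD classifier]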
            Formalization of bootstrapping
           Score vector of seed instance: $i_0 = (0, \ldots, 1, \ldots, 0)^T$
           Pattern-instance matrix $P$: the $(p,i)$ element is the co-occurrence of pattern $p$ and instance $i$
             $$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$
           Iterate
             $$p_n = P\, i_n, \qquad i_{n+1} = P^T p_n$$
           With the instance similarity matrix $A = P^T P$, applying this step recursively gives $i_n = A^n i_0$
           Output instances ranked by final score when the stopping criterion is met
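
The iteration can be checked numerically on the 4x4 example matrix. The sketch below uses plain Python lists; the choice of the first instance as seed and the fixed three iterations are assumptions made only for illustration, since the slides do not state a stopping criterion.

```python
# Pattern-instance matrix P from the example (rows: patterns, columns: instances).
P = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]

def transpose(M):
    return [list(column) for column in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    columns = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in X]

def bootstrap(P, i0, n_iter):
    """Alternate p_n = P i_n and i_{n+1} = P^T p_n for n_iter rounds."""
    Pt = transpose(P)
    i = list(i0)
    for _ in range(n_iter):
        p = matvec(P, i)   # pattern scores
        i = matvec(Pt, p)  # instance scores
    return i

i0 = [1, 0, 0, 0]              # seed: the first instance (assumed)
N_ITER = 3                     # fixed number of rounds (assumed)
scores = bootstrap(P, i0, N_ITER)

# Closed form: i_n = A^n i_0 with A = P^T P.
A = matmul(transpose(P), P)
closed_form = list(i0)
for _ in range(N_ITER):
    closed_form = matvec(A, closed_form)

ranking = sorted(range(len(scores)), key=lambda k: -scores[k])
print(scores, closed_form, ranking)
```

On this toy matrix the seed instance does not stay on top of the ranking after a few rounds: the instance that co-occurs with the most patterns overtakes it, which is one way to picture the semantic drift mentioned earlier.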
             Bilingual phrase sense
                 disambiguation
   The (p,i) element of a pattern-instance
    matrix P is a co-occurrence between pattern
    p and instance i
       p: contextual features of both language sides,
        with phrase alignment from GIZA++
       i: candidate (monolingual) phrase to
        disambiguate
       $A = P^T P$
   Similarity is given by the regularized
    Laplacian kernel
             Regularized Laplacian kernel
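            Predict final sense by k-NN given the target
             instance
            Laplacian of graph $G$ ($A$: adjacency matrix):
              $$L = D - A$$
            $i$-th diagonal element of the diagonal degree matrix $D$:
              $$D(i,i) = \sum_j A(i,j)$$
            Regularized Laplacian matrix $R_\beta$ ($\beta$: diffusion coefficient):
              $$R_\beta = \sum_{n=0}^{\infty} \beta^n (-L)^n = (I + \beta L)^{-1}$$
            Obtained from the Neumann kernel matrix
              $$K_\beta = A \sum_{n=0}^{\infty} \beta^n A^n$$
            by using $-L$ in place of $A$ and dropping the leading factor $A$ on the right-hand side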
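
The kernel definition can be illustrated on a small similarity matrix. This is a sketch under stated assumptions: the matrix `A`, the value of `BETA`, the truncation point `N_TERMS`, and the helper `nearest_neighbors` are all chosen for illustration, and the infinite sum is approximated by a truncated series, which only converges when β times the largest eigenvalue of L is below 1.

```python
# Toy symmetric similarity matrix between four instances (illustrative values).
A = [
    [2, 1, 0, 1],
    [1, 3, 1, 2],
    [0, 1, 1, 1],
    [1, 2, 1, 2],
]
BETA = 0.05    # diffusion coefficient (assumed; small enough for convergence here)
N_TERMS = 100  # truncation point of the infinite sum (assumed)

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    columns = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in X]

def laplacian(A):
    """L = D - A, where D(i,i) is the sum of row i of A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
            for i in range(n)]

def regularized_laplacian(A, beta, n_terms):
    """Truncated series: sum over n = 0..n_terms of beta^n (-L)^n."""
    step = [[-beta * x for x in row] for row in laplacian(A)]
    R = identity(len(A))
    term = identity(len(A))
    for _ in range(n_terms):
        term = matmul(term, step)  # next power of (-beta L)
        R = [[r + t for r, t in zip(r_row, t_row)] for r_row, t_row in zip(R, term)]
    return R

def nearest_neighbors(R, target, k):
    """Indices of the k instances most similar to `target` under kernel R."""
    others = [j for j in range(len(R)) if j != target]
    return sorted(others, key=lambda j: -R[target][j])[:k]

L = laplacian(A)
R = regularized_laplacian(A, BETA, N_TERMS)

# Sanity check of the closed form: (I + beta L) R should be close to I.
n = len(A)
I_plus_beta_L = [[(1.0 if i == j else 0.0) + BETA * L[i][j] for j in range(n)]
                 for i in range(n)]
check = matmul(I_plus_beta_L, R)

neighbors = nearest_neighbors(R, target=0, k=2)
print(neighbors)
```

For k-NN prediction, the senses of the returned neighbours would then be combined (for example by majority vote) to label the target instance; that voting step is left out here because the slides do not specify it.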
    NTCIR-7 Patent Translation Task
   Large-scale Japanese-English parallel corpus
        2M sentences (comparable to A-E, C-E MT)
        Mainly technical documents
   Timeline
        2008.01: dry run
        2008.05: formal run
        2008.12: final meeting
               NAIST-NTT at NTCIR-7
   Bilingual dictionary extracted (solely) from
    Wikipedia
       Used langlinks from Wikipedia DB
       1:n translations are expanded to n pairs of
        bilingual phrase (en, ja)
       Extracted ~200,000 pairs
            12,000 pairs appear in the training corpus (8.8%)
            44.7% of word tokens are covered by the automatically
             constructed bilingual lexicon (GIZA++)
            Learned 1,193 (0.6%) novel translations
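
The 1:n expansion step can be sketched as below. The input dictionary and the names `expand_pairs` and `coverage` are hypothetical, chosen only to show the shape of the data; the actual extraction read langlinks from the Wikipedia DB, whose schema is not shown in these slides.

```python
# Hypothetical 1:n entries: one English title mapped to several Japanese titles.
LANGLINKS = {
    "plant": ["植物", "工場"],
    "mill": ["工場"],
}

def expand_pairs(langlinks):
    """Expand each 1:n entry into n bilingual phrase pairs (en, ja)."""
    return [(en, ja) for en, ja_titles in langlinks.items() for ja in ja_titles]

def coverage(pairs, seen_in_corpus):
    """Fraction of extracted pairs that also occur in the training corpus."""
    hits = sum(1 for pair in pairs if pair in seen_in_corpus)
    return hits / len(pairs)

pairs = expand_pairs(LANGLINKS)
# Pairs assumed to have been observed in the parallel training data (toy set).
seen = {("plant", "工場")}
print(pairs)
print(coverage(pairs, seen))
```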
        Proposed method (but not yet
             finished training…)
   Extract translation pairs relevant to the given
    domain (patent translation task)
   Construct a pattern-instance matrix P
       Pattern features: bag-of-words feature and link
        features extracted from Wikipedia ja-en abstract
       Instance: translation pair (en, ja)
            Seed instances: 40 translation pairs from the target
             domain
       Apply Laplacian kernel
        Evaluation (BLEU score)
                                          -Wikipedia   +Wikipedia
          Fmlrun-int.en.out               28.24        27.28
          Fmlrun-int.ja.out               26.39        26.48
          Fmlrun-int.ja.out.recased       25.34        25.47
          Fmlrun-int.ja-out.detokenized   20.38        20.52




• E-J translation (en) gets worse with the Wikipedia dictionary
• J-E translation (ja) performs slightly better with the Wikipedia dictionary than without
                Future work
   Implement bilingual phrase sense
    disambiguation
   Evaluate this method against IWSLT 2006
    J-E/E-J and NTCIR-7 J-E/E-J datasets
               Future work(2)
   Automatic extraction of biomedical lexicon
    starting from life-science dictionary (mining
    from MedLine, etc…)
   Summarization (Harendra’s work)…

				