Named Entity Transliteration with Comparable Corpora
Richard Sproat, Tao Tao, ChengXiang Zhai
Coling-ACL 2006

Background
Names referring to the same person or location are likely to co-occur in comparable texts.
Multilingual NE identification is an important task in processing multilingual texts.
This paper focuses on finding NE transliteration pairs that appear in comparable corpora.

Overview
Preprocessing:
  English NE recognition
  Chinese transliteration candidate extraction
Scoring features:
  Pronunciation
  Frequency
Advanced steps:
  Score propagation

Preprocessing
English NE recognition:
  The NE recognizer described in (Li et al., 2004), which is based on the SNoW machine learning toolkit (Carlson et al., 1999), is used.
Chinese candidate extraction:
  A list of 495 transliterating characters is used.
  A sequence of three or more characters from this list is taken as a possible name.
  If the character “‧” occurs, at least one character to its left and one to its right are collected, even if they are not in the list of transliterating characters.

Scoring by Pronunciation
A source-channel model is adopted:
  P(e|c) = P(e'|c') = ∏_i P(e'_i | c'_i)
Each c'_i is fixed to be a Chinese syllable.
Each e'_i ranges over all possible subsequences of the English phone string.
Pronunciations for English words are obtained using the Festival text-to-speech system (Taylor et al., 1998).

Training
Training data: a small list of 721 names in Roman script and their Chinese equivalents.
English-Chinese pairs are aligned using the alignment algorithm from (Kruskal, 1999) and a hand-derived set of 21 rules of thumb.
Good-Turing estimation is used to estimate probabilities for unseen correspondences.
To filter implausible transliteration pairs:
  For an English phone span to correspond to a Chinese syllable, the initial phone of the English span must have been seen in the training data as corresponding to the initial of the Chinese syllable at least some minimum number of times.
  The minimum is 4 for consonant-initial syllables.
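
As an illustration of the source-channel score above, the following is a minimal sketch, not the authors' implementation. It assumes the score is taken over the best segmentation of the English phone string into one contiguous span per Chinese syllable, found by dynamic programming. The phone labels, the toy table TOY_PROBS, and the constant floor that stands in for Good-Turing smoothing are all hypothetical.

```python
import math
from functools import lru_cache

# Hypothetical toy table: P(English phone span | Chinese syllable).
# Real values would be estimated from the aligned training names.
TOY_PROBS = {
    (("k",), "ke"): 0.5,
    (("k", "l"), "ke"): 0.1,
    (("l", "ih", "n"), "lin"): 0.6,
    (("ih", "n"), "lin"): 0.2,
    (("t", "ax", "n"), "dun"): 0.4,
}

PHONES = ["k", "l", "ih", "n", "t", "ax", "n"]  # made-up phone labels for "Clinton"
SYLLABLES = ["ke", "lin", "dun"]                # pinyin for the Chinese candidate


def best_alignment_score(phones, syllables, span_prob, floor=1e-6):
    """Maximum, over segmentations of `phones` into len(syllables) contiguous
    non-empty spans, of the product of P(span_i | syllable_i).

    `floor` is a crude stand-in for smoothed probabilities of unseen pairs.
    """
    phones = tuple(phones)
    n, m = len(phones), len(syllables)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best log-probability of aligning phones[i:] with syllables[j:].
        if j == m:
            return 0.0 if i == n else -math.inf
        remaining = m - j - 1  # later syllables each need at least one phone
        result = -math.inf
        for end in range(i + 1, n - remaining + 1):
            p = span_prob.get((phones[i:end], syllables[j]), floor)
            result = max(result, math.log(p) + best(end, j + 1))
        return result

    return math.exp(best(0, 0))


if __name__ == "__main__":
    print(best_alignment_score(PHONES, SYLLABLES, TOY_PROBS))
```

Working in log space avoids underflow on long names; a candidate with more syllables than the English name has phones gets a score of 0.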

Scoring by Frequency
Names of the same entity in different languages often have correlated frequency patterns due to common triggers such as a major event.
Scoring steps:
1. Pool all documents from a single day to form a large pseudo-document.
2. For each transliteration candidate (both Chinese and English), compute its frequency in each pseudo-document to obtain a raw frequency vector.
3. Normalize the raw frequency vector so that it becomes a frequency distribution over all the time points (days).
4. Compute the similarity between two distribution vectors using the Pearson correlation coefficient, cosine (Salton and McGill, 1983), or Jensen-Shannon divergence (Lin, 1991). Pearson performs best.

Score Propagation
Document pairs that contain many plausible transliteration pairs should be viewed as more plausible document pairs.
In such a situation, the putative transliteration pairs they contain should also be trusted more.
A co-occurrence relation graph is built in which an edge joins (e_i, c_i, w_i) and (e_j, c_j, w_j) iff (e_i, c_i) and (e_j, c_j) co-occur in some document pair (E_t, C_t).
P(j|i) is estimated in two different ways:
  From the number of co-occurrences in the whole collection.
  With a mutual information-based method.

Evaluation
Test corpus: one day's worth of comparable news articles (234 Chinese stories and 322 English stories) from the Xinhua News Agency.
About 600 English names were extracted.
A small number of English names do not seem to have any standard transliteration according to the resources that the authors consulted.
The authors ended up with judgments for about 490 of the 600 English names.
627 Chinese candidates were generated.
The accuracy of the ranked list is measured by Mean Reciprocal Rank (MRR).
Some answers (about 20%) are not in the Chinese candidate set because:
1. The answer does not occur in the Chinese news articles examined.
2. The answer is there, but the candidate generation method missed it.
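
To make the frequency scoring and the MRR evaluation concrete, here is a minimal sketch under stated assumptions rather than the authors' code. It normalizes daily counts into distributions, ranks Chinese candidates for each English name by Pearson correlation, and computes MRR both over all judged names and over the subset whose answer is in the candidate set. All names, counts, and the pinyin stand-ins for Chinese strings are invented for illustration; a missing answer is assumed to contribute a reciprocal rank of 0.

```python
import math


def normalize(counts):
    """Turn a raw per-day count vector into a distribution over days."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_candidates(en_counts, zh_counts):
    """Rank Chinese candidates by correlation with one English name."""
    en = normalize(en_counts)
    scores = {c: pearson(en, normalize(v)) for c, v in zh_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)


def mean_reciprocal_rank(rankings, answers):
    """Average of 1/rank of the correct answer; 0 when the answer is absent."""
    total = 0.0
    for name, answer in answers.items():
        ranked = rankings.get(name, [])
        if answer in ranked:
            total += 1.0 / (ranked.index(answer) + 1)
    return total / len(answers)


# Hypothetical daily counts over five days (pinyin stands in for Chinese strings).
EN_COUNTS = {"Clinton": [0, 5, 1, 0, 0], "Baghdad": [2, 0, 0, 4, 0], "Smith": [0, 0, 2, 0, 2]}
ZH_COUNTS = {"kelindun": [0, 4, 2, 0, 0], "bageda": [3, 0, 0, 3, 0]}
ANSWERS = {"Clinton": "kelindun", "Baghdad": "bageda", "Smith": "shimisi"}

RANKINGS = {name: rank_candidates(counts, ZH_COUNTS) for name, counts in EN_COUNTS.items()}
MRR_ALL = mean_reciprocal_rank(RANKINGS, ANSWERS)
CORE_ANSWERS = {n: a for n, a in ANSWERS.items() if a in ZH_COUNTS}
MRR_CORE = mean_reciprocal_rank(RANKINGS, CORE_ANSWERS)

if __name__ == "__main__":
    print(RANKINGS, MRR_ALL, MRR_CORE)
```

In this toy data the answer for "Smith" is missing from the candidate set, so the MRR over all names is lower than the MRR over the covered subset, which mirrors the two figures reported in the evaluation.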
The MRR for the subset of English names whose transliteration answers are in the candidate list is also computed.

Evaluation
Two strategies are tried for combining the two methods:
1. Using the phonetic model to filter out (clearly impossible) candidates and then using the frequency correlation method to rank the remaining candidates.
2. Averaging the scores of the two methods.

Error Analysis
To further understand the upper bound of the method, the missing correct answers are manually added to the candidate set, and all the methods are applied to rank this augmented set of candidates.

Experiments on score propagation
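
The slides do not state the update rule used for score propagation. As a rough sketch only, the following assumes a simple interpolation, w_i <- alpha * w_i + (1 - alpha) * sum over j != i of w_j * P(j|i), with P(j|i) estimated from co-occurrence counts (the first of the two estimates mentioned earlier). It treats w_i as the current score of pair (e_i, c_i). The value of alpha, the pair names, scores, and counts are all hypothetical.

```python
def transition_probs(cooc):
    """Estimate P(j|i) from co-occurrence counts.

    cooc[i][j] is the number of document pairs in which transliteration
    pairs i and j both occur (assumed symmetric).
    """
    probs = {}
    for i, row in cooc.items():
        total = sum(c for j, c in row.items() if j != i)
        probs[i] = {j: c / total for j, c in row.items() if j != i} if total else {}
    return probs


def propagate(scores, probs, alpha=0.9, iterations=1):
    """Assumed update: keep alpha of a pair's own score and add (1 - alpha)
    of the probability-weighted scores of its graph neighbours."""
    current = dict(scores)
    for _ in range(iterations):
        current = {
            i: alpha * w
            + (1 - alpha) * sum(current[j] * p for j, p in probs.get(i, {}).items())
            for i, w in current.items()
        }
    return current


# Hypothetical candidate pairs (English name, pinyin of Chinese candidate).
A = ("Clinton", "kelindun")
B = ("Baghdad", "bageda")
C = ("Smith", "shimisi")

SCORES = {A: 0.5, B: 0.9, C: 0.5}
COOC = {A: {B: 2}, B: {A: 2}, C: {}}  # A and B share two document pairs; C is isolated

NEW_SCORES = propagate(SCORES, transition_probs(COOC))

if __name__ == "__main__":
    for pair, score in NEW_SCORES.items():
        print(pair, round(score, 3))
```

Under these assumptions, pair A starts level with the isolated pair C but ends above it, because it co-occurs with the high-scoring pair B.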