Collaboration on Named Entity Discovery in Thai Agricultural Texts
Asanee KAWTRAKUL¹, Nigel COLLIER², Koichi TAKEUCHI², Kenji ONO²


This paper outlines our collaboration on the construction of a named entity recognizer for the Thai
language. This tool will support natural language processing applications in Thai such as high precision
information retrieval, summarization and machine translation. Named entity recognition has been applied
quite successfully to English and Japanese, owing to the interest generated by evaluation conferences such
as MUC (Message Understanding Conference) in the 1990s and IREX for Japanese. Currently no such tool exists
for the Thai language. Initially we will apply models based on machine learning that have been successful
for English and test their applicability to the Thai language. The application domain we will focus on is
agriculture and we intend to build a test collection to train and evaluate our system.
Keywords: Thai language, information extraction, named entity recognition

1. Introduction

Named entity (NE) extraction is emerging as a key technology in the development of the next generation
of information access tools that will help machines to understand the contents of textual documents. NE
emerged as one of the sub-tasks of the DARPA-sponsored Message Understanding Conferences (MUCs),
e.g. (MUC, 1998), held from about 1988 to 1998 which helped to formalize information extraction as an
application of natural language processing (NLP). NE's main role was to identify expressions such as the
names of people, places and organizations as well as date and time expressions. Such expressions are hard
to analyze using traditional NLP because they belong to the open class of expressions, i.e. there is an
infinite variety and new expressions are constantly being invented.
      Although both commercial and academic NE tools exist for English language texts, so far there has
been relatively little work in non-European languages with the exception of Japanese for which the IREX
conference (Sekine and Isahara, 1998) was held in 1998. Since the needs for language processing
technology in non-European languages are as great as those of English, we are now exploring the
application of NE technology to the Thai language. In particular we are focusing on the Thai agricultural
domain which we expect will provide a useful and challenging test of this technology.
      Existing agricultural texts are scattered throughout organizations and are written by many individuals
using several data models and platforms. Moreover, the amount of agricultural information has been
increasing exponentially since we entered the Internet era. Despite the amount and availability of textual
information, the usefulness of this data should be considered with regard to how easily it can be
understood and synthesized by human readers. We intend that the named entity tool should form part of an
improved information access system which aids humans in finding the information that is useful for their
needs. The development of this tool is a collaborative project between the authors of this paper at the
National Institute of Informatics and the NAiST laboratory at Kasetsart University.
      To achieve this goal, at Kasetsart University we have started a project which aims to support the
development of information discovery and Web-based information exchange in the agricultural domain.

¹ NAiST, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok,
Thailand. Email:
² National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.
                    NETWORKS AND SYSTEMS (WAINS 8)
The gist of the information will be extracted using frames and indices as the drivers. The frame-based
summary is then translated into the target language.
     The proposed system consists of four modules: the Gathering Module for retrieving related agricultural
information consisting of summaries and technical reports, the Indexing and Clustering Module for
classifying the information, the Summarizing Module for extracting the essential sentences and filling them
into the predefined frames, and the Translation Module for translating the completed frame into the target
language. At present we focus on English to Thai translation. The system is outlined further in Section 4.
NE technology is expected to contribute to several of these modules, in particular to high-precision
indexing and intelligent summarization.
     In the next sections we will review current NE technology for the English language, provide an
overview of the KU system design and also outline a corpus collection which we propose to build.

2. Technology Background

Machine learning aims to construct machines capable of learning from experience and is applied to tasks
that cannot be easily solved using traditional programming approaches. In the type of machine learning
which is commonly used for named entity extraction, called supervised learning, the function is learnt
automatically using a set of input/output examples - referred to as training data.
      Named entity tools try to annotate texts, i.e. to denote regions that belong to pre-specified semantic
classes such as PERSON or ORGANIZATION. At the simplest level we want to decide if a region of
text belongs to a class and if so then to which one. We are concerned only with those element classes that
represent concepts in the domain ontology that correspond to terminological, date or numeric expressions
at the text level, i.e. with the semantics of the text. For example, a person's name, a protein name, a
chemical formula or a date expression may be a valid candidate for tagging depending on whether it is
contained in the ontology, but text regions corresponding to other types of content such as a title, a caption
or a table are not. This can be viewed as a type of multi-class classification task and there are several
effective and well studied learning algorithms available for this such as Hidden Markov Models (HMMs)
(Rabiner and Juang, 1986) which we describe in more detail below.
      HMMs can be considered to be stochastic finite state machines and have enjoyed success in a number
of fields including speech recognition and part-of-speech tagging (Kupiec, 1992) in a number of different
languages. It has been natural therefore that these models have been adapted for use in other word-class
prediction tasks such as the named-entity task in IE. Such models are often based on n-grams. Although
the assumption that a word's part-of-speech or name class can be predicted by the previous n-1 words and
their classes is counter-intuitive to our understanding of linguistic structures and long distance
dependencies, this simple method does seem to be highly effective in practice. Nymble (Bikel et al., 1997),
a system which uses HMMs, is one of the most successful such systems; it trains on a corpus of marked-up
text, using only character features in addition to word bigrams.
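
As a rough illustration of the bigram HMM tagging described above, the following Python sketch decodes the most likely name-class sequence with the Viterbi algorithm. The class set, the hand-set toy probabilities, the floor value used for unseen words and the function name `viterbi` are all assumptions made for this example. They are not taken from Nymble or any other system cited here, and a real model would estimate its probabilities from an annotated corpus with proper smoothing.

```python
import math

# Hypothetical name classes and hand-set toy probabilities (illustration only).
STATES = ["PERSON", "ORG", "OTHER"]

START = {"PERSON": 0.2, "ORG": 0.2, "OTHER": 0.6}

# TRANS[prev][cur] = P(cur class | prev class): a first-order (bigram) model.
TRANS = {
    "PERSON": {"PERSON": 0.5, "ORG": 0.1, "OTHER": 0.4},
    "ORG":    {"PERSON": 0.1, "ORG": 0.5, "OTHER": 0.4},
    "OTHER":  {"PERSON": 0.2, "ORG": 0.2, "OTHER": 0.6},
}

# EMIT[class][word] = P(word | class); unseen words fall back to FLOOR.
EMIT = {
    "PERSON": {"asanee": 0.4, "kawtrakul": 0.4},
    "ORG":    {"kasetsart": 0.5, "university": 0.4},
    "OTHER":  {"works": 0.3, "at": 0.3, "university": 0.05},
}
FLOOR = 1e-4  # crude stand-in for a real smoothing method


def viterbi(words):
    """Return the most probable class sequence for a list of lowercase words."""
    # best[i][s] = (log-probability of the best path ending in s at word i,
    #               previous state on that path)
    best = [{
        s: (math.log(START[s]) + math.log(EMIT[s].get(words[0], FLOOR)), None)
        for s in STATES
    }]
    for word in words[1:]:
        column = {}
        for cur in STATES:
            emit = math.log(EMIT[cur].get(word, FLOOR))
            score, prev = max(
                (best[-1][p][0] + math.log(TRANS[p][cur]) + emit, p)
                for p in STATES
            )
            column[cur] = (score, prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


sentence = ["asanee", "kawtrakul", "works", "at", "kasetsart", "university"]
tags = viterbi(sentence)
print(list(zip(sentence, tags)))
```

With these toy numbers the decoder labels the first two words PERSON, the next two OTHER and the last two ORG. Note that "university" is assigned to ORG because of the preceding class, which is the kind of local context an n-gram model captures.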
      Recently the application of NE methods has been extended to the identification and classification of
terminology in domains outside news. For example, studies into the use of supervised learning-based
models in the micro-biology domain have shown that models based on HMMs (Collier et al., 2000) and
decision trees are much more generalisable and adaptable to new classes of words than systems based on
traditional hand-built patterns and domain specific heuristic rules such as (Fukuda et al., 1998),
overcoming the problems associated with data sparseness with the help of sophisticated smoothing
algorithms (Chen and Goodman, 1996).
      Although it is still early days for the use of HMMs for IE, we can see a number of trends in the
research. Systems can be divided into those which use one state per class such as Nymble (at the top level
of their backoff model) and those which automatically learn about the model's structure such as (Seymore
et al., 1999). Additionally, a distinction can be made according to the source of the knowledge used for
estimating transition probabilities: some models are built by hand, such as (Freitag and McCallum, 1999),
while others learn from tagged corpora in the same domain, such as the model presented in this paper, or
from word lists and corpora in different domains, so-called distantly-labeled data (Seymore et al., 1999).

3. Adapting Named Entity Technology for the Thai Language

The Thai language is both an isolating language and a tone language. Thai sentences do not use spaces as
delimiters between words, as is also the case in other Asian languages such as Japanese, Korean and
Chinese. There is therefore a problem of finding word boundaries and of unambiguously applying
grammatical tags, because of incomplete-sentence errors and semantic inconsistency within
word sequences. This puts a greater burden on morphological analysis than in European languages, i.e., we
need to segment a sentence into a valid sequence of word units (Nobesawa, 1994; Seung-Shik Kang, 1994).
      Ambiguity in morphological analysis leads to a related problem which we call implicit spelling
errors, and this also needs to be solved at morphological processing level. Implicit spelling errors are
errors in segmenting the Thai sentence that lead to incorrect words (for that sentence) that are nevertheless
correct words in the Thai lexicon. Our earlier work attempted to provide a robust morphological analyzer
by using a gradual refinement module for detecting as many possible alternatives and/or the erroneous
chains of words caused by this non-trivial problem. The Thai morphological analysis program we
developed (Kawtrakul et al., 1997) is therefore faced with three nontrivial problems, i.e., word boundary
detection, part-of-speech tagging ambiguity, and implicit spelling errors. Our work attempted to
provide a computational solution, called Word Filtering, to handle these three problems prior to parsing.
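
The word boundary ambiguity described above can be made concrete with a small dictionary-based segmenter that enumerates every way of splitting an unspaced string into lexicon words. This is only an illustrative sketch: the four-word lexicon, the example string ตากลม and the function name `segmentations` are assumptions made for this example and are not part of the morphological analyzer or the Word Filtering method cited above.

```python
from functools import lru_cache

# Toy lexicon (an assumption for illustration, not the authors' dictionary):
# ตา = eye, กลม = round, ตาก = to expose, ลม = wind
LEXICON = frozenset({"ตา", "กลม", "ตาก", "ลม"})


def segmentations(text, lexicon=LEXICON):
    """Return every split of `text` into a sequence of lexicon words."""
    max_len = max(len(word) for word in lexicon)

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty remainder
        results = []
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in lexicon:
                # Build new lists so that cached results are never mutated.
                results.extend([word] + rest for rest in split_from(end))
        return results

    return split_from(0)


for words in segmentations("ตากลม"):
    print(" | ".join(words))
```

The string has two segmentations, ตา | กลม ("round eye") and ตาก | ลม ("exposed to the wind"). Both consist solely of valid lexicon words, which is why a wrong choice behaves like an implicit spelling error and cannot be detected by dictionary lookup alone.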
      Unlike in many European languages where there is relatively rich orthographic information to aid in
the selection of NEs, Thai orthography has no equivalent of uppercase characters to distinguish between
proper nouns and other classes of words.
      Furthermore, loan words, acronyms, synonyms and definite anaphora are often used in real texts.
Words that are unknown to the Thai lexicon are becoming one of the major problems for applied
natural language processing. Typically, parsing a sentence that includes unknown words will degrade the
performance of text understanding and information retrieval processes. An automatic tagging system
therefore needs to be able to handle both known and unknown words.
      Another important problem for our application of NE to Thai is loan words, which we frequently find
in texts. Like other languages such as Japanese and Korean, Thai imports loan words from English and
other languages, especially in agriculture. These terms are borrowed from foreign languages and have
usually been transliterated into Thai orthography, although not always according to a phonetic standard. In
(Kawtrakul et al., 1997), a loan word is classified as one of six types of unknown words. Based on an
analysis of unknown words in 30,401 words of technical reports and microcomputer magazines, we
found that unknown words account for about 15% of the text on average, of which 68.28% are loan words
and 21.93% are foreign words. Accordingly, automatic backward transliteration has great practical
importance in Thai language processing systems.
      The challenges that we have noted above are illustrated in Table 1.

Table 1 Comparison of named entity for Thai and English

                                Thai                                English
1. Tokenization                 Advanced morphological analysis     English word tokenization is
                                is needed as Thai sentences do      generally not complicated except
                                not use spaces as delimiters.       in cases where there are
                                E.g. [Thai example missing in       ambiguous uses of punctuation
                                source]. There is also the          markers.
                                danger of ambiguity between
                                misspelled words and named
                                entities. Example: [Thai
                                example missing in source].
2. Local structural ambiguity   Thai named entities exhibit         English named entities exhibit
                                structural ambiguity in             structural ambiguity in
                                prepositional phrase (PP)           prepositional phrase (PP)
                                attachment and in conjunction       attachment and in conjunction
                                scope. Example: [Thai example       scope. Examples: Victoria and
                                missing in source].                 Albert Museum, IBM and Bell
3. Orthography                  Thai orthographical knowledge is    Capitalization can often be a
                                not available to signal word        valuable clue for identifying
                                classes, unlike the use of          proper names, and other
                                uppercase characters to set off     orthographic features can be
                                proper names from other classes     useful in technical terminology
                                in English. Example: [Thai text     NE. Examples: IBM, London
                                missing in source] (The             Stock Exchange, Interleukin-2
                                Productivity of Buffalo Kept        protein.
                                under Village Conditions in
                                North-East Thailand)
4. Semantic ambiguity           Problem of categorizing the         Semantic ambiguity in assigning
                                named entity (for example, a        classes to NEs exists in English.
                                person's name is the same as a
                                location name). Example: [Thai
                                example missing in source] can
                                be both the surname of a Thai
                                prime minister and a building.
5. Temporal information         Thai has no explicit word to
                                signal temporal relations or to
                                aid anaphora resolution, since
                                Thai often uses zero anaphora.

4. Overview of the NAiST System

The proposed system consists of 4 modules (see Fig. 1).

Gathering Module: Two types of agriculture-related information will be collected, i.e., web-based
information and technical paper abstracts from a database. The former will be gathered by an intelligent
agent called a web robot. The latter is provided by the Department of Agriculture, Ministry of Agriculture
and Cooperatives, and by the Food and Agriculture Organization. The web robot automatically extracts
related articles from the Internet, starting from a specified URL. The HTML-formatted information will
be stripped to plain text and then stored in the EBG (extended binary graph) data structure
based on the Active Hypermedia Delivery System Architecture (AHYDS) (Andres et al., 1998).
Indexing and Clustering Module: The collected articles will be indexed and classified automatically
using the multilevel indexing and clustering model (Kawtrakul et al., 1996; Kawtrakul, 1997).
Automatic multilevel indexing consists of three modules: lexical token identification, phrase identification
and multilevel index generation. Each module accesses different linguistic knowledge bases. The role of
lexical token identification is to determine word boundaries and to recognize named entities, such as
technical terms. In order to compute multilevel indices, phrase identification and relation extraction are
needed. The relations between terms in a phrase will be extracted in order to define indices at the
single-term level and the conceptual level.
Summarizing Module: The essential sentences will be extracted according to
indices, linguistic cues and/or expressions. This information will be parsed and filled into the predefined
frame or table, called a translation template. The translation templates are generalized from technical report
and summary structures. Shallow parsing is used to extract the head of each phrase and the main events.
Syntactically analyzed words, phrases and/or sentences are stored in the templates.

Translation Module: The approach makes use of morphological knowledge and phrase banks as a basis
for translation. A thesaurus quantifies the similarity between two or more concepts. The translation is
generated from the templates.
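
To make the data flow between the four modules easier to follow, here is a minimal Python sketch of the pipeline. Everything in it is hypothetical: the function names, the `Frame` fields, the keyword-based indexing and the one-entry dictionary lookup are simple placeholders standing in for the web robot, multilevel indexing, shallow parsing and phrase-bank translation described above. It should not be read as the NAiST implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical translation template filled by the Summarizing Module."""
    index_terms: list = field(default_factory=list)
    key_sentences: list = field(default_factory=list)


def gather(raw_html_pages):
    """Gathering stand-in: strip HTML tags to obtain plain text."""
    return [re.sub(r"<[^>]+>", " ", page).strip() for page in raw_html_pages]


def index(document, term_list):
    """Indexing stand-in: keep the known terms that occur in the document."""
    lowered = document.lower()
    return [term for term in term_list if term in lowered]


def summarize(document, index_terms):
    """Summarizing stand-in: keep sentences that mention an index term."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    keep = [s for s in sentences if any(t in s.lower() for t in index_terms)]
    return Frame(index_terms=index_terms, key_sentences=keep)


def translate(frame, dictionary):
    """Translation stand-in: look up each index term in a bilingual dictionary."""
    return {term: dictionary.get(term, term) for term in frame.index_terms}


pages = ["<p>Rice yield improved in 1999. The weather was mild.</p>"]
terms = ["rice", "buffalo"]          # hypothetical index vocabulary
en_th = {"rice": "ข้าว"}              # hypothetical one-entry dictionary

for doc in gather(pages):
    frame = summarize(doc, index(doc, terms))
    print(frame.key_sentences)
    print(translate(frame, en_th))
```

Each stage consumes only the output of the previous one, so a named entity recognizer could in principle replace the keyword lookup in `index` without changing the rest of the flow.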

5. Discussion

The names that we are trying to extract fall into a number of categories that are often wider than the
definitions used for the traditional named-entity task used in MUC and may be considered to share many
characteristics of term recognition. For this reason the group at Kasetsart University is cooperating with
the NII PIA (Portable Information Access) (Collier et al., 2001) project to help produce new guidelines
that allow NEs to be consistently and reliably annotated in a wide range of domains and languages.
      The PIA guidelines (publication forthcoming) are a simplification of the MUC guidelines for English
named entity annotation and are extended to allow for linkage to knowledge on the next generation World
Wide Web (the so-called Semantic Web). Critically, the new guidelines also separate the notions of what
to tag from how to tag. Basically PIA NE is interested in annotating entity names, temporal expressions,
number expressions as well as technical terminology. Since technical terminology was not considered in
MUC we expect these new guidelines to be more applicable to our agricultural text collection.

5.1 Annotation and Construction of a Test Collection

Given the importance of agriculture as a key industry for Thailand we have decided to make this our
choice of example domain for evaluation purposes. As mentioned earlier this domain offers several
advantages including the presence of large quantities of text in a structured format and good availability of
domain experts to help with named entity annotation. The types of NEs we expect to focus on will
include names of plants and agriculturalists' names.
     Before beginning annotation we need to begin construction of a simple ontology (conceptualization of
a domain specification) for specifying the concept classes, their attributes and their hierarchical relations.
This will be done in cooperation with domain experts in agriculture. Following from this we will need to
develop a technical test collection to train and evaluate the effectiveness of our tool.

6. Conclusion

In this paper we have outlined our plan to cooperate on extending named entity technology to the Thai
language, focusing on a new text collection in the agricultural domain. We have given an overview of NE
technology which has been developed for English and Japanese languages and the likely problems we will
have to overcome when adapting the methods to Thai.
     Although our discussion has largely focused on HMMs, there are of course many NE models that are
not based on HMMs that have had success in the NE task at the MUC and IREX conferences. Our main
requirement for implementing a model for the Thai language will initially be the power to handle a somewhat
noisy lexical context combined with part-of-speech information. HMMs seem to be one of the most
favorable options at this time. Alternatives that have also had considerable success are decision trees and
maximum entropy models. The maximum entropy model shown in (Borthwick et al., 1998) in particular seems a
promising alternative approach because of its ability to handle overlapping and large feature sets within a
well-founded mathematical framework.

References


F. Andres, A. Kawtrakul, K. Ono et al. 1998. Development of Thai Document Processing System based
on AHYDS by Network Collaboration. In Proc. 5th International Workshop of Academic Information
Networks and Systems (WAINS), Bangkok, Thailand.
D. Bikel, S. Miller, R. Schwartz and R. Weischedel. 1997. Nymble: a high-performance learning
name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194-201.

A. Borthwick, J. Sterling, E. Agichtein and R. Grishman. 1998. Exploiting diverse knowledge sources via
maximum entropy in named entity recognition. In Proceedings of the Workshop on Very Large Corpora.
S. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. 34th
Annual Meeting of the Association for Computational Linguistics, California, USA, 24-27 June.
N. Collier, C. Nobata and J. Tsujii. 2000. Extracting the names of genes and gene products with a hidden
Markov model. In Proceedings of the 18th International Conference on Computational Linguistics
(COLING’2000), Saarbrucken, Germany, July 31-August 4th.
N. Collier, T. Takeuchi, K. Tsuji, G. Jones, J. Fukumoto, N. Ogata, C. Nobata and K. Ono. 2001. Position
paper proceedings of the International Semantic Web Working Symposium (SWWS), Stanford University,
USA, pages 8-9, 30th July – 1st August, 2001.
D. Freitag and A. McCallum. 1999. Information extraction with HMMs and shrinkage. In Proceedings of
the AAAI’99 Workshop on Machine Learning for Information Extraction, Orlando, Florida, July 19th.
K. Fukuda, T. Tsunoda, A. Tamura and T. Takagi. 1998. Toward information extraction: identifying
protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing’98
(PSB’98), Hawai’i, USA, January.
A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan and Y. Inagaki. 1996. A Gradual
Refinement Model for a Robust Thai Morphological Analyzer. In Proceedings of COLING 96, pages 1086-1089.
A. Kawtrakul. 1997. Automatic Thai Unknown Word Recognition. In Proceedings of the Natural
Language Processing Pacific Rim Symposium, Phuket, pages 341-346.
J. Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and
Language, 6: 225-242.
MUC. 1998. Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA,
USA, May.
L. Rabiner and B. Juang. 1986. An introduction to hidden Markov models. IEEE ASSP Magazine, pages
4-16, January.
S. Sekine and H. Isahara. 1998. IREX: Information retrieval, information extraction contest. In
Information Processing Society of Japan Joint SIG FI and SIG NL Workshop, University of Tokyo, Japan,
K. Seymore, A. McCallum and R. Rosenfeld. 1999. Learning hidden Markov model structure for
information extraction. In Proceedings of the AAAI’99 Workshop on Machine Learning for Information
Extraction, Orlando, Florida, July 9th.

Figure 1. The System Overview
[Diagram not reproduced. Components shown: Gathering Module (Web Robot, Pre-Processing, Document
Warehouse, Text Document (Agriculture Abstract)); Indexing and Clustering (Lexical Token Identification,
Compute Weight, Phrasal Identification & Extraction, Multilevel Indexing, Clustering); Summarizing Module;
Translation Module (Translation); Linguistic Knowledge (Grammar Rule, Frame Structure, Multilingual
Dictionary).]