DRAMNERI: a free knowledge based tool to Named Entity Recognition

Antonio Toral
Grupo de investigación en Procesamiento del Lenguaje Natural y Sistemas de Información
Departamento de Lenguajes y Sistemas Informáticos
University of Alicante, Spain
email@example.com

Abstract. In this paper we present DRAMNERI, a free software application that uses rules and gazetteers to perform Named Entity Recognition. The system is fully customizable to any specific domain and is multilingual. It has been successfully applied in a domain-specific Information Extraction system and in a Question Answering task.

1 Introduction

Named Entity Recognition (NER) is nowadays an important task for the resolution of more complex problems such as Information Retrieval (IR), Information Extraction (IE) or Question Answering (QA), among others. NER was, however, initially used only as a subtask of IE, the Natural Language Processing (NLP) task that consists of retrieving relevant information from unstructured texts and producing as a result a structured set of data, usually referred to as templates. Several subtasks are applied to achieve this goal, and NER is one of them. As defined in the Message Understanding Conference, NER consists of identifying and categorizing entity names, which can also include temporal and/or numerical expressions.

As in other NLP techniques, there are two approaches to NER. One is based on knowledge, while the other uses a supervised learning algorithm. Regarding resources, the former usually uses gazetteers and rules, whereas the latter needs an annotated corpus. The knowledge-based model obtains good results in specific domains, as the gazetteers can be adapted very precisely, and it is able to detect complex entities, as the rules can be tailored to meet nearly any requirement.
However, when dealing with an unrestricted domain it is better to use the learning approach, as building rules and gazetteers in this case would be very tedious and time consuming.

Because our aim is to classify complex entities in restricted texts, we have adopted the knowledge-based model. We also wanted our system to be highly flexible and adaptable. That is why we have made almost all possible parameters customizable (e.g. dictionaries to use, entity categories, length of contexts). This way the system can be easily configured to work with different languages and domains, and it can deal with an open set of entity categories¹.

Regarding software licenses, we strongly agree with the Freeling and Weka developers that the free availability of basic NLP tools would speed up progress in our area of research. Thus, we modestly contribute in this respect by developing this software under a free license².

2 Architecture

DRAMNERI stands for Dictionary, Rule-based And Multilingual Named Entity Recognition Implementation. It is a multiplatform³ NER system written in C++. It is organized as a sequential set of modules with a high degree of flexibility, meaning that some modules may or may not be used depending on the input⁴. Moreover, most of the actions it performs, and the dictionaries and rules it uses, are configurable through parameter files. The main modules of our system are briefly outlined in the following subsections.

Fig. 1.
DRAMNERI architecture

¹ Usually NER systems are limited to a closed set that typically includes Person, Location, Organization, and time, date and numerical expressions.
² This software is distributed under the terms of the GNU General Public License (GPL).
³ It has been successfully tested on GNU/Linux and Win32, but it should work on any system with a C++ compiler with STL support.
⁴ For example, we could process text that is already tokenized and/or already split into sentences.

2.1 Tokenizer

The built-in tokenizer was designed with efficiency and simplicity in mind, and it can be used for correctly punctuated common texts in languages with Latin encodings. If the program is used for unusual domains, for languages with different encodings, or with other demands on tokenization, then an external module should be used instead. There are free tokenizers that can deal with this task, such as the one included in Freeling.

2.2 Sentence Splitter

We split sentences using an algorithm based on the method used for this task in the EXIT system. For every token in the text that can delimit a sentence (e.g. dot, question mark), the two preceding tokens and the following token are considered. With this context information, some rules and dictionaries are applied to decide whether or not the token is an end of sentence.

2.3 Named Entity Identification

This task is applied to each sentence in the given text, and its goal is to identify the named entities that appear in it. We use regular expressions to do this. Groups of tokens that match any NEI regular expression, joined by prepositions⁵, are detected and identified as generic entities. Both the maximum number of prepositions allowed between two tokens that match an NEI regular expression and the list of prepositions to consider are configurable.
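The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DRAMNERI's C++ implementation: the NEI regular expression (a token starting with an uppercase letter) and the names `identify_entities`, `tag` and `max_preps` are hypothetical, and the input is assumed to be already tokenized.

```python
import re

# Hypothetical NEI regular expression: any token starting with an uppercase letter.
NEI_PATTERN = re.compile(r"^[A-ZÁÉÍÓÚÑ]")


def identify_entities(tokens, prepositions, max_preps):
    """Return (start, end) token spans, end exclusive, of generic entities."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        if not NEI_PATTERN.match(tokens[i]):
            i += 1
            continue
        start, end = i, i + 1
        while True:
            j = end
            # Skip at most max_preps joining prepositions.
            while j < n and j - end < max_preps and tokens[j].lower() in prepositions:
                j += 1
            # Extend the entity only if a matching token follows them.
            if j < n and NEI_PATTERN.match(tokens[j]):
                end = j + 1
            else:
                break
        spans.append((start, end))
        i = end
    return spans


def tag(tokens, spans):
    """Wrap each identified span in <ENTITY> ... </ENTITY> markers."""
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    out = []
    for k, token in enumerate(tokens):
        if k in starts:
            out.append("<ENTITY>")
        out.append(token)
        if k + 1 in ends:
            out.append("</ENTITY>")
    return " ".join(out)


tokens = "Pedro de la Viuda".split()
print(tag(tokens, identify_entities(tokens, {"de", "la"}, max_preps=1)))
```

With `max_preps=1` the two prepositions between "Pedro" and "Viuda" exceed the limit, so two separate entities are produced. Raising the limit to 2 yields a single entity spanning all four tokens.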
For example, if we have ‘de’ and ‘la’ in the preposition list and the maximum number of prepositions between identified tokens is 1, then the string “en la Universidad de Alicante” would be identified as “en la <ENTITY> Universidad de Alicante </ENTITY>”, but “Pedro de la Viuda” would be identified as “<ENTITY> Pedro </ENTITY> de la <ENTITY> Viuda </ENTITY>” instead of “<ENTITY> Pedro de la Viuda </ENTITY>”, because two prepositions separate the identified tokens and the maximum is 1.

⁵ Extracted from a preposition list specific to NEI.

2.4 Named Entity Recognition

The goal of this phase is to assign a category to each of the entities detected in the previous step. Note that the boundaries of the identified entities can be altered in this phase. To perform the classification we take into account external and internal evidence; that is, we look for any information that helps us to classify each entity by studying its left and right contexts and the entity itself, respectively. We perform these two actions sequentially. The two processes are detailed as follows.

Classification using external evidence

We use trigger gazetteers. By a trigger we mean a word or collection of words that appears before or after an entity and determines its classification type. For trigger-driven classification, length-configurable left and right contexts of the identified entity are considered. Within these contexts, front-trigger and back-trigger gazetteers are applied respectively. If a trigger is found, the entity is classified with the category of the gazetteer that the matching trigger belongs to.

For example, if we have the string “calle <ENTITY> Mayor </ENTITY>” and calle is a location trigger, then “calle Mayor” is classified as a location entity. The output string would be “<ENTITY type=LOCATION> calle Mayor </ENTITY>”.
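A minimal Python sketch of trigger-driven classification follows. It rests on several assumptions that the text does not confirm: triggers are single tokens, a matching front trigger is absorbed into the entity (as in the calle Mayor example), and a back trigger only assigns the category without moving the boundaries. The names `classify_by_triggers`, `render`, `left_len` and `right_len` are hypothetical.

```python
def classify_by_triggers(tokens, span, front_triggers, back_triggers,
                         left_len=1, right_len=1):
    """Return (category, span) for an entity given its token span (end exclusive).

    front_triggers and back_triggers map a lowercase trigger word to the
    category of the gazetteer it belongs to.
    """
    start, end = span
    # Left context, nearest token first: look for a front trigger.
    for k in range(start - 1, max(start - left_len, 0) - 1, -1):
        category = front_triggers.get(tokens[k].lower())
        if category:
            # Assumption: the entity is widened to include the front trigger.
            return category, (k, end)
    # Right context: look for a back trigger.
    for k in range(end, min(end + right_len, len(tokens))):
        category = back_triggers.get(tokens[k].lower())
        if category:
            # Assumption: a back trigger leaves the boundaries untouched.
            return category, (start, end)
    return None, (start, end)


def render(tokens, span, category):
    """Rebuild the string with the classified entity marked up."""
    start, end = span
    inner = " ".join(tokens[start:end])
    tagged = f"<ENTITY type={category}> {inner} </ENTITY>"
    return " ".join(tokens[:start] + [tagged] + tokens[end:])


tokens = ["calle", "Mayor"]
category, span = classify_by_triggers(tokens, (1, 2), {"calle": "LOCATION"}, {})
print(render(tokens, span, category))
```

The context lengths play the role of the length-configurable left and right contexts: a trigger further away than `left_len` or `right_len` tokens is ignored and the entity stays unclassified.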
If we classify an entity with a front trigger, we then try to extend the classified entity by using rules, which follow the standard egrep syntax. An example follows.

rule: ^no [0-9]+
identified entity: "calle <ENTITY> Mayor </ENTITY> no 27"

“calle Mayor” is classified as an address using triggers. After that, it is extended because no 27 matches a rule. Thus, the final entity classified with type address is “calle Mayor no 27”.

Classification using internal evidence

For this classification we use two resources: gazetteers and rules. Like the rules used for trigger-driven classification, these follow the standard egrep syntax. They may also contain elements that refer to gazetteers. Each rule is linked to an entity category, so that if a rule matches an entity, the entity is assigned the category linked to the rule. The following example matches first name and surname in Spanish:

rule: ^PER (PER)? ((PREP)? (PREP)? PER)
category: PER

This rule matches an entity that consists of a token in the Person gazetteer, optionally followed by a second such token, then by up to two tokens from the prepositions gazetteer, and finally by another Person token. If an entity matches, it is assigned the category PER. Example strings that would match are “Alberto Pérez” and “Pedro Mario de la Viuda”.

3 Application to specific tasks

We have not directly evaluated DRAMNERI, as it is intended to be a generic tool that can be adapted to specific tasks. Instead, we present two tasks in which we have applied it. First, we outline the use of this NER system in an Information Extraction system whose domain consists of notarial documents. Second, we discuss the use of DRAMNERI in a Question Answering (QA) system.

In the first task we had a collection of notarial documents from which we wanted to build templates containing the relevant data in a sorted way. To do this, DRAMNERI was applied between a preprocessing step and a postprocessing step.
In the preprocessing step we divided the documents into several sections, because the relevant data changed from section to section. For each section we built adapted gazetteers and rules with which to apply DRAMNERI. Finally, we applied a postprocessing step to the classified entities of each section. Roughly, it consisted of filling templates taking into account the different entity categories, the order in which the entities were classified, and so on. An output template is shown as an example:

<file name="10500006.doc.txt">
  <section name="comparece" config_file="config_persona.txt">
    <ENT TYPE=PER NUM=1>PEDRO PEREZ LOPEZ</ENT>
    <ENT TYPE=PER NUM=2>JOSEFA PEREZ</ENT>
    <ENT TYPE=DIR REF=1-2>calle Zepelin número 5 3o </ENT>
    <ENT TYPE=ID-NIF REF=1>25526996-S</ENT>
    <ENT TYPE=ID-NIF REF=2>69962777-U</ENT>
  </section>
  <section name="interviene" config_file="config_persona.txt">
    <ENT TYPE=NOT NUM=1>José María González</ENT>
    <ENT TYPE=ORG NUM=2>Banco Santander Central S.A</ENT>
  </section>
  <section name="expone" config_file="config_objeto.txt">
    <ENT TYPE=OBJ NUM=1>URBANA.- CATORCE.- Piso</ENT>
    <ENT TYPE=DIR REF=1>calle Sevilla</ENT>
    <ENT TYPE=RCT REF=1>9932023UT2959S0050PS</ENT>
  </section>
</file>

In our second task we applied DRAMNERI to Question Answering. The aim of a QA system is to find a specific answer to a given query in a collection of documents. These systems are usually made up of an Information Retrieval (IR) module and a QA algorithm. The IR module retrieves from the collection the documents most relevant to a given query, and QA is applied only to these documents, as its computational cost is quite high.

Our approach consisted of applying NER between IR and QA. This way we filtered out those documents returned by IR that did not contain any entity belonging to the same category as the answer. For example, if the query were “Who is the president of France?”, the answer category would be Person, and we would therefore filter out all the retrieved documents in which no Person entity was found.
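The filtering step between IR and QA can be sketched as follows. This is a simplified illustration, not the system's actual code: the mapping from interrogative word to answer category, the document representation (a dict holding the set of categories NER found) and the names `expected_category` and `filter_documents` are all assumptions. Only the "who" to Person case comes from the text.

```python
# Hypothetical mapping from interrogative word to expected answer category;
# only the "who" -> Person case is taken from the text.
ANSWER_CATEGORY = {"who": "PER", "where": "LOC", "when": "DATE"}


def expected_category(query):
    """Guess the answer category from the first word of the query."""
    words = query.split()
    if not words:
        return None
    return ANSWER_CATEGORY.get(words[0].lower())


def filter_documents(documents, query):
    """Keep only documents with at least one entity of the expected category.

    Each document is a dict whose "categories" entry is the set of entity
    categories that NER found in it.
    """
    category = expected_category(query)
    if category is None:
        # Unknown answer type: do not discard anything.
        return list(documents)
    return [doc for doc in documents if category in doc["categories"]]


retrieved = [
    {"id": "d1", "categories": {"PER", "LOC"}},
    {"id": "d2", "categories": {"ORG"}},
    {"id": "d3", "categories": {"LOC", "DATE"}},
]
kept = filter_documents(retrieved, "Who is the president of France?")
print([doc["id"] for doc in kept])
```

Only the surviving documents would be passed on to the more expensive QA stage.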
By applying NER we achieved a 26% data reduction and, moreover, increased the data relevance by 9%.

4 Conclusions and Future Work

We have fulfilled our main objective, that is, to have an easily customizable NER system that performs well at detecting complex entities in specific domains. Besides, we think that providing a NER tool such as this one as free software could help the research community to focus on investigating new algorithms and techniques rather than spending time implementing again and again basic algorithms that have little to contribute to the field.

One line of future research could consist of investigating how to combine this NER system with a learning-based NER system in order to improve the results of the latter. Another would be to determine how the addition of linguistic information (lexical, morphological and syntactic) could improve the performance of our NER system.

Acknowledgements

This research has been partially funded by the Valencia Government under project number GV04B-268.

References

1. A. Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, New York University, September 1999.
2. X. Carreras, I. Chao, L. Padró, and M. Padró. Freeling: An Open-Source Suite of Language Analyzers. In Proceedings of the 4th LREC Conference, 2004.
3. N. Chinchor. Overview of MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.
4. D. McDonald. Internal and external evidence in the identification and semantic categorization of proper names. Corpus Processing and Lexical Acquisition, 1996.
5. R. Muñoz and M. Palomar. Sentence Boundary and Named Entity Recognition in EXIT system: Information extraction system of notarial texts. In Proceedings of IV Int. Conference on Artificial Intelligence and Emerging Technologies in Accounting, 1998.
6. Antonio Toral, Elisa Noguera, Fernando Llopis, and Rafael Muñoz. Improving question answering using named entity recognition.
In Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, pages 181–191, 2005.
7. Ian Witten and Eibe Frank. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, 2000.