               DRAMNERI: a free knowledge based tool to
                       Named Entity Recognition

                                   Antonio Toral

    Grupo de investigación en Procesamiento del Lenguaje Natural y Sistemas de
               Departamento de Lenguajes y Sistemas Informáticos
                            University of Alicante, Spain

       Abstract. In this paper we present DRAMNERI, a free software application
       that uses rules and gazetteers to perform Named Entity Recognition.
       The system is fully customizable to any specific domain and is
       multilingual. It has been successfully applied in a domain-specific
       Information Extraction system and in a Question Answering task.

1    Introduction

Named Entity Recognition (NER) is nowadays an important task in the resolution
of more complex problems such as Information Retrieval (IR), Information
Extraction (IE) and Question Answering (QA), among others.
    Nevertheless, NER was initially used only as a subtask of IE, the
Natural Language Processing (NLP) task that consists of retrieving relevant
information from unstructured texts and producing as a result a structured
set of data, usually referred to as templates. Several subtasks are applied in
order to achieve this goal and, as already pointed out, one of them is NER. As
defined in the Message Understanding Conference [3], NER consists of identifying
and categorizing entity names, which can also include temporal and/or numerical
expressions.
    As in other NLP techniques, there are two approaches to NER [1]. One
is based on knowledge, while the other uses a supervised learning algorithm.
Regarding resources, the former usually relies on gazetteers and rules, whereas
the latter needs an annotated corpus. The knowledge-based model obtains good
results in specific domains, as the gazetteers can be adapted very precisely,
and it is able to detect complex entities, as the rules can be tailored to meet
nearly any requirement. However, for an unrestricted domain it is better to
use the learning approach, as building rules and gazetteers would be very
tedious and time-consuming in that case.
    Because our aim is to classify complex entities in restricted texts, we have
adopted the knowledge-based model. We also wanted our system to be highly
flexible and adaptable, which is why we have made almost all possible parameters
customizable (e.g. the dictionaries to use, the entity categories, the length of
contexts, etc.). This way the system can easily be configured to work with
different languages and domains. Moreover, it can deal with an open set of
entity categories¹.
    Regarding software licenses, we would like to point out that we strongly
agree with the Freeling [2] and Weka [7] developers that the free availability
of basic NLP tools would speed up progress in our area of research. Thus, we
modestly contribute in this respect by releasing this software under a free
license².

2     Architecture

DRAMNERI stands for Dictionary, Rule-based And Multilingual Named Entity
Recognition Implementation. It is a multiplatform³ NER system written in
C++. It is organized as a sequence of modules with a high degree of flexibility,
meaning that individual modules may be used or skipped depending on the input⁴.
Moreover, most of the actions it performs, as well as the dictionaries and rules
it uses, are configurable through parameter files. The main modules of our
system are briefly outlined in the following subsections.

                            Fig. 1. DRAMNERI architecture

¹ NER systems are usually limited to a closed set that typically includes Person,
  Location and Organization, plus time, date and numerical expressions.
² This software is distributed under the terms of the GNU General Public License.
³ It has been successfully tested on GNU/Linux and Win32, but it should work on
  any system with a C++ compiler with STL support.
⁴ For example, we could process text that is already tokenized and/or already
  split into sentences.

2.1     Tokenizer

The built-in tokenizer has been designed with efficiency and simplicity in mind,
and it can be used for correctly punctuated general texts in languages written
with Latin character encodings. If the program is used for unusual domains, for
languages with different encodings or with other demands on tokenization, then
an external module should be used instead. There are free tokenizers that can
deal with this task, such as the one included in Freeling [2].

2.2     Sentence Splitter

We split the sentences using an algorithm based on the method used for this
task in the EXIT system [5]. For every token in the text that can delimit a
sentence (e.g. a full stop or a question mark), the two preceding tokens and the
following token are considered. With this context information, some rules and
dictionaries are applied to decide whether or not the token marks the end of a
sentence.
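
As an illustration only, the decision procedure described above can be sketched in Python. The delimiter set, the abbreviation dictionary and the rules below are assumptions made for this example; they are not the actual resources used by DRAMNERI or EXIT.

```python
# Hypothetical resources: the paper does not list its rules or dictionaries.
DELIMITERS = {".", "?", "!"}
ABBREVIATIONS = {"Sr", "Sra", "Dr", "D", "etc"}  # assumed abbreviation dictionary


def is_sentence_end(prev2, prev1, nxt):
    """Decide whether a delimiter token closes a sentence.

    prev2 and prev1 are the two tokens before the delimiter and nxt is the
    token after it; any of them is None when it falls outside the text.
    """
    if nxt is None:
        return True  # nothing follows, so the text ends here
    if prev1 is not None:
        if prev1 in ABBREVIATIONS:
            return False  # dictionary lookup: "Sr ." does not end a sentence
        is_single_letter = len(prev1) == 1 and prev1.isalpha()
        if is_single_letter and (prev1.isupper() or prev2 == "."):
            return False  # initials or dotted sequences such as "J . M ."
    # Assumed rule: a new sentence starts with an upper-case letter.
    return nxt[0].isupper()


def split_sentences(tokens):
    """Group a list of tokens into a list of sentences (lists of tokens)."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token in DELIMITERS:
            prev1 = tokens[i - 1] if i >= 1 else None
            prev2 = tokens[i - 2] if i >= 2 else None
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if is_sentence_end(prev2, prev1, nxt):
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)  # trailing tokens without a delimiter
    return sentences


example = split_sentences(
    ["El", "Sr", ".", "Toral", "vive", "en", "Alicante", ".", "Trabaja", "allí", "."]
)
```

In this sketch the full stop after "Sr" is rejected by the dictionary lookup, so `example` holds two sentences. Real texts would need richer rules and dictionaries than the ones assumed here.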

2.3     Named Entity Identification

This task is applied to each sentence in the given text, and its goal is to
identify the named entities that appear in it. We use regular expressions to do
this. Groups of tokens that match any NEI regular expression, optionally joined
by prepositions⁵, are detected and identified as generic entities. The maximum
number of prepositions allowed between two tokens that match any NEI regular
expression and the list of prepositions to consider are both configurable.
    For example, if we have ‘de’ and ‘la’ in the preposition list and the maximum
number of prepositions between identified tokens is 1, then the string “en
la Universidad de Alicante” would be identified as “en la <ENTITY> Universidad
de Alicante </ENTITY>”, but “Pedro de la Viuda” would be identified
as “<ENTITY> Pedro </ENTITY> de la <ENTITY> Viuda </ENTITY>”
instead of “<ENTITY> Pedro de la Viuda </ENTITY>”, because two prepositions
(‘de’ and ‘la’) separate the matching tokens and the maximum allowed is 1.
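
The grouping behaviour of this example can be reproduced with a short Python sketch. It is not the DRAMNERI implementation: the single NEI pattern (a capitalised word) and the function name `identify_entities` are assumptions, while the preposition list and the maximum of 1 come from the example above.

```python
import re

# Assumed NEI regular expression: a word starting with an upper-case letter.
NEI_PATTERN = re.compile(r"[A-ZÁÉÍÓÚÑ]\w*")
PREPOSITIONS = {"de", "la"}  # preposition list from the example
MAX_PREPOSITIONS = 1         # maximum prepositions between matching tokens


def identify_entities(tokens, prepositions=PREPOSITIONS, max_preps=MAX_PREPOSITIONS):
    """Wrap groups of matching tokens in <ENTITY> tags and return a string."""
    output = []
    i, n = 0, len(tokens)
    while i < n:
        if not NEI_PATTERN.fullmatch(tokens[i]):
            output.append(tokens[i])
            i += 1
            continue
        end = i  # index of the last token currently inside the entity
        while True:
            j = end + 1
            # Skip at most max_preps prepositions after the entity.
            while j < n and j - end - 1 < max_preps and tokens[j] in prepositions:
                j += 1
            # Extend only if a matching token follows the skipped prepositions.
            if j < n and NEI_PATTERN.fullmatch(tokens[j]):
                end = j
            else:
                break
        output.append("<ENTITY> " + " ".join(tokens[i:end + 1]) + " </ENTITY>")
        i = end + 1
    return " ".join(output)


first = identify_entities("en la Universidad de Alicante".split())
second = identify_entities("Pedro de la Viuda".split())
```

Calling the function with `max_preps=2` joins "Pedro de la Viuda" into a single entity, which shows how the configurable limit changes the entity boundaries.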

2.4     Named Entity Recognition

The goal of this phase is to assign a category to each of the entities detected
in the previous step. It should be noted that the boundaries of the identified
entities can be altered in this phase. In order to achieve the classification,
we take into account external and internal evidence [4]; that is, we look for
any information that helps us to classify each entity by studying its left and
right contexts (external evidence) and the entity itself (internal evidence).
We perform these two actions sequentially. The two processes are detailed below.

⁵ Extracted from a preposition list specific to NEI.

Classification using external evidence. We use trigger gazetteers. By a
trigger we mean a word or group of words that appears before or after an
entity and determines its category. For trigger-driven classification,
length-configurable left and right contexts of the identified entity are
considered. Within these contexts, the front-trigger and back-trigger gazetteers
are applied, respectively. If a trigger is found, the entity is classified with
the category of the gazetteer to which the matching trigger belongs.
    For example, if we have the string “calle <ENTITY> Mayor </ENTITY>”
and ‘calle’ is a location trigger, then “calle Mayor” is classified as a location
entity. The output string would be “<ENTITY type=LOCATION> calle Mayor
</ENTITY>”.
    If we classify an entity with a front trigger, we then try to extend the
classified entity by using rules, which follow the standard egrep syntax. An
example:

rule: ^nº [0-9]+
identified entity: "calle <ENTITY> Mayor </ENTITY> nº 27"

    “calle Mayor” is classified as an address using triggers. After that, it is
extended because “nº 27” matches a rule. Thus, the final entity, classified with
type address, is “calle Mayor nº 27”.
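
A minimal Python sketch of front-trigger classification followed by rule-based extension is given below. The gazetteer contents, the category name `ADDRESS`, the context length and both function names are hypothetical. The extension rule is written with the Spanish abbreviation nº, and back triggers, which would be handled analogously on the right context, are left out for brevity.

```python
import re

# Hypothetical front-trigger gazetteer, keyed by entity category.
FRONT_TRIGGERS = {"ADDRESS": {"calle", "avenida", "plaza"}}
# Extension rules per category; Python's re syntax stands in for egrep here.
EXTENSION_RULES = {"ADDRESS": [re.compile(r"^nº [0-9]+")]}


def classify_with_front_trigger(tokens, start, end, context_length=2):
    """Classify the entity tokens[start:end] using its left context.

    Returns (category, new_start, end), where the entity now begins at the
    trigger, or None when no trigger is found within context_length tokens.
    """
    lowest = max(0, start - context_length)
    for k in range(start - 1, lowest - 1, -1):  # nearest token first
        for category, triggers in FRONT_TRIGGERS.items():
            if tokens[k].lower() in triggers:
                return category, k, end
    return None


def extend_entity(tokens, category, start, end):
    """Extend the entity to the right when the following text matches a rule."""
    rest = " ".join(tokens[end:])
    for rule in EXTENSION_RULES.get(category, []):
        match = rule.match(rest)
        if match:
            # Count the matched tokens (assumes the match ends on a token boundary).
            end += len(match.group(0).split())
            break
    return start, end


tokens = ["calle", "Mayor", "nº", "27"]
category, start, end = classify_with_front_trigger(tokens, 1, 2)
start, end = extend_entity(tokens, category, start, end)
final_entity = " ".join(tokens[start:end])
```

Here the entity "Mayor" (tokens 1 to 2) is first widened to include its trigger and then extended over the street number, so `final_entity` covers the whole address.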

Classification using internal evidence. For this classification we use two
resources: gazetteers and rules. Like the rules used for trigger-driven
classification, these follow the standard egrep syntax, but they may also
contain elements that refer to gazetteers. Each rule is linked to an entity
category, so that if a rule matches an entity, the entity is assigned the
category linked to the rule. The following example matches first names and
surnames in Spanish:

rule: ^PER (PER)? ((PREP)? (PREP)? PER)
category: PER

    This rule matches an entity that consists of a token found in the Person
gazetteer, optionally followed by a second such token, then by up to two tokens
from the prepositions gazetteer, and finally by another Person token. If an
entity matches, it is assigned the category PER. Example strings that would
match are “Alberto Pérez” and “Pedro Mario de la Viuda”.
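
One possible way to evaluate such a rule, sketched in Python under stated assumptions, is to replace every token by the name of the gazetteer that contains it and to match the resulting label sequence. The gazetteer contents and function names are hypothetical, and the pattern is a token-level rendering of the rule above in which each label carries its trailing space. The paper does not specify how DRAMNERI evaluates these rules internally.

```python
import re

# Hypothetical gazetteers, just large enough for the two example strings.
GAZETTEERS = {
    "PER": {"alberto", "pérez", "pedro", "mario", "viuda"},
    "PREP": {"de", "la"},
}

# Token-level rendering of "^PER (PER)? ((PREP)? (PREP)? PER)".
RULES = [(re.compile(r"PER (PER )?((PREP )?(PREP )?PER )"), "PER")]


def label(token):
    """Return the name of the first gazetteer containing the token, else 'O'."""
    for name, entries in GAZETTEERS.items():
        if token.lower() in entries:
            return name
    return "O"


def classify_internal(entity):
    """Return the category of the first rule matching the entity, or None."""
    labels = "".join(label(token) + " " for token in entity.split())
    for pattern, category in RULES:
        # match() anchors at the start of the labels, mirroring the rule's "^".
        if pattern.match(labels):
            return category
    return None


first = classify_internal("Alberto Pérez")
second = classify_internal("Pedro Mario de la Viuda")
```

Both example strings are assigned PER, whereas a lone first name such as "Pedro" is not, because the rule requires a final Person token after the first one.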

3   Application to specific tasks

We have not evaluated DRAMNERI directly, as it is intended to be a generic
tool that can be adapted to specific tasks. Instead, we present two tasks in
which we have applied it. Firstly, we outline the use of this NER system in
an Information Extraction system whose domain consists of notarial documents.
Secondly, we discuss the use of DRAMNERI in a Question Answering (QA)
task.
    In the first task we had a collection of notarial documents from which we
wanted to build templates containing the relevant data in an ordered way. To do
this, DRAMNERI was applied between a preprocessing and a postprocessing step.
In the preprocessing step we divided the documents into several sections,
because the relevant data changed from section to section. For each section we
built adapted gazetteers and rules with which to apply DRAMNERI. Finally, we
applied a postprocessing step to the classified entities of each section.
Roughly, it consisted of filling templates taking into account the different
entity categories, the order in which the entities were classified, and so on.
An output template is shown as an example:

<file name="10500006.doc.txt">
 <section name="comparece" config_file="config_persona.txt">
  <ENT TYPE=DIR REF=1-2>calle Zepelin número 5 3º </ENT>
  <ENT TYPE=ID-NIF REF=1>25526996-S</ENT>
  <ENT TYPE=ID-NIF REF=2>69962777-U</ENT>
 <section name="interviene" config_file="config_persona.txt">
  <ENT TYPE=NOT NUM=1>José María González</ENT>
  <ENT TYPE=ORG NUM=2>Banco Santander Central S.A</ENT>
 <section name="expone" config_file="config_objeto.txt">
  <ENT TYPE=DIR REF=1>calle Sevilla</ENT>
  <ENT TYPE=RCT REF=1>9932023UT2959S0050PS</ENT>
    In our second task we applied DRAMNERI to Question Answering [6]. The
aim of a QA system is to find a specific answer to a given query in a collection
of documents. These systems are usually made up of an Information Retrieval
(IR) module and a QA algorithm. The IR module retrieves from the collection the
documents most relevant to a given query, and QA is applied only to
these documents, as its computational cost is quite high. Our approach consisted
of applying NER between IR and QA. In this way we filtered out all the
documents returned by IR that did not contain any entity belonging to the
same category as the answer. For example, if the query were “Who is the
president of France?”, then the answer category is Person and thus we would
filter out all the retrieved documents in which no Person entity was found. By
applying NER we achieved a 26% data reduction and, moreover, we increased by 9%
the data
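
The filtering step placed between IR and QA can be illustrated with the following Python sketch. The mapping from question words to answer categories, the document layout and the function names are assumptions made for the example and are not taken from the system described in [6].

```python
# Assumed mapping from the question word to the expected answer category.
ANSWER_CATEGORY = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}


def expected_category(question):
    """Guess the answer category from the first word of the question."""
    words = question.split()
    first_word = words[0].lower() if words else ""
    return ANSWER_CATEGORY.get(first_word)


def filter_documents(documents, category):
    """Keep only the documents holding at least one entity of the category.

    Each document is a dict with an "entities" list of (text, category)
    pairs, as a NER system could produce. Nothing is filtered when the
    answer category is unknown.
    """
    if category is None:
        return list(documents)
    return [
        doc for doc in documents
        if any(entity_category == category for _, entity_category in doc["entities"])
    ]


retrieved = [
    {"id": "doc1", "entities": [("Paris", "LOCATION")]},
    {"id": "doc2", "entities": [("Jacques Chirac", "PERSON"), ("France", "LOCATION")]},
]
category = expected_category("Who is the president of France?")
kept = filter_documents(retrieved, category)
```

Only `doc2` is passed on to the QA algorithm in this example, since `doc1` contains no entity of the expected category.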

4   Conclusions and Future Work
We have fulfilled our main objective, namely to have an easily customizable NER
system that could perform well at detecting complex entities in specific domains.
    Besides, we think that providing an NER tool such as this one as free software
could help the research community to focus on investigating new algorithms
and techniques rather than spending time implementing again and again basic
algorithms that have little to contribute to the field.
    One line of future research could consist of investigating how to combine
this NER system with a learning-based NER system in order to improve the
results of the latter. Another line of research would be to determine how the
addition of linguistic information (lexical, morphological and syntactic) could
improve the performance of our NER system.

This research has been partially funded by the Valencia Government under
project number GV04B-268.

References

1. A. Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD
   thesis, New York University, September 1999.
2. X. Carreras, I. Chao, L. Padró, and M. Padró. Freeling: An Open-Source Suite of
   Language Analyzers. In Proceedings of the 4th LREC Conference, 2004.
3. N. Chinchor. Overview of MUC-7. In Proceedings of the Seventh Message
   Understanding Conference (MUC-7), 1998.
4. D. McDonald. Internal and external evidence in the identification and semantic
   categorization of proper names. Corpus Processing and Lexical Acquisition, 1996.
5. R. Muñoz and M. Palomar. Sentence Boundary and Named Entity Recognition in
   EXIT system: Information extraction system of notarial texts. In Proceedings of IV
   Int. Conference on Artificial Intelligence and Emerging Technologies in Accounting,
6. A. Toral, E. Noguera, F. Llopis, and R. Muñoz. Improving question
   answering using named entity recognition. In Proceedings of the 10th
   International Conference of Applications of Natural Language to Information
   Systems, pages 181–191, 2005.
7. I. Witten and E. Frank. Data Mining: Practical machine
   learning tools with Java implementations. Morgan Kaufmann, 2000.
