A Tool for the Automatic and Manual Annotation of Biomedical Documents


Anália Lourenço (1), Sónia Carneiro (1), Rafael Carreira (2), Miguel Rocha (2), Isabel Rocha (1), Eugénio Ferreira (1)

(1) IBB - Institute for Biotechnology and Bioengineering, Center of Biological Engineering
(2) Department of Informatics / CCTC
University of Minho, Campus de Gualtar, 4710-057 Braga – PORTUGAL

Abstract

The techniques developed within the field of Biomedical Text Mining (BioTM) have been mainly tested and evaluated over a set of known corpora built by a few researchers with a specific goal or to support scientific competitions. The generalized use of BioTM software therefore requires that an enlarged set of corpora is made available, covering a wider range of biomedical research topics. This work proposes a user-friendly and interoperable software tool that facilitates the task of building a BioTM corpus by allowing both automatic and manual annotation of biomedical documents (supporting both abstracts and full text). This tool is also integrated in a more comprehensive BioTM framework.

1 Introduction

Semantic annotation, sometimes called concept matching in the biomedical literature, is the process of mapping phrases within a source text to distinct concepts defined by domain experts.

Traditionally, such annotation was exclusively manual. However, the growing scientific publication rate, the continuous evolution of biological terminology and the more complex analysis requirements brought by systems-level approaches call for automated curation processes (Ananiadou et al., 2006; Natarajan et al., 2005; Erhardt et al., 2006).

The research field of BioTM emerged from this need and has been providing helpful computerised approaches. In particular, Biomedical Named Entity Recognition (BioNER), the field that deals with the unambiguous identification of named entities (such as names of genes, proteins, gene products, organisms, drugs, chemical compounds, etc.), is the key step for accessing and integrating the information stored in the literature (Zweigenbaum et al., 2007; Jensen et al., 2006; Natarajan et al., 2005).

Techniques for term identification are becoming widely used in biomedical research. Lexical resources (Fundel and Zimmer, 2006; Mukherjea et al., 2004; Kou et al., 2005; Muller et al., 2004) and rule-based systems (Hu et al., 2005; Hanisch et al., 2005) deliver some degree of automation. On the other hand, Machine Learning contributions (Okazaki and Ananiadou, 2006; Kou et al., 2005; Shi and Campagne, 2005; Yeganova et al., 2004; Sun et al., 2006) address issues like term novelty, synonymy (including term variants and abbreviations) and homonymy.

Despite current achievements, technique development and usage are constrained by the limited availability of high-quality training corpora. In fact, at this point, biomedical annotated corpora represent a bottleneck in the development of BioTM software. Existing approaches cannot be extended without the production of corpora, conveniently validated by domain experts.

In this work, a contribution to tackle this matter is provided, with the development of a novel interoperable and user-friendly software application that supports manual curation of biomedical documents. The proposed software implements a workflow where a biomedical corpus is automatically annotated based on a specialised dictionary. The discovered biomedical concept output is then directed into a manual curation stage, and finally a high-quality biomedical annotated corpus is made available.

Both the automatic and manual annotation tasks are envisioned to be flexible, allowing the tagging of many biological entity classes and the creation and use of different dictionaries, extracted from major biomedical databases. Although we have our own annotation schema, the software is expected to be useful within other domains which have domain-specific resources available. In other words, if a new annotation schema is defined and the dictionary builders cope with it, both automatic and manual annotation are supported.

The remainder of this paper starts by placing annotation tools within the BioTM scenario, establishing basic requirements and identifying related work. The enumeration of the software development aims follows. Next, the main features of the proposed software application are discussed, namely the creation of particular dictionaries, the default annotation schema, the automatic annotation module and the user-friendly manual annotation environment. Final remarks provide an overall perspective of the work and identify new features.

2 The Role of Annotation Tools in BioTM

Emerging efforts in BioTM agree on considering manually annotated biomedical corpora as priceless resources (Kim et al., 2008; Kim et al., 2003). Many researchers openly contribute and disseminate annotated corpora such as GENIA (Kim et al., 2003), PennBioIE (Kulick et al., 2004) or GENETAG (Tanabe et al., 2005). Also, there are datasets coming from well-known challenges such as BioCreAtive (http://biocreative.sourceforge.net/). Yet, adaptation of available resources to new problems (real-world scenarios) usually requires substantial efforts, since they have been designed to meet a particular aim and tend not to comply with any common data format.

The construction of a new corpus implies the laborious and time-consuming manual collection and annotation of a significant number (typically hundreds) of documents. It is not straightforward to gather, organise and annotate a valuable set of documents. On the one hand, the set of documents has to be representative of the domain it is supposed to describe, i.e., it has to embrace the terminological trends that characterise the domain, while establishing a contrast towards other domains. On the other hand, annotation has to be as comprehensible and consensual as possible. According to a given annotation schema, different annotators should be able to agree, producing similar outputs. Otherwise, either the annotation schema is not able to reflect the domain conveniently, or the domain requires further annotation rules that prevent contradicting or misleading outputs.

It is not reasonable to acknowledge the need for corpora without devising computational annotation tools. There exist several manual text annotation tools for creating annotated corpora. General-purpose annotation tools such as Callisto (http://callisto.mitre.org/), WordFreak (Morton and LaCivita, 2003), the General Architecture for Text Engineering (GATE) (Cunningham et al., 2002) and MMAX2 (http://mmax.eml-research.de/) are references in the area. However, these tools present limited flexibility and their 'out of the box' usage often demands expert programming skills.

Although offering customisable tasks (for example, a simple annotation schema can be defined with an XML DTD), these tools do not offer any support for biology-related natural language processing. Dedicated tools such as POS taggers, parsers and named entity recognisers are becoming widely available and it would be desirable to integrate them into annotation tools.

Tools should support semantic annotation by hand and some form of automatic annotation (using available resources such as dictionaries, ontologies, templates or user-specified rules). Moreover, by supporting both syntactic and semantic annotation, a wide variety of annotation schemas can be defined and used. New annotation tasks can be built without writing new software or creating specialised configuration files.

3 Development Aims

The development of our biomedical annotation tools was driven by two important needs, essential for creating useful text corpora: i) accuracy and consistency of the annotations, and ii) usability of the data. The major aim of this work is therefore two-fold: i) to provide a friendly environment for curators and ii) to take advantage of the multiple informational resources available, enhancing the annotation process as much as possible.

In this sense, the baseline requirements of our tools were interoperability with other tools/modules and flexibility in terms of annotation schemas and data exchange formats. Annotation schemas should be made as general as possible, covering major biomedical classes and thus enabling (partial) schema interchange. Also, document annotation may comprise both syntactic (POS information) and semantic annotations (BioNER information).

The main aim of the annotation environment presented here is to provide common text processing modules and to enable automatic and manual document annotation. The text processing pipeline was modelled with minimal assumptions about the dependences between modules and the order in which they are applied. Tokenisation, sentence splitting and stopword removal are the basic text processing steps, and typically do not rely on previous pre-processing, whereas chunk parsing as well as BioNER may be based on POS annotation. Not only should the tools be able to deal with multi-layer annotation, but annotation processes should also not take precedence over one another, i.e. semantic annotation may occur after or before POS tagging.

Furthermore, neither automatic nor manual annotation processes are considered mandatory. Typically, manual annotation is time-consuming and should be considered a later step, accounting for false positive matches (term homonymy) and missed annotations (term synonymy and term novelty). However, it is up to the user to decide whether to trigger one or both processes.

4 Implementation

The implementation of our tools comprises the following components/modules:
  • an input/output module enabling the conversion of documents from common file formats (such as PDF and HTML) to plain text;
  • a pre-processing module embracing XML-based text structuring (the title, authors, journal, abstract and the location of major sections are tagged), tokenisation and stopword removal;
  • a default annotation schema embracing all major biological entity classes (genes, proteins, compounds and organisms) and some uncommon, although valuable classes (laboratory techniques and physiological states);
  • a lexicon-based biomedical annotator which supports the construction of customised dictionaries as well as user-defined rules and lookup tables;
  • a user-friendly annotation viewer based on Cascading Style Sheets (CSS) that allows the user to verify and correct annotations and refine dictionary contents.

Additionally, it is important to note that, unlike many previous approaches, our tools are able to handle both abstracts and full text documents interchangeably. In most cases, the latter provide more useful information.

4.1 Lexical Resources

The tool supports two kinds of lexical resources: lookup tables and dictionaries. The authors have prepared lookup lists of standard laboratory techniques and general physiological states. Also, the user may create general or particular dictionaries from major biomedical databases such as BioCyc (http://biocyc.org/), UniProt (http://www.uniprot.org/) or ChEBI and integrated databases such as BioWarehouse (Figure 1). Each data source is characterised in terms of the embraced biological classes and organism (if it is a multi-organism source). The user may decide to include all contents or select just a few, depending on the purpose of the dictionary.

Database copyrights are preserved as there is no content distribution with the tool. In order to deploy any loader, the user has to download the contents from the corresponding source.
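The selection step described for dictionary construction (choosing which biological classes and which organism to load from a source) can be illustrated with a minimal Python sketch. This is not the tool's own code: the record layout, the names `SourceRecord` and `build_dictionary`, and the sample entries are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry exported from a data source (hypothetical layout)."""
    term: str
    bio_class: str   # e.g. "gene", "protein", "compound"
    organism: str    # empty string for organism-independent entries


def build_dictionary(records, classes=None, organism=None):
    """Keep records of the requested classes/organism, indexed by lower-cased term."""
    dictionary = {}
    for rec in records:
        if classes is not None and rec.bio_class not in classes:
            continue
        # Organism-independent entries (organism == "") are always kept.
        if organism is not None and rec.organism not in ("", organism):
            continue
        dictionary.setdefault(rec.term.lower(), set()).add(rec.bio_class)
    return dictionary


# Illustrative sample data, not taken from any real database export.
records = [
    SourceRecord("lacZ", "gene", "Escherichia coli"),
    SourceRecord("beta-galactosidase", "protein", "Escherichia coli"),
    SourceRecord("glucose", "compound", ""),
    SourceRecord("ADH1", "gene", "Saccharomyces cerevisiae"),
]

ecoli_genes = build_dictionary(records, classes={"gene"}, organism="Escherichia coli")
```

Passing `classes=None` or `organism=None` corresponds to the "include all contents" choice mentioned above; a term maps to a set of classes so that homonyms across classes are not silently overwritten.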

            Figure 1. Deploying the construction of a new dictionary using available data loaders.

On the other hand, all created resources are kept in relational format (currently, on the MySQL database engine) and thus can eventually be shared.

4.2 Annotation Schemas

The default semantic annotation schema was created by the authors and aims at tracking down major biological entities. Currently, the system accounts for a total of 14 biological classes as follows:
    • gene
        • metabolic gene
        • regulatory gene
    • protein
        • transcription factor
        • enzyme
    • pathway
    • reaction
    • compound
    • organism
    • DNA
    • RNA
    • physiological state
    • laboratory technique

This schema allows the user to identify molecular entities that may describe different levels of biological organisation and thus lead to a better insight into the functional description of cellular processes.

For instance, a physiological state is frequently characterised by particular levels of defined biological entities, like compounds catalysed by certain enzymes, which in turn are encoded by the respective genes. Besides common annotation, this schema also supports annotation linking to lexical resources (Figure 2), i.e., it identifies the dictionary entry that triggered each tagging as well as the normalised term (the "concept label" that gathers together known variants and synonyms of a given term).

The ability to use other annotation schemas is considered a premise of tool interoperability and data re-use. As such, annotation schemas derived from the GENIA ontology (Kim et al., 2003), a formal model of cell signaling reactions in humans, or used in challenges such as BioCreAtive, often referenced by the research community as gold standards, were accounted for. It is possible to choose which schema to use on a given annotation task and also to translate from one schema to another. Additionally, we envisage the incorporation of new schemas as long as the user specifies tagging and mapping functions.

Regarding POS, the premise is similar and thus we chose to incorporate GATE for the development of language processing components. GATE provides a reusable design and a set of prefabricated software building blocks (namely tokenizers, sentence splitters and POS taggers) that can be used, extended and customised for specific needs. Also, its component-based model allows for easy coupling and decoupling of the processors, thereby facilitating comparison of alternative configurations or different implementations of the same module (e.g., different parsers). Figure 2 illustrates an example of POS tagging output.
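The default schema and the idea of translating between schemas can be pictured with the short Python sketch below. The class names and their nesting come from the list in this section; everything else (the target labels, the `TO_OTHER_SCHEMA` mapping and the function names) is hypothetical, since the paper does not specify what the user-supplied mapping functions look like.

```python
# Parent of each class in the default schema; None marks a top-level class.
DEFAULT_SCHEMA = {
    "gene": None,
    "metabolic gene": "gene",
    "regulatory gene": "gene",
    "protein": None,
    "transcription factor": "protein",
    "enzyme": "protein",
    "pathway": None,
    "reaction": None,
    "compound": None,
    "organism": None,
    "DNA": None,
    "RNA": None,
    "physiological state": None,
    "laboratory technique": None,
}

# Hypothetical, partial mapping onto another schema's labels (illustrative only).
TO_OTHER_SCHEMA = {"gene": "GENE", "protein": "PROTEIN"}


def top_level(bio_class):
    """Return the top-level ancestor of a class in the default schema."""
    parent = DEFAULT_SCHEMA[bio_class]
    return bio_class if parent is None else top_level(parent)


def translate(bio_class, mapping):
    """Map a class to the target schema, falling back to its ancestors.

    Returns None when neither the class nor any ancestor has a counterpart,
    which corresponds to a partial schema interchange.
    """
    current = bio_class
    while current is not None:
        if current in mapping:
            return mapping[current]
        current = DEFAULT_SCHEMA[current]
    return None
```

Under these assumptions, `translate("enzyme", TO_OTHER_SCHEMA)` falls back to the parent class and yields the target label for proteins, while a class with no counterpart, such as "compound", is simply left unmapped.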

Figure 2. Small piece of an annotated document using the default annotation schema and GATE default POS

                 Figure 3. Configuring the automated lexical-based BioNER process.

                                Figure 4. Snapshot of the manual annotation environment.

4.3 Automatic Annotation

The conversion of source formats into plain text is carried out by freeware programs such as Xpdf (http://www.foolabs.com/xpdf/) (Windows or Linux) and pdftotext (Mac OS). The process of XML-oriented document structuring was implemented by the authors using simple pattern matching. Documents (abstracts or full texts) are submitted to tokenising and stopword removal processes, implemented using the Lingua::PT::PLNbase and Lingua::StopWords Perl modules, respectively.

Following the pre-processing step, lexicon-based BioNER is performed by a specialised rewriting system developed by the authors upon the Text::RewriteRules Perl module. The user specifies the supporting dictionary and the set of biological classes to be annotated (Figure 3). Lookup tables and general templates may also be included. Furthermore, the process can be deployed over abstracts or full texts.

The system attempts to match terms against dictionary and lookup table contents, checking for different term variants (e.g. hyphen and apostrophe variants) and excluding terms that are too short (fewer than three characters). Annotation gives preference to the longest term match, tracking up to hepta-grams (i.e. sequences of up to seven words).

Additional patterns account for previously unknown terms and term variants. For example, the template "([a-z]{3}[A-Z]+\d*)" (a sequence of three lower-case letters followed by one or more upper-case letters and a sequence of zero or more digits) is used to identify candidate gene names, while the categorical nouns "ase" and "mRNA" track down possible enzyme and RNA mentions, respectively. Besides class identification, the system also supports term normalisation, grouping all term variants around a "common name" for visualisation and statistical purposes.

4.4 Manual Annotation

The manual annotation environment accounts for the review of automatic annotations by experts and the enhancement of the lexical resources. Also, manually curated documents are intended to be further used as training corpora to build annotation, classification or other generalised learning models regarding biomedical contents.
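To make concrete what the expert is reviewing, the matching strategy of Section 4.3 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' rewriting system (which builds on the Text::RewriteRules Perl module): the function name `annotate`, the token-based representation and the sample dictionary are hypothetical, variant handling and stopword removal are left out, and only the seven-word limit, the three-character minimum, the gene-name template and the "ase"/"mRNA" cues are taken from the text.

```python
import re

MAX_NGRAM = 7          # longest term tried, in words (from the text)
MIN_TERM_LENGTH = 3    # terms shorter than this are skipped (from the text)
GENE_TEMPLATE = re.compile(r"[a-z]{3}[A-Z]+\d*")  # template quoted in the text


def annotate(tokens, dictionary):
    """Return (start, end, text, class) tuples, preferring the longest match."""
    annotations = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, shrinking down to a single token.
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if len(candidate) < MIN_TERM_LENGTH:
                continue
            bio_class = dictionary.get(candidate.lower())
            if bio_class is not None:
                match = (i, i + n, candidate, bio_class)
                break
        if match is None:
            # Fall back to patterns for terms missing from the dictionary.
            token = tokens[i]
            if GENE_TEMPLATE.fullmatch(token):
                match = (i, i + 1, token, "gene")
            elif token.endswith("mRNA"):
                match = (i, i + 1, token, "RNA")
            elif len(token) > MIN_TERM_LENGTH and token.endswith("ase"):
                match = (i, i + 1, token, "enzyme")
        if match is not None:
            annotations.append(match)
            i = match[1]   # continue after the annotated span
        else:
            i += 1
    return annotations


# Illustrative dictionary and sentence (assumed data).
dictionary = {"glucose": "compound", "glucose 6-phosphate": "compound"}
tokens = "The lacZ gene and hexokinase act on glucose 6-phosphate".split()
result = annotate(tokens, dictionary)
```

In this example the two-word entry "glucose 6-phosphate" wins over the shorter "glucose", "lacZ" is picked up by the gene template and "hexokinase" by the "ase" cue. The suffix cue also fires on ordinary words such as "phase" or "case", which is one reason the false positives of the automatic step are handed to the manual review.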
Although the actual corpus file with annotation is encoded in XML, the annotators work on a CSS-styled view which is much more user-friendly (Figure 4). Furthermore, a query view is used to depict the relation of the annotated terms with dictionary entries.

When the user revises dictionary-based annotation and corrects or adds annotations, the dictionary is updated with such previously unknown or mischaracterised information. Therefore, this process has two major outputs: high-quality annotation and dictionary enrichment. The latter is a classical example of a process of learning by experience that accounts for well-known biological issues such as term novelty, term synonymy and term homonymy. Term novelty and the association of synonyms are far from being adequately tackled, as they depend on the expert's knowledge, which is limited and often outdated just like dictionaries. However, the disambiguation of distinct mentions using the same term (e.g. same gene, protein and RNA name) is a classical example where manual curation is invaluable.

Also, users may cooperate on curation tasks, sharing locally processed documents and taking advantage of dictionaries that have been refined by other users.

5 Conclusions

The need for user-friendly and interoperable semantic annotation tools is indisputable in BioTM. Research benefits greatly from the re-use of data (such as annotated corpora) and the capacity to interchange tools (namely POS and semantic taggers). However, this is only possible if tools are devised for this purpose, i.e., if they account for general annotation as well as annotation interchange and if processing tools are prepared to account for distinct annotation schemas. On the other hand, annotation is a laborious and time-consuming task that requires from the curators both expertise on the subjects and critical judgment. In this sense, it is very important that annotation tools take advantage of data mining models and available knowledge resources, minimising manual curation efforts, and at the same time provide a user-friendly environment.

In this work, a contribution to these issues is provided, with the development of a novel interoperable and user-friendly software tool for biomedical annotation. Its primary contributions are as follows: the ability to process abstracts and full texts interchangeably; a basic semantic annotation schema encompassing all major biomedical entity classes (genes, proteins, compounds and organisms) and some uncommon, although valuable classes (laboratory techniques and physiological states); the ability to use standard annotation schemas such as GENIA; a pre-processing module capable of converting documents from common file formats (such as PDF and HTML) to plain text and then tokenising and removing stopwords from such texts; a lexicon-based biomedical annotator which allows the construction of customised dictionaries as well as user-defined rules and lookup tables; a user-friendly annotation view that allows the user to verify and correct annotations and refine dictionary contents.

The tool can be used as a stand-alone environment or it can be integrated in a more comprehensive BioTM framework. Currently, it is incorporated in the @Note Biomedical Text Mining workbench (Lourenço et al., 2008). Here, tool interoperability enables automatic information retrieval (PubMed keyword-based query and document retrieval from open-access and subscribed web-accessible journals) as well as mining experiments (using annotated corpora to construct BioNER models).

Future work includes the enhancement of annotation skills based on curator suggestions and the implementation of several measures to minimise inter-annotator discrepancies and maintain the quality of annotation. Semantic type checking and detection of anomalies in the resulting annotations are planned as the first steps.

The tools are freely available from http://sysbio.di.uminho.pt/anote.php.

Acknowledgments

This work is partly funded by the research projects recSysBio (ref. POCI/BIO/60139/2004) and MOBioPro (ref. POSC/EIA/59899/2004) financed by the Portuguese Fundação para a Ciência e Tecnologia. The work of Sónia Carneiro is supported by a PhD grant from the Fundação para a Ciência e Tecnologia (ref. SFRH/BD/22863/2005).

References

S. Ananiadou, D. B. Kell and J. I. Tsujii (2006). Text mining and its potential applications in systems biology. Trends Biotechnol., 24, 571-579.

H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).

R. A. A. Erhardt, R. Schneider and C. Blaschke (2006). Status of text-mining techniques applied to biomedical text. Drug Discovery Today, 11, 315-325.

K. Fundel and R. Zimmer (2006). Gene and protein nomenclature in public databases. BMC Bioinformatics, 7.

D. Hanisch, K. Fundel, H. T. Mevissen, R. Zimmer and J. Fluck (2005). ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6.

Z. Z. Hu, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker and C. H. Wu (2005). Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics, 21, 2759-2765.

L. J. Jensen, J. Saric and P. Bork (2006). Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics, 7, 119-129.

J. D. Kim, T. Ohta, Y. Tateisi and J. Tsujii (2003). GENIA corpus--semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1, i180-i182.

J. D. Kim, T. Ohta and J. Tsujii (2008). Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9.

Z. Kou, W. W. Cohen and R. F. Murphy (2005). High-recall protein entity recognition using a dictionary. Bioinformatics, 21 Suppl 1, i266-i273.

S. Kulick, A. Bies, M. Liberman, M. Mandel, R. McDonald, M. Palmer, A. Schein and L. Ungar (2004). Integrated Annotation for Biomedical Information Extraction. NAACL/HLT Workshop on Linking Biological Literature, Ontologies and Databases: Tools for Users (pp. 61-68).

A. Lourenço, R. Carreira, S. Carneiro, P. Maia, D. Glez-Peña, F. Fdez-Riverola, E. C. Ferreira, I. Rocha and M. Rocha (2008). @Note: a flexible and extensible workbench for Biomedical Text Mining. Submitted to BMC Bioinformatics.

T. Morton and J. LaCivita (2003). WordFreak: an open tool for linguistic annotation. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Demonstrations (pp. 17-18). Morristown, NJ, USA: Association for Computational Linguistics.

S. Mukherjea, L. V. Subramaniam, G. Chanda, S. Sankararaman, R. Kothari, V. Batra, D. Bhardwaj and B. Srivastava (2004). Enhancing a biomedical information extraction system with dictionary mining and context disambiguation. IBM Journal of Research and Development, 48, 693-701.

H. M. Muller, E. E. Kenny and P. W. Sternberg (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2, 1984-1998.

J. Natarajan, D. Berrar, C. J. Hack and W. Dubitzky (2005). Knowledge discovery in biology and biotechnology texts: A review of techniques, evaluation strategies, and applications. Critical Reviews in Biotechnology, 25, 31-52.

N. Okazaki and S. Ananiadou (2006). Building an abbreviation dictionary using a term recognition approach. Bioinformatics, 22, 3089-3095.

L. Shi and F. Campagne (2005). Building a protein name dictionary from full text: a machine learning term extraction approach. BMC Bioinformatics, 6, 88.

C. J. Sun, Y. Guan, X. L. Wang and L. Lin (2006). Biomedical named entities recognition using conditional random fields model. Fuzzy Systems and Knowledge Discovery, Proceedings, 4223, 1279-1288.

L. Tanabe, N. Xie, L. H. Thom, W. Matten and W. J. Wilbur (2005). GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6.

L. Yeganova, L. Smith and W. J. Wilbur (2004). Identification of related gene/protein names based on an HMM of name variations. Computational Biology and Chemistry, 28, 97-107.

P. Zweigenbaum, D. Demner-Fushman, H. Yu and K. B. Cohen (2007). Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics, 8, 358-375.

