									                                  Automatic Information Extraction1
                             Elizabeth Boschee, Ralph Weischedel, Alex Zamanian
                                              BBN Technologies
                                                10 Moulton St.
                                         Cambridge, MA 02138 USA
                                 { eboschee, weischedel, azamania } @

                            Keywords: Information Extraction and Link Analysis, OSINT

    During the DARPA Evidence Extraction and                1    Introduction
    Link Discovery (EELD) Program, BBN devel-               BBN's primary goal in the EELD program was to dra-
    oped the SERIF system that extracts entities and        matically increase the accuracy of evidence extraction, in
    relations represented in an ontology from text,         particular by achieving at least 90% of human perform-
    e.g., from web pages, primarily containing news.        ance on extracting relations given the entities. An
    SERIF's combination of general linguistic mod-          equally important goal was that this high-performing sys-
    els trained on pre-existing corpora with domain         tem be easily extendable to new genres, domains, or lan-
    specific components trained for the particular          guages, all without the use of expert human effort.
    task allows powerful linguistic analysis tools to          BBN's approach to these challenges has four main ele-
    be efficiently applied to relation and event ex-        ments:
    traction in a new domain. Key to the new strat-           • Integrate statistical learning algorithms wherever
    egy was the intermediate use of propositions en-              feasible.
    capsulating the literal meaning of the text, from         • Analyze all text into propositions, i.e., relations
    which the target relations could then be derived.             among entities, where the relations (predicates) are
    SERIF has been applied to other evidence ex-                  literally stated in the text.
    traction tasks, including automated extraction of         • Map relations to the most specific relation in the on-
    organizational structure from message traffic and             tology for link discovery and pattern learning.
    open domain question answering. BBN was one               • Integrate general linguistic training (e.g., names and
    of the groups that achieved the DARPA EELD                    treebanks) and small domain training sets (e.g., re-
    program goal of 90% of human performance in                   lations in the ontology).
    evidence extraction. Using a hybrid of statistical         Our extraction engine, SERIF (Statistical Entity & Re-
    learning algorithms and handcrafted patterns,           lation Information Finding), embodies this approach. It
    SERIF achieved 93% of human performance in              uses trained statistical models both for the core linguistic
    extracting entities, events, and relations and 96%      analysis components, such as the parser, and for compo-
    of human performance in extracting relations            nents that depend more on the particular domain, e.g., the
    given entities and events. This surpassed our           model that predicts the semantic type of noun phrases.
    own initial goals, namely to achieve 80% of hu-         The models for general linguistic analysis can be trained
    man performance on entities with names, and             using large pre-existing training corpora, while the more
    90% on relations when given the entities.               domain-specific models, e.g., those for the specific target
                                                            relations, are trained on smaller batches of relevant train-
1                                                           ing data annotated for the purpose. For any new task, the
  This research was sponsored by the Defense Advanced
Research Projects Agency and managed by the Air Force       efficient combination of these pre-existing general lin-
Research Laboratory, Information Directorate under con-     guistic models with the new, smaller domain-specific
tract F30602-01-C-0204. The views and conclusions           components results in high performance (the primary
contained in this document are those of the authors and     goal) with relatively low transitional effort (the secon-
should not be interpreted as necessarily representing the   dary goal).
official policies, either expressed or implied of the De-      In this paper, we will focus on the aspects of this work
fense Advanced Research Projects Agency, the Air Force      involving relation extraction, specifically the develop-
Research Laboratory, or the United States Government.       ment of algorithms that learn to extract relations, which
                                                            increase both the performance and the adaptability of the
SERIF system. We will present three stages of this work            Our main strategy here, then, was the intermediate use
after a brief description of the overall information extrac-    of propositions encapsulating the literal meaning of the
tion task, beginning with the pattern-based system we           text, from which the target relations could then be de-
developed for the 2003 EELD evaluation, then exploring          rived. Rather than work directly from parse trees, the
our initial work with statistical relation extraction, and      syntactic connections between the entity mentions were
ending with our most recent work using discriminative           translated into sets of propositions that attempt to capture
models to extract relations.                                    the underlying predicate argument structure without try-
                                                                ing to resolve word sense ambiguity. For example, in the
2     Task Description                                          phrase "the government was attacked by the party", the
                                                                propositional form records that "party" is the logical sub-
The work was evaluated in the context of three different        ject and "government" the logical object of "attack",
task definitions: the 2002 ACE/EELD/TIDES task, the             without resolving whether what occurred was a speech or
2003 EELD task, and the 2004 ACE/TIDES task. All                a raid.
involve extracting both entities and relations. The entity         Specifically, these propositions take the form:
extraction components of these tasks target from five to                 predicate(role_1: arg_1, ..., role_n: arg_n)
eleven entity types, including at least person, organiza-       where the predicate is typically a verb or noun. Argu-
tion, facility, location, and GPE (geo-political entity, e.g.   ments can be either an entity or another proposition. The
city, state/province, or country). Entity extraction sys-       most common roles include logical subject, logical ob-
tems identify and link all textual mentions of those enti-      ject, premodifier (for noun predicates), and object of a
ties, including names, descriptions, and pronouns.              prepositional phrase modifying the predicate. For exam-
   The relation extraction components require detection         ple, "Smith traveled to Spain" is represented as trav-
and characterization of relations between (pairs of) enti-      eled(logical subject: Smith, PP-to: Spain), and "the U.S.
ties. In the 2002 ACE/EELD/TIDES task, twenty-four              president" is represented as president(premodifier: U.S.).
types of relations were grouped into five main categories:         It is then a simple matter to write patterns expressing
Role (the role a person plays in an organization), Part         basic concepts. Moreover, it is easy to automatically ex-
(part-whole relationships), At (location relationships),        tract and combine such patterns from sample relations in
Near (relative location relationships), and Social (person-     training data, giving the human expert a head start on
person relationships) (Linguistic Data Consortium 2002).        creating an appropriate inventory of rules. This approach
The target relation set for the 2004 ACE/TIDES task was         lays the foundation for the development of a fully trained
similar in size and composition. The structure of the           system to automatically extract relations.
2003 EELD relation set differs in that it does not group
its nineteen types of entity-entity relations into larger       3.2    Results
categories, but is similar to the ACE/TIDES sets in terms
of scope.                                                       The relation extraction system described above, com-
                                                                bined with statistical learning algorithms and handcrafted
                                                                patterns designed to extract entities and events, was de-
3     Using Propositions to Extract Rela-                       ployed in the 2003 EELD evaluation. In that evaluation,
      tions                                                     SERIF achieved 93% of human performance in extract-
                                                                ing entities, events, and relations and 96% of human per-
3.1    Approach                                                 formance in extracting relations given entities and events.
                                                                This surpassed our own initial goals for that evaluation,
BBN’s relation extraction system for the 2003 EELD              which were to achieve 80% of human performance on
evaluation was a pattern-based system. A set of patterns        entities with names, and 90% on relations when given the
(or "rules") were hand-crafted by a computational lin-          entities. This work also provided excellent groundwork
guist at BBN looking at sample data, and those rules            for research into fully trained relation extraction de-
were then automatically applied to identify and classify        scribed in the next two sections.
relations in the test data. For instance, a rule might indi-
cate that whenever the word "chairman" has a premodi-
fier that is an organization, there is an Employment rela-      4     Two Simple Approaches to Learning
tionship between that organization and that person.                   to Extract Relations
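The "chairman" rule above can be sketched over propositions of the form predicate(role: arg, ...). This is only an illustration under stated assumptions: the names Mention, Proposition, chairman_rule, and apply_rules are hypothetical, and the "ref" role (the mention that the noun predicate describes) is invented for the sketch; none of this is SERIF's actual rule language.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Mention:
    text: str
    entity_type: str  # e.g. "PER", "ORG", "GPE"


@dataclass
class Proposition:
    predicate: str
    # Role name -> argument. "premodifier" is a role named in the paper;
    # "ref" is a hypothetical role added for this sketch.
    args: Dict[str, Mention] = field(default_factory=dict)


Relation = Tuple[str, Mention, Mention]


def chairman_rule(prop: Proposition) -> Optional[Relation]:
    """Return (relation type, person, organization) if the rule fires."""
    if prop.predicate != "chairman":
        return None
    org = prop.args.get("premodifier")
    person = prop.args.get("ref")
    if org is None or person is None:
        return None
    if org.entity_type == "ORG" and person.entity_type == "PER":
        return ("Employment", person, org)
    return None


def apply_rules(props: List[Proposition]) -> List[Relation]:
    """Apply the single rule to every proposition and keep the hits."""
    hits = (chairman_rule(p) for p in props)
    return [hit for hit in hits if hit is not None]


# "BBN chairman Smith" -> chairman(premodifier: BBN, ref: Smith)
smith = Mention("Smith", "PER")
bbn = Mention("BBN", "ORG")
props = [
    Proposition("chairman", {"premodifier": bbn, "ref": smith}),
    Proposition("president", {"premodifier": Mention("U.S.", "GPE")}),
]
relations = apply_rules(props)
```

Because the rule inspects only the predicate and the roles of its arguments, it does not depend on the surface string or on the exact shape of the parse tree.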
   The basic question involved in developing a pattern-         Although the use of propositions in pattern-based relation
based system is one of representation. How does one ex-         extraction was highly successful, a highly accurate set of
press rules in a way that contains all the necessary infor-     hand-crafted rules still requires significant expert effort
mation in a compact and efficient form? One would pre-          to create. Our goal of an easily adaptable system required
fer not to create patterns directly involving surface           that we explore methods for extracting relations not as
strings, from which extraneous information can some-            dependent on the contributions of highly trained person-
times be difficult to eliminate, or parse trees, which have     nel: to us, it seems far more desirable to have student-
many subtle variations in form, particularly when they          level staff annotate training data compared to devoting an
are being produced by an (imperfect) automatic parser.          expert computational linguist to handcrafting special pur-
pose patterns. Furthermore, the prospect of writing such       of these probability measures is computed as a smoothed
patterns in other languages is far less feasible than hiring   mixture of maximum likelihood estimates, using a for-
native speakers not trained in computational linguistics.      mula similar to Witten-Bell.
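The smoothing used in the generative model (Section 4.2) is described only as "similar to Witten-Bell", so the exact formula is not given. As a rough sketch, standard Witten-Bell interpolation of a context-specific maximum likelihood estimate with a context-free backoff looks like the following. The class name, the choice of conditioning context, and the single backoff level are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable


class WittenBellEstimator:
    """P(outcome | context), interpolated with a context-free P(outcome)."""

    def __init__(self) -> None:
        self.context_counts: Dict[Hashable, Counter] = defaultdict(Counter)
        self.backoff_counts: Counter = Counter()

    def observe(self, context: Hashable, outcome: Hashable) -> None:
        self.context_counts[context][outcome] += 1
        self.backoff_counts[outcome] += 1

    def backoff_prob(self, outcome: Hashable) -> float:
        total = sum(self.backoff_counts.values())
        return self.backoff_counts[outcome] / total if total else 0.0

    def prob(self, context: Hashable, outcome: Hashable) -> float:
        counts = self.context_counts.get(context)
        if not counts:
            # Context never seen: rely entirely on the backoff estimate.
            return self.backoff_prob(outcome)
        n = sum(counts.values())   # events observed in this context
        distinct = len(counts)     # distinct outcomes observed in this context
        lam = n / (n + distinct)   # Witten-Bell weight on the specific estimate
        return lam * counts[outcome] / n + (1 - lam) * self.backoff_prob(outcome)


est = WittenBellEstimator()
ctx = ("At.Located", "travel")  # hypothetical context: (relation, predicate)
for label in ["PP-to", "PP-to", "PP-to", "PP-in"]:
    est.observe(ctx, label)
est.observe(("Role", "hire"), "object")

p_seen = est.prob(ctx, "PP-to")     # branch label seen in this context
p_unseen = est.prob(ctx, "object")  # branch label seen only elsewhere
```

The weight shifts toward the backoff estimate when a context has been seen rarely or with many different outcomes, which is what lets a decision never observed in a given context still receive some probability.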
A learning algorithm would make it far easier to extend
the system to cover new types of relations and to port the     4.3    Feature Vector Model
system to new languages. In this section, we present our       In a second model, we express each mention pair
first learning algorithms for extracting relations. The        <M1,M2> with its syntactic connection as a set of feature
trained system slightly outperformed our rule-based sys-       values. We use maximum likelihood estimates to select
tem and would have been among the top systems in the           the most likely relation type.
2002 ACE/EELD/TIDES evaluation.                                   There are five features: the entity types of M1 and M2,
                                                               the syntactic roles they play in the proposition that con-
4.1    Approach                                                nects them, and the predicate of that proposition. For
We focused on places in the text where mentions of two         example, "BBN hired Smith" would produce the vector
different entities are syntactically linked within a single    [ORG PER subject object hire]. The probability estimate
phrase or clause. (Roughly 93% of the target relations are     for a relation type is a mixture of two maximum likeli-
expressed in such a manner.) Given a pair <M1,M2> of           hood estimates, one based on all five features, and one
noun phrases mentioning a corresponding pair of entities       based on all but the feature containing the predicate word
and the syntactic connection between those two men-            (the feature that is the most sparse). Smoothing is done
tions, we use a classifier to predict the pair's most likely   using a formula similar to Witten-Bell.
relation type, if any. (We include coreference as an addi-        In general, this model's approach to relation classifica-
tional identity "relation" type.)                              tion is similar to that of traditional handcrafted patterns.
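The feature vector model of Section 4.3 can be sketched as a mixture of two maximum likelihood estimates, one keyed on all five features and one that drops the predicate. The Witten-Bell-style mixing weight, the NO_RELATION label, and all class and method names below are assumptions for illustration; the paper gives only the general shape of the estimate.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (type of M1, type of M2, role of M1, role of M2, predicate)
Vector = Tuple[str, str, str, str, str]
NO_RELATION = "NONE"  # hypothetical label for "no relation"


class FeatureVectorModel:
    def __init__(self) -> None:
        self.full: Dict[Vector, Counter] = defaultdict(Counter)
        self.no_pred: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)

    def train(self, vector: Vector, relation: str) -> None:
        self.full[vector][relation] += 1
        self.no_pred[vector[:4]][relation] += 1  # same vector minus predicate

    @staticmethod
    def _mle(counts: Optional[Counter], relation: str) -> float:
        if not counts:
            return 0.0
        return counts[relation] / sum(counts.values())

    def prob(self, vector: Vector, relation: str) -> float:
        full = self.full.get(vector)
        backoff = self._mle(self.no_pred.get(vector[:4]), relation)
        if not full:
            return backoff  # unseen predicate: use the four-feature estimate
        n, distinct = sum(full.values()), len(full)
        lam = n / (n + distinct)  # assumed Witten-Bell-style weight
        return lam * self._mle(full, relation) + (1 - lam) * backoff

    def classify(self, vector: Vector, relations: Iterable[str]) -> str:
        return max(relations, key=lambda r: self.prob(vector, r))


model = FeatureVectorModel()
model.train(("PER", "GPE", "subject", "PP-in", "live"), "At.Located")
model.train(("PER", "GPE", "subject", "PP-in", "live"), "At.Located")
model.train(("PER", "GPE", "subject", "PP-in", "talk"), NO_RELATION)

labels = ["At.Located", NO_RELATION]
unseen_predicate = ("PER", "GPE", "subject", "PP-in", "ice-skate")
prediction = model.classify(unseen_predicate, labels)

# A different entity-type pair shares no statistics with the vectors above.
unseen_types = ("PER", "FAC", "subject", "PP-in", "ice-skate")
```

In this toy run the unseen predicate "ice-skate" is still classified through the predicate-free component, while the PER/FAC vector receives zero probability for every label because the model keys on the exact pair of entity types.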
   In the following sections, we present two separate          The features are not treated independently (unlike the
classification models for finding relations from proposi-      generative model), and together they define a simple syn-
tions, one generative and one feature-based, and discuss       tactic construction that one might write a pattern to
their relative advantages and disadvantages. Finally, we       cover. In addition, the model can tolerate significant
present a combined model.                                      amounts of noise since it relies on a simplified represen-
                                                               tation of the situation to make its classification, and one
4.2    Generative Model                                        component even ignores the predicate (for cases where
In our first, generative model, propositions are repre-        insufficient information exists).
sented as binary trees, with a left branch (mention M1)
and a right branch (mention M2). Each node of the tree         4.4    Advantages and Disadvantages
represents either the predicate of a proposition (non-         The two models have different strengths. The feature
terminals) or the type of an entity mention (terminals).       vector model performs well on vectors similar to those
Each branch has a label representing the role that its         seen in training, and it also has the ability to generalize
child node plays in the parent proposition. So, "Smith         across unseen predicate words. For example, it can cor-
traveled to Spain" is represented as:                          rectly classify "Smith ice-skated in Brazil" as an
                                                               At.Located relation even though it has never seen the
                            travel                             predicate "ice-skate." On the other hand, it treats each
                   subject /      \   PP-to                    pair of entity types as fundamentally separate from every
                                                               other pair. Thus, even when the training contains the ice-
                      PER             GPE                      skating example above, where the feature vector [PER
                                                               GPE ice-skate subject PP-in] predicts At.Located, the
Here, "travel" is the stemmed predicate of the proposi-        model will still know nothing about the vector [PER FAC
tion; "subject" and "PP-to" are the roles that the child       ice-skate subject PP-in], e.g., "Smith ice-skated in the
nodes play; and PER (person) and GPE (geo-political            local rink". Finally, this model does not deal cleanly
entity) are the entity types of M1 and M2. Where the           with nested constructions, so expressions such as "Smith
connection between the two mentions is a nested set of         made a deal with Anderson" are unlikely to be correctly
propositions (as in “Smith attended a conference in            classified.
Spain”), the tree is simply extended. This representation         The generative proposition model, on the other hand,
allows us to treat each part of the propositional structure    can handle nested propositional structures without diffi-
relatively independently.                                      culty. In addition, since the generative model treats the
   Our generative model then computes the probability          two arguments independently, it can classify relations
that a particular propositional structure conveys a par-       where it has accurate training information for the prob-
ticular relationship by estimating the joint probability of    abilities of each of the individual arguments, even if it
the relation and the structure, tracing out the sequence of    has not often seen them paired together. These independ-
decisions that together generate the relation and proposi-     ence assumptions in the generative model are often rea-
tional structure (e.g. now generate a right branch with        sonable and help to maximize the impact of the training
label PP-to) and calculating the probability of each. Each
data. On the other hand, they can sometimes lead the                word "manager" is often found in the same contexts as
model to overgeneralize.                                            "coordinator" and simultaneously say that the word
                                                                    "manager" is occasionally found in the same contexts as
4.5    Model Combination                                            the word "vendor." These two assertions (drawn from
The fact that the two models have contrasting advantages            real, automatically extracted word clusters) are, however,
suggests trying a combination model. In our combination             not independent of each other. The more non-
model, we allow the connection between a mention pair               independent features introduced into a model, the more
to be classified as a relation if and only if both models           difficult it becomes to design a generative model that
found some relation present. If the models do both find a           adequately manages all such dependencies.
relation but disagree as to which type of relation it is, we           We therefore employed discriminative learning algo-
output the relation type chosen by the feature vector               rithms (which have no such feature independence con-
model, under the assumption that the specific relation              straints) in order to take advantage of the wide set of fea-
type sometimes depends on the specific combination of               tures potentially useable in such a framework. The algo-
the two entity types.                                               rithm that drives the relation classifier here is a percep-
                                                                    tron-style Viterbi training algorithm.
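The combination rule of Section 4.5 reduces to a few lines. In this sketch each model's output is a relation type string, with None standing for "no relation found"; that encoding and the function name are assumptions for illustration.

```python
from typing import Optional


def combine(generative: Optional[str], feature_vector: Optional[str]) -> Optional[str]:
    """Combine the two Section 4 models' predictions for one mention pair.

    A relation is output only if both models found some relation. When they
    disagree on the type, the feature vector model's type is used.
    """
    if generative is None or feature_vector is None:
        return None
    return feature_vector


agree = combine("At.Located", "At.Located")
disagree = combine("Social", "At.Located")
only_one = combine("At.Located", None)
```

Requiring agreement on the presence of a relation trades recall for precision, while deferring to the feature vector model on the type reflects its sensitivity to the specific pair of entity types.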
4.6    Results                                                         This larger feature space primarily included the use of
We ran our algorithms over the official evaluation data             word clusters extracted from unannotated corpora,
for the 2002 ACE/EELD/TIDES evaluation, supplying                   WordNet synsets, surface-level features connecting pairs
the relation extraction component with correct entity in-           of mentions, and features derived from predicate-
put in order to measure its performance in isolation from           argument structures. Results of these explorations are
other factors such as co-reference. Table 1 breaks down             described in detail below.
the comparison between the learned model and our for-
mer rule-based system into more detail, showing that
                                                                    5.2    Model
the learning algorithm outperforms our state-of-the-art             The new system uses a voted perceptron algorithm to
handcrafted rules on each of the five test sets from the            classify the relationship between any two mentions in a
2002 evaluation.                                                    sentence. After the relation classification is finalized,
                                                                    argument ordering (e.g., whether A hired B or vice versa)
                                 rules        statistical           is determined by a simple maximum likelihood model
       broadcast news            36.1            40.4               based on the entity types and the predicate-argument
       eeld-1                    42.2            44.3               connection (if any). The features for the voted perceptron
       eeld-2                    50.9            52.7               are generated by 11 different feature templates, each a
       newspaper                 32.1             34                combination of one or more atomic features. Each atomic
       newswire                  42.6            43.9               feature contains one of the following pieces of informa-
                                                                    tion: the entity or mention type of one or both mentions,
       ALL                        39             41.3
                                                                    an aspect of the predicate-argument structure connecting
Table 1: The learning algorithm outperformed our state-of-the-art   the two mentions, a stemmed version of a relevant predi-
                       rule-based system                            cate, an aspect of the string connecting the two mentions,
                                                                    a WordNet synset of a relevant predicate, and a word
                                                                    cluster containing a relevant predicate. In particular, the
5     An Improved Learning Algorithm                                last three kinds of atomic features were not available to
We next developed a new system based on discriminative              the 2003 system. These we discuss in further detail now.
models designed to take advantage of a wider range of               Semantic Features
overlapping information. This new system outperformed
the approach to relation finding described in Section 4             The first new features we wanted to explore were those
(hereafter called the 2003 system) and provides a new,              that could encode semantic information extracted from
easily modifiable component that can be retrained for any           outside resources—either from WordNet or from word
target relation set.                                                clusters generated from unannotated corpora. The as-
                                                                    sumption is that although the predicate "flee" may never
5.1    Approach                                                     have been seen in our annotation, it appears in the same
                                                                    word cluster or WordNet synset as "leave," which we
The driving motivation behind the development of a new
                                                                    may have seen. Knowing that the two words are con-
relation finder was a desire to move beyond a restricted
                                                                    nected gives us an additional handle on such an instance.
set of features to a variety of inter-dependent features.
                                                                    This is a prime example of general linguistic information
For example, we believe that there is significant and use-
                                                                    being applied to a new task or domain without added
ful information gained by extracting semantic structures
                                                                    human effort.
from large unannotated corpora. This kind of informa-
                                                                      Obviously each predicate word is likely a member of
tion, however, seems best represented as a collection of
                                                                    many WordNet synsets, ranging from the most specific
potentially overlapping descriptions of a word or con-
                                                                    ("president") to the most general ("person", "living
struction: for instance, one might want to say that the
thing"). The structure of the word clusters is analogous:     not surprising; given the same features, a carefully con-
the clustering is done by organizing the N most frequent      structed generative model should outperform a relatively
words in a corpus into a tree based on the similarity of      simpler discriminative algorithm.
the contexts they appear in, and a cluster is just a branch
of that tree (Brown et al. 1992). Each branch higher in
the tree includes a set of lower (more "specific")
branches, just as a higher synset in the WordNet tree in-
cludes all of its hyponym sets. To avoid redundancy but
still capture a breadth of information, we generate a fea-
ture for every third synset and every fourth word cluster.
Surface-Level Features
The second new type of feature was designed to capture
surface-level structure potentially containing information
missed by a model that relies fully on propositional struc-   [Bar chart; bars: 2003 model, 2004 model with 2003
ture for its information. For instance, in the phrase "his    features, adding semantic features, adding string
wife and daughter," the parser will rarely create a struc-    features; surviving value label: 65.7]
ture where "his" modifies "daughter." The 2003 system
would for this reason usually miss the relation between       Figure 2: Relation performance given various feature sets
"his" and "daughter." On the other hand, the surface-
level feature "PER wife and PER" is quite indicative (as
well as common enough to occur in the training data), so      5.4        Performance Maximization
that with the addition of such a feature the 2004 system      The new statistical model outperformed the 2003 model
correctly classifies this instance as a family relation.      but still under-predicted relations. Because the 2003
                                                              model performed less well but had very high precision,
5.3   Results                                                 we were able to use the old model to improve the recall
We ran both the 2003 and 2004 algorithms over the offi-       of the new: if the new model did not predict a relation
cial 2004 ACE/TIDES evaluation data, in the manner            between two mentions, but the old model did, we allowed
described in Section 4.6. As shown in Figure 1 below, the     the old model's prediction to be added to the full 2004
2004 system outperforms the 2003 system by several            output. This operation, analogous to the previous mixing
points.                                                       of the two 2003 sub-models, successfully improved per-
                                                              formance by more than 2 points. Figure 3 below shows
                                                              the 2003 baseline, the discriminative model using the
                                                              2004 features, and the full 2004 system that includes in-
                                                              put from the old model.

[Bar chart; bars: 2003 system, 2004 system;                   [Bar chart; bars: 2003 model, using 2004 features,
surviving value label: 68.9]                                  full 2004 model; surviving value labels: 65.7, 66.3]
       Figure 1: Improved relation performance in 2004
  In Figure 2, we specifically demonstrate the impact of
adding both sets of additional features to the model,                    Figure 3: Combining the models from 2003 and 2004
compared both to the 2003 model and to a baseline 2004                               gives the best performance
model that uses only the features that were available to
the 2003 model (proposition-based, no word clusters or        This system was submitted as a part of the 2004
WordNet). The third column shows the improvement              ACE/TIDES evaluation, where it was the top-performing
gained from adding the "semantic" features discussed          system participating in the connect-the-dots RDR evalua-
above. The fourth column shows the impact of adding the       tion (the EELD task of extracting relations, given enti-
surface-level ("string") features discussed above.            ties). Its performance was state-of-the-art, among the top
  Without the advantage of new features, the 2004 model       performing systems, in the composite RDR task of de-
does not match the performance of the old model. This is
tecting entities and relations among those detected enti-       nese and Arabic. SERIF has also been deployed on a
ties.                                                           wide range of other tasks, including automated extraction
                                                                of organizational structure from message traffic and open
6    Related work                                               domain question answering.
Given a set of sentences annotated with relations, Miller
et al. (2000) describe a procedure that rewrites auto-          References
matically generated parse trees for those sentences into
relation-augmented trees; those trees form the training         Brown, P., Della Pietra, V., deSouza, P., Lai, J., Mercer,
data for a statistical parser. A simple traversal of the tree   R. 1992. Class-based N-gram Models of Natural Lan-
                                                                guage. Computational Linguistics, 18(4), pp. 467-479.
converts the relation-augmented tree produced by the
parser into the extracted relations. This system was for-       Collins, M., and Duffy, N. 2002. New Ranking Algo-
mally evaluated in MUC-7 on the TR relations (person            rithms for Parsing and Tagging: Kernels over Discrete
works for organization, organization located at place, and      Structures, and the Voted Perceptron. In Proceedings of
organization makes product). Like Miller, our approach          ACL 2002, Philadelphia, PA, pp. 263-270.
uses full parsing, but the classifier is based on proposi-
tions rather than parse trees.                                  Miller, S., Guinness, J., and Zamanian, A. 2004. Name
   Zelenko, et al. (2002) apply support vector machines         Tagging with Word Clusters and Discriminative Train-
(SVM) to extract relations based on shallow parsing.            ing. In Proceedings of HLT/NAACL 2004, pp. 337-342.
The SVM was applied to ACE RDC after that publication           Miller, S., Ramshaw, L., Fox, H., and Weischedel, R.
with excellent results. Like Zelenko, our approach views        2000. A Novel Use of Statistical Parsing to Extract Infor-
the task as a classifier applied to text mentioning entities,   mation from Text. In Proceedings of 1st Meeting of the
though we explored mixtures of several classifiers. Both        North American Chapter of the ACL, Seattle, WA,
efforts are promising and warrant further research and          pp.226-233.
   Our use of word cluster features builds on the work of       Linguistic Data Consortium. 2002. “ACE Phase 2:
Miller et al. (2004), which used word cluster features for      Information for LDC Annotators,”
name finding. That approach combines the distributional
word clustering of Brown et al. (1992) with discrimina-
tive modeling methods trained using the voted perceptron        Zelenko, D., Aone, C., and Richardella, A. 2002. Kernel
approach described in Collins and Duffy (2002).                 Methods for Relation Extraction. In Proceedings of the
                                                                Conference on Empirical Methods in Natural Language
7    Conclusion                                                 Processing 2002, Philadelphia, PA, pp. 71-78.
BBN’s SERIF system achieved both goals set forth at the
outset of its development: high performance and quick
adaptability. SERIF was the best-performing system or
among the top-performing systems in each of the English
ACE or EELD evaluations over the past three years, and
the system is also easily portable to a new domain and a
new set of entity, event, and relation types. Most of its
component models require at most new training examples
annotated by non-experts. For tasks involving general
linguistic principles such as co-reference and parsing, no
change at all is necessary. SERIF’s cross-domain per-
formance was shown to be relatively robust with no
changes to either the system or the training data, with
only small degradation on the surprise "Al-Qaida" data
set in the 2003 EELD evaluation. Cross-lingual adapta-
tion is also possible given training examples; SERIF has
been adapted for Arabic and Chinese, under DARPA’s
Translingual Information Detection, Extraction, and
Summarization (TIDES) program.
   SERIF now serves as the extraction engine at the core
of BBN’s FactBrowser™ system, which fills a database
with information extracted about entities and relations
and provides for visualization of that data through a web
browser. FactBrowser has been trained and tested on
dozens of entity types, and versions exist for both Chi-
