U. WASHINGTON MACHINE READING PROPOSAL
Oren Etzioni and Daniel S. Weld (University of Washington, Seattle)
The Machine Reading Project aims to build a rapidly retargetable system that is able to read and extract important
information from large corpora (e.g., Web text & news feeds) with little or no training data. Unfortunately,
traditional information extraction techniques suffer from serious drawbacks: (1) they are highly domain-dependent
and require homogeneous corpora, whereas Web text may contain diverse writing styles and subjects, (2) each new
target concept and relation requires manually-labeled training data, which is expensive and slow to produce, and (3)
they are often too slow to effectively process Web-scale corpora. Two machine reading projects at the University of
Washington seek to overcome these drawbacks:
Etzioni’s group has developed the paradigm of Open Information Extraction. Open IE makes a single data-driven
pass over the corpus and extracts a large set of relational tuples without requiring any explicit human annotation. By
automatically discovering and extracting the relation phrases (in addition to the arguments), Open extractors, such as
ReVerb and TextRunner, have successfully scaled to millions of relations. Moreover, Open IE emphasizes
computational efficiency; for example, ReVerb processed the entire ClueWeb corpus in about one week. However,
existing Open IE methods have a weakness – they output relations and arguments as uninterpreted text strings with
limited connection to a semantic ontology.
Weld’s group has developed a set of techniques for Knowledge-based Weak Supervision. By bootstrapping from
existing KBs such as Wikipedia and Freebase, their LUCHS and MultiR systems automatically learn extractors for
these relations without any manually-labeled training data. A novel probabilistic graphical model allows their
system to cope with the noise inherent in heuristically-labeled data and is also substantially more scalable than
previous techniques. For example, they have learned extractors for over 5000 semantic relations and processed large
corpora such as Wikipedia and the NY Times archive. However, KB weak supervision also has a weakness, since it
requires a set of ground tuples for each target relation.
Bridging the Gap with Ontology Induction. We propose to investigate a set of techniques for ontology
construction – learning models of new entity classes, new relations and new event frames. This direction combines
the best attributes of Open IE with the strengths of KB weak supervision, eliminating the weaknesses of each. By
mapping its output to these semantic classes, Open IE will better handle synonymy and polysemy and will produce
higher-quality results. By using the newly generated models, KB weak supervision will extend to handle a much
greater range of world relations, including temporal fluents, which are rarely defined explicitly, even in
comprehensive KBs like Freebase. At the highest level, our ontology generation process may be divided into four
steps: 1) corpus selection, 2) relation generalization, 3) frame construction, and 4) ontology mapping. As explained
below, several of these steps are optional and for many of the steps we anticipate implementing and comparing
Clustering Input Corpora: Input clustering is an optional step, which may precede detection of new relations. The
objective is to collect a semantically homogeneous textual corpus which contains many instances of similar semantic
relations. For example, we will cluster subsections of Wikipedia articles in a specific category (e.g., aircraft or
politicians) in order to discover passages which are shared across many articles of that class. In the
case of military aircraft, for instance, Development, Design and Operational history are common subsections. By
focusing the ontology learner’s attention on just these passages, we expect to increase the quality of the proposed relations.
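As a rough illustration of the passage-selection idea, the sketch below finds subsection headings shared by most articles in a category. It is a simple frequency count standing in for the clustering step, not the proposed algorithm; the function name `common_sections`, the `min_fraction` threshold and the toy article data are all assumptions made for this example.

```python
from collections import Counter

def common_sections(articles, min_fraction=0.5):
    """Return headings that appear in at least min_fraction of the articles."""
    counts = Counter()
    for headings in articles.values():
        # Count each heading once per article, ignoring case and padding.
        counts.update({h.strip().lower() for h in headings})
    threshold = min_fraction * len(articles)
    return sorted(h for h, c in counts.items() if c >= threshold)

# Toy data (hypothetical): article title -> list of subsection headings.
articles = {
    "Aircraft A": ["Development", "Design", "Operational history", "Variants"],
    "Aircraft B": ["Development", "Design", "Operational history"],
    "Aircraft C": ["Design", "Operational history", "Accidents"],
}
shared = common_sections(articles, min_fraction=0.6)
print(shared)
```

Passages under the shared headings would then form the homogeneous sub-corpus handed to the relation learner.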
Proposing New Semantic Relations: The goal of this component is to automatically learn a structured
representation for a large number of semantic relations. With each semantic relation we wish to associate the
number of arguments, the types of each argument and a set of open relation phrases that can be used to represent that
relation in text. We will subsequently generalize this to a probabilistic extraction model using MultiR.
To construct such a database at scale, we will proceed as follows. First, we will run Open IE on a large
corpus. Second, we will identify the common relation phrases that are found in this corpus. We may subdivide each
relation phrase into several subgroups if multiple types are found in the arguments, since the phrase may
be mixing different semantic relations (e.g., ‘was born in’ mixes the location and year of birth). The arguments in
each group will then act as clean seeds for the weak-supervision techniques, which will learn an extractor for
each relation phrase subgroup.
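The subdivision step can be sketched as follows, assuming Open IE output is available as (arg1, relation phrase, arg2) triples and that some type lookup exists for arguments. The lookup table `TOY_TYPES`, the function `split_by_arg_type` and the sample tuples are hypothetical; a real system might consult a KB such as Freebase for types.

```python
from collections import defaultdict

# Hypothetical type lookup used only for this example.
TOY_TYPES = {"Seattle": "Location", "Paris": "Location",
             "1962": "Year", "1879": "Year"}

def split_by_arg_type(tuples, type_of):
    """Group (arg1, arg2) pairs by (relation phrase, type of arg2)."""
    groups = defaultdict(list)
    for arg1, phrase, arg2 in tuples:
        groups[(phrase, type_of(arg2))].append((arg1, arg2))
    return dict(groups)

tuples = [
    ("Person A", "was born in", "Seattle"),
    ("Person B", "was born in", "1962"),
    ("Person C", "was born in", "Paris"),
    ("Person D", "was born in", "1879"),
]
seeds = split_by_arg_type(tuples, lambda arg: TOY_TYPES.get(arg, "Unknown"))
for key, pairs in seeds.items():
    print(key, pairs)
```

Each resulting group, such as ('was born in', 'Location'), would supply the seed tuples for one weakly supervised extractor.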
Our next task is to merge synonymous relation phrases into a single semantic relation. We will achieve this using
data-driven clustering. We plan to employ an EM-like approach to co-cluster the relation phrases and argument
strings. The KBs can add important information here, since many synonymous ways to refer to an entity are already
noted in Freebase. In the end, we will obtain a cluster of synonymous relation phrases (e.g., ‘is a native of’ and
‘grew up in’) and an associated extractor with each cluster. We can use these extractors to populate a large database
of semantic relations.
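To make the merging idea concrete, the sketch below groups relation phrases whose extracted argument pairs overlap. It deliberately substitutes a greedy Jaccard-overlap heuristic for the EM-like co-clustering we plan to employ, so it should be read as an illustration only; `cluster_phrases`, the `threshold` value and the toy argument pairs are assumptions.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_phrases(phrase_args, threshold=0.5):
    """Greedily merge phrases whose argument-pair sets overlap enough.

    This is a simple stand-in heuristic, not an EM procedure.
    """
    clusters = []  # each entry: (set of phrases, pooled argument pairs)
    for phrase, args in phrase_args.items():
        for members, pooled in clusters:
            if jaccard(args, pooled) >= threshold:
                members.add(phrase)
                pooled |= args  # in-place update of the cluster's pool
                break
        else:
            clusters.append(({phrase}, set(args)))
    return [sorted(members) for members, _ in clusters]

# Toy data (hypothetical): relation phrase -> set of (arg1, arg2) pairs.
phrase_args = {
    "is a native of": {("A", "X"), ("B", "Y"), ("C", "Z")},
    "grew up in": {("A", "X"), ("B", "Y"), ("D", "W")},
    "was founded in": {("Org1", "1990"), ("Org2", "2001")},
}
clusters = cluster_phrases(phrase_args)
print(clusters)
```

A co-clustering approach would additionally merge synonymous argument strings while grouping phrases, which this heuristic does not attempt.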
Generating New Semantic Frames: Just as Open IE has freed IE from the need to specify relations in advance,
document level analysis needs to move beyond pre-specified semantic frames, such as the MUC-4 bombing
template that has slots for perpetrator, target, victims, device, location, and date. Manually engineering these frames
or templates is a major bottleneck to automatic document understanding.
We propose to leverage Open IE to discover Open Semantic Frames from large text corpora. We begin by creating
a RelGrams language model from co-occurrence statistics over Open IE tuples, which gives the conditional
probability of seeing a relational tuple t1 given tuple t2. From this we can create a relation graph, where the nodes
are generalized Open IE tuples and the edges are probabilities given by RelGrams. We can automatically detect
strongly connected sub-graphs as clusters of semantically coherent relations. Figure 1 shows a graph cluster
centered on the tuple (bomb, kill, *) that we have found in preliminary experiments with RelGrams. We propose to
create Open Semantic Frames from such graph clusters by semantically typing the open slots. For example, we can
discover that the open slot in (bomb, kill, *) is filled by Person and that the open slot in (*, claim, responsibility) is
filled by an Organization. We expect to find thousands of Open Semantic Frames from Web-scale text corpora.
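The sketch below illustrates one way the co-occurrence statistics could be turned into graph edges. It assumes a document-level estimate, P(t1 | t2) = (documents containing both) / (documents containing t2), which is our simplification for this example and not necessarily how RelGrams is computed. The names `relgram_probs` and `neighbors`, the `min_prob` threshold and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import permutations

def relgram_probs(documents):
    """Estimate P(t1 | t2) from document-level co-occurrence counts."""
    single = Counter()
    pair = Counter()
    for doc in documents:
        tuples = set(doc)          # count each tuple once per document
        single.update(tuples)
        pair.update(permutations(tuples, 2))  # all ordered pairs (t1, t2)
    return {(t1, t2): n / single[t2] for (t1, t2), n in pair.items()}

def neighbors(probs, center, min_prob=0.5):
    """Tuples t1 whose estimated P(t1 | center) is at least min_prob."""
    return sorted(t1 for (t1, t2), p in probs.items()
                  if t2 == center and p >= min_prob)

BOMB = ("bomb", "kill", "*")
CLAIM = ("*", "claim", "responsibility")
INJURE = ("bomb", "injure", "*")
WIN = ("*", "win", "election")

# Toy corpus (hypothetical): each document is a list of generalized tuples.
documents = [
    [BOMB, CLAIM, INJURE],
    [BOMB, CLAIM],
    [BOMB, INJURE, WIN],
    [WIN],
]
probs = relgram_probs(documents)
frame_candidates = neighbors(probs, BOMB)
print(frame_candidates)
```

Thresholded edges of this kind define the relation graph; tightly connected groups of nodes around a tuple such as (bomb, kill, *) are the candidates for an Open Semantic Frame.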
Our formal evaluation of Open Semantic Frames will include both internal evaluations (recall-precision of learned
frames, and scalability to very large text corpora) and assessing the added value of RelGram statistics for the end
task of document summarization. We will use MUC and ACE templates as a gold standard as well as tagging other
learned frames; in particular we will do a direct comparison with Chambers and Jurafsky’s system that discovered
event templates from the MUC-4 corpus. We will freely share the RelGrams knowledge base as well as the
discovered Open Semantic Frames with the NLP research community.
Figure 1: A graph cluster strongly connected to the Open IE tuple (bomb, kill, *) forms the basis
of an Open Semantic Frame for a bombing event, with open slots for perpetrator, target, victims,
organization claiming responsibility, etc. Solid lines are edges between the tuple (bomb, kill, *)
and its immediate neighbors. Dotted lines are edges between neighbors themselves.
Ontology Mapping: We wish to relate newly-discovered relations and frames to existing ontologies, since this can
improve the quality of extractors and also facilitate the induction of first-order inference rules. We will start by
extending the mapping algorithm developed as part of Velvet, our ontological smoothing technique, since it explores
the large space of database views defined using SQL joins, selections and unions and computes the maximum
likelihood joint mapping given only a set of ground tuples. However, we will need to extend this approach, because
many mappings may only be approximate and hence aggregate operators should be considered.
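As a toy illustration of mapping a new relation to a database view, the sketch below builds one candidate view with a join and ranks candidates by F1 overlap with the relation's ground tuples. The F1 score is a stand-in for the maximum likelihood mapping described above and is not Velvet's algorithm; the relation names, the helpers `join`, `f1` and `best_view`, and all data are hypothetical.

```python
def join(left, right):
    """Relational join: (a, b) in left and (b, c) in right yields (a, c)."""
    by_key = {}
    for b, c in right:
        by_key.setdefault(b, []).append(c)
    return {(a, c) for a, b in left for c in by_key.get(b, [])}

def f1(view, ground):
    """F1 overlap between a candidate view and the ground tuples."""
    hits = len(view & ground)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(view), hits / len(ground)
    return 2 * precision * recall / (precision + recall)

def best_view(views, ground):
    """Return the (name, tuples) pair of the best-scoring candidate view."""
    return max(views.items(), key=lambda kv: f1(kv[1], ground))

# Toy background KB relations (hypothetical).
born_in_city = {("P1", "Seattle"), ("P2", "Paris"), ("P3", "Tacoma")}
city_in_country = {("Seattle", "USA"), ("Paris", "France"), ("Tacoma", "USA")}
lives_in_country = {("P1", "France"), ("P2", "France")}

views = {
    "born_in_city JOIN city_in_country": join(born_in_city, city_in_country),
    "lives_in_country": lives_in_country,
}
# Ground tuples extracted for the newly discovered relation.
ground = {("P1", "USA"), ("P2", "France")}
name, view = best_view(views, ground)
score = f1(view, ground)
print(name, round(score, 2))
```

Because the best view here matches the ground tuples only partially, the example also hints at why approximate mappings need to be supported.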
STATEMENT OF WORK
We propose the following deliverables for 2012:
Our ontology-construction work is based on the ReVerb and MultiR extractors developed by our groups. As a
precursor to understanding the limitations of our relation-generation mechanisms, we will compare the precision
and recall of these extractors to ….what?? NELL, BBN? How will we do this if they don’t give us their code
or run their system on a tightly controlled dataset?
We will produce a system for proposing new semantic relations, which operates by clustering the output of open
extractions from a large corpus. A novel Fromungulator algorithm will convert these to a giant RelGram DB….
Comparisons with Chambers & Jurafsky?
We will produce an EM-based corpus segmentation and clustering algorithm, which identifies semantically
similar passages in Wikipedia articles of similar type. We will evaluate the effect of corpus segmentation by
comparing the output of the relation discovery phase with and without cluster partitioning.
We will produce an extended ontology mapping engine, Suede, which extends Velvet. It takes as input a
newly proposed relation, R, together with its selectional preferences and extracted instances, and
determines whether R is novel or whether a database view over one of the background KBs is likely to be semantically
equivalent to R. We will annotate our RelGram corpus with predicted ontological mappings when they exist.