Multi-document Summarisation and the PASCAL Textual
Entailment Challenge
Nicola Stokes
NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
University of Melbourne.
nicola.stokes@nicta.com.au

Eamonn Newman
School of Computer Science and Informatics,
University College Dublin, Ireland.
eamonn.newman@ucd.ie

Abstract

A fundamental problem for systems that require natural language understanding capabilities is the identification of instances of semantic equivalence and paraphrase in text. The PASCAL Recognising Textual Entailment (RTE) challenge is a recently proposed research initiative that addressed this problem by providing an evaluation framework for the development of generic “semantic engines” that can be used to identify language variability in a variety of applications such as Information Retrieval, Machine Translation and Question Answering. This paper discusses the suitability of the RTE evaluation datasets as a framework for evaluating the problem of redundancy recognition in multi-document summarisation, i.e. the identification of repetitive information across documents. This paper also reports on the development of an additional dataset containing examples of informationally equivalent sentence pairs that are typically found in machine generated summaries. The performance of a competitive entailment recognition system on this dataset is also reported.

1 Introduction

The aim of multi–document summarisation (MDS) is to generate a concise and coherent summary given a cluster of related documents. Although this process is a natural extension of single–document summarisation, MDS poses a number of unique challenges such as, how to manage contradictory and repetitive information in the cluster, and how to order extracted information in the resultant summary. A popular approach to the MDS problem is to first identify and cluster repetitive information units across documents, then select representative sentences from the “dominant” clusters, and finally generate an extractive summary from these sentences. This approach assumes, like many others in text summarisation, that the repetition of information is an indication of information importance and consequently summary relevancy. The simplest method for determining commonality across documents is to group text units (e.g. sentences, paragraphs) that exhibit a high concentration of word overlap. However, this approximate method for recognising similar semantic content is often insufficient due to instances of language variability such as paraphrase and synonymy. Figure 1 shows two sentences (A and B) that are semantically equivalent but syntactically different.

Text A: Agassi’s dream run is ended by world’s number one player.
Text B: Federer beats Agassi.

Figure 1: Paraphrases with minimal word overlap

In this paper we discuss the suitability of the recently proposed PASCAL Recognising Textual Entailment (RTE) challenge (Dagan et al., 2005a) as an evaluation methodology for determining the performance of redundant information identification techniques in the context of MDS. The aim of the RTE challenge is to aid the development of generic “semantic engines” that can be used in a number of applications such as Information Retrieval, Information Extraction, Text Summarisation and Machine Translation. Two types of language variability were investigated in this year’s challenge: exact paraphrases and textual entailment or subsumption. The evaluation was defined as a binary classification problem where participating systems were required to identify entailment relationships between sentence pairs, i.e. a sentence A entails another sentence B if the meaning of B can be inferred from the meaning of A (Dagan et al., 2005b). During the data collection effort for the challenge, annotators were asked to limit the number of “difficult” cases of entailment
Proceedings of the Australasian Language Technology Workshop 2005, pages 215–223,
Sydney, Australia, December 2005.
that they included in the dataset. The entailment pairs shown in Figure 2 are representative of the level of difficulty of the subsumption relationships found in the data, where entailment, in the majority of cases, requires syntactic matching and synonym/paraphrase recognition rather than complex logical inference. In this way techniques that recognise redundant information in MDS and entailment in the RTE challenge have a lot in common. This point will be discussed in more detail in Section 2 of the paper.

Task=Comparable Documents; Paraphrase Example; Judgement=TRUE
Text A: Satomi Mitarai died of blood loss.
Text B: Satomi Mitarai bled to death.

Task=Comparable Documents; Textual Entailment Example; Judgement=TRUE
Text A: The Kota (Fort), or Old City, for example, sometimes called the downtown section, is the central business district and Indonesia’s financial capital.
Text B: The Kota is the country’s business center.

Task=Comparable Documents; Syntactic Variation Example; Judgement=TRUE
Text A: Jakarta floods easily during the rainy season.
Text B: Jakarta is easily flooded during the rainy season.

Figure 2: Examples of syntactic variation, paraphrase and information subsumption in the RTE dataset

The RTE development and test sets are composed of entailment examples taken from seven distinct application settings. The “Comparable Documents” portion of the collection is intended to be representative of the types of entailment and semantic equivalence found in multi-document summarisation. In general, participating systems at the RTE workshop performed significantly better (achieving as high as 87% accuracy) on this portion of the corpus. This result suggests that the types of entailment and semantic equivalence found in MDS are significantly less challenging than entailment found in other application settings. In this paper we will show that this result is misleading, and that the difficulty of identifying language variability in MDS is comparable with the level of difficulty observed in the other application domains explored by the RTE initiative.

The following section motivates the need for evaluating sub–tasks in MDS such as redundant information identification, and provides a brief overview of the techniques that have been used to identify language variability in MDS and at the RTE challenge. Section 3 discusses the RTE framework in the context of MDS and argues for the inclusion of additional examples of language variability that frequently require identification in MDS but are not represented in the current RTE evaluation dataset. Section 5 describes the University College Dublin (UCD) RTE system, which detects entailment between sentence pairs using linguistic and statistical language analysis techniques¹. Section 6 discusses the performance of this system at the RTE workshop. In addition, the results of some initial experiments are provided that support the assertion that the performance of a competitive RTE system in an MDS application is comparable with its performance in other RTE application settings. Finally, Section 7 discusses some conclusions and future directions for this work.

¹ The author was involved in the development of this system before moving to the NICTA Victoria Research Laboratory.

2 MDS and RTE

So why is the RTE challenge an attractive sub–task evaluation methodology for MDS? Firstly, identifying semantic relationships and correctly clustering informationally equivalent sentences is a critical analysis component in many MDS systems for the following reasons: if sentences are incorrectly clustered then the commonality between the documents is harder to determine, and redundant (i.e. repetitive) information will be included in the summary – an outcome that summarisation systems want to avoid at all costs. Secondly, there are inherent limitations with the current summarisation evaluation standard provided by the Document Understanding Conference (DUC)², where both automatic and manual evaluation strategies are used to measure summary quality in terms of coverage, information redundancy, readability, coherence and grammaticality. Since its creation in 2001, the DUC initiative has helped to ensure that real and transparent progress is being made in summarisation research; however, because the DUC evaluation methodology is determining the performance of many difficult natural language processing (NLP) components concurrently (i.e. semantic analysis, content selection, sentence ordering and natural language generation), it is often difficult to establish which techniques employed by a particular high performing summarisation system have contributed most to its overall success. As such, summarisation researchers are recognising the need for distinct evaluation frameworks for each of these sub-components. For example, researchers at Columbia University have separately evaluated their sentence clustering algorithm, SimFinder, which is employed in their NewsBlaster summarisation system (McKeown et al., 2002). More recently Barzilay and Lapata (Barzilay and Lapata, 2005) describe an evaluation methodology for text coherence techniques, which are commonly used by summarisation systems to improve text readability.

² DUC is an annual NIST sponsored workshop that provides participants with summarisation tasks and a corresponding evaluation framework, i.e. corpora, gold standard summaries and evaluation metrics. http://duc.nist.gov

The following subsection provides a flavour of the Entailment and Semantic Equivalence techniques presented at the PASCAL RTE–2005 challenge, followed by a description of two important contributions made by Text Summarisation researchers in this area.

2.1 Language Variability Recognition Techniques

The 2005 PASCAL RTE challenge is described by the organisers as “an initial attempt to form a generic empirical task that captures major semantic inferences across applications” (Dagan et al., 2005b). Sixteen groups submitted their RTE system results to the workshop. The systems used a broad range of linguistic knowledge resources, statistical association metrics and logical inference mechanisms. As already stated, the simplest type of semantic equivalence measure that can be used to identify entailment is a measure of vocabulary overlap. Consequently, nearly all of the systems at the workshop considered uni-gram or n-gram overlap metrics when classifying entailment. A number of more sophisticated methods were also proposed. These measures either used statistical cooccurrence metrics (e.g. latent semantic indexing), lexical resources for detecting semantic relationships between verbs, nouns, and adjectives (e.g. WordNet (Millar, 1995)) or a combination of both. Syntactic–based overlap measures, which involve calculating the degree of match between parse tree representations of the sentence pair, were also popular. A few groups also incorporated a logical prover with some additional world knowledge resource such as a geospatial ontology or a semantic taxonomy. Many of the submitted systems, such as the UCD submission described in Section 5, considered more than just one of these measures during the entailment recognition process. More specifically, these lexical, syntactic, semantic or logical–based inference measures were used as partial (rather than conclusive) evidence of the presence or absence of an entailment relationship between two sentences.

Overall the entailment recognition accuracy (see Section 6 for definition) of the participating systems at the workshop ranged from 50-60%, where accuracy measures greater than 0.535 and 0.546 are better than chance at the 0.05 and 0.01 level, respectively (Dagan et al., 2005b). The general conclusion of the workshop was that relatively simple metrics used in combination performed better than more complex, “deeper” metrics such as logical inference or the incorporation of world knowledge into the classification computation. An obvious explanation for this outcome is that deep linguistic analysis methods are more prone to errors than simple term overlap metrics due to additional complexities such as word sense disambiguation.

So how do RTE techniques compare to the repetitive information detection methods used by the text summarisation community? Well, as already stated, summarisation researchers have tended to favour simple similarity metrics based on the number of shared words. There are a couple of notable exceptions, however, which have been investigated by researchers at Columbia University.

Possibly the most well-known and successful approach to similarity detection in automatic summarisation is the SimFinder (Hatzivassiloglou et al., 2001) algorithm. This algorithm clusters sentences that share thematic content determined by a set of similarity features based on word, stem and WordNet concept overlap as well as more complex features that capture match at a syntactic level such as subject-verb and verb-object relations. The subsequent clustering of sentences is then performed using a non-hierarchical clustering technique. Representative sentences from these clusters are then used to generate a summary.

Barzilay and McKeown (2005a) describe a revision strategy for improving the readability of the summary output of the SimFinder algorithm. Their revision system, MultiGen, searches for semantically equivalent textual units in the dependency tree graph representations of the summary sentences. Semantically similar words and phrases are identified using the WordNet taxonomy and a paraphrase dictionary, automatically constructed from parallel monolingual corpora. Once an overlapping paraphrase has been detected in the dependency trees, this analysis facilitates “information fusion”, i.e. the generation of a single sentence that represents the information in the overlapping sentences. This text generation technique has been integrated into the Columbia NewsBlaster multi–document summarisation system (McKeown et al., 2002).

It is clear from this discussion that the Text Summarisation community has much to gain from, and contribute to, the advancement of Entailment and Semantic Equivalence recognition research.

3 RTE and language variability in MDS

In this section of the paper we comment on the coverage of the RTE evaluation corpora with respect to the type of real-world examples of semantic equivalence that require detection during multi-document summarisation. For the RTE 2005 challenge two development collections and one test collection were released to participants³. In each case, the datasets consisted of an equal number of positive and negative examples of entailment between sentence pairs. During the development of these datasets annotators were asked to collect relevant examples that corresponded to typical success and failure settings in seven different applications, i.e. Information Retrieval (IR), Information Extraction (IE), Machine Translation (MT), Question Answering (QA), Paraphrase Acquisition (PP), Reading Comprehension (RC) and Comparable Documents–style tasks (CD) such as multi–document summarisation. A more detailed discussion of the annotation process can be found in (Dagan et al., 2005b).

³ The RTE datasets can be downloaded from: http://www.pascal-network.org/Challenges/RTE/Datasets

As already stated, the motivation behind this paper is to establish whether or not these examples of language variability are reflective of the types of information redundancy found in an MDS setting, particularly in the case of the CD sentence pairs, which are reportedly representative of the MDS task. To answer this question we considered Mani’s analysis of this problem in his review of MDS methods, where he defines 4 distinct types of redundancy between text elements in MDS (Mani, 2001):

1. Two text elements are string identical when they are exact repetitions, i.e. the same sentence is repeated in multiple articles.
2. Two text elements are semantically equivalent when they are exact paraphrases of each other.
3. Two text elements are informationally equivalent if they are judged by humans to contain the same information.
4. A text element A informationally subsumes text element B if the information in element B is contained in A.

A manual examination of the RTE datasets shows that string identity and informational equivalence are not represented in these collections. Figure 2 provides examples of paraphrase and informational subsumption, i.e. textual entailment in the RTE data. The exclusion of string identical examples isn’t considered critical as the detection of exact repetition is trivial. However, the lack of Mani’s informational equivalence type examples is more troublesome. An example of informational equivalence is shown in Figure 3. What differentiates this example of language variability from those in Figure 2 is that the common information unit is an embedded paraphrase surrounded in both sentences by additional information. More specifically, while Text A and B share the information unit: “American Airlines laid off flight attendants”, they also contain additional non-overlapping information units, i.e. the federal judge turned aside a union bid to block the job losses; unions warned travellers to expect long delays due to protests. From our analysis we can conclude that examples of exact paraphrase and entailment are the exception rather than the rule in MDS and other CD–type applications. More often than not these systems will be required to deal with noisier instances of semantic equivalence where sentences repeat embedded information units rather than exhibit complete semantic overlap (i.e. exact paraphrase) or subsumption.

Task=MDS; Embedded Paraphrase Example; Judgement=TRUE
Text A: American Airlines began laying off hundreds of flight attendants on Tuesday, after a federal judge turned aside a union bid to block the job losses.
Text B: Unions have warned travellers that they can expect long delays this weekend as protests begin after American Airlines let a large number of flight attendants go last week.

Figure 3: An example of informational equivalence and embedded paraphrase

In MDS, if the system can successfully detect these fuzzier examples of information redundancy it can make an informed decision on whether to: (a) substitute one sentence for another in the summary without any critical loss of information or (b) fuse these sentences together as proposed by (Barzilay and McKeown, 2005a). Sentence fusion would probably be the most appropriate option in the case of the embedded paraphrase example shown in Figure 3. With this type of natural language generation application in mind, it would be beneficial if the RTE classification task also required systems to explicitly identify and return the common information unit(s) between each sentence pair, i.e. the system must justify its classification decision.

4 An MDS-based Informational Equivalence Dataset

This section describes the development of a complementary RTE-style corpus of sentence–pairs that are more reflective of the types of information redundancy observed during multi-document summarisation⁴. Annotators were asked to use Columbia’s online NewsBlaster summarisation system⁵ (a consistent top-performer at the annual DUC summarisation evaluation workshop) to acquire relevant sentence pairs. This curation strategy was employed to ensure that the MDS dataset was representative of the types of informational equivalence that are problematic in MDS. A subsequent analysis of the official DUC summary submissions to the multi-document summarisation task defined for the 2004 challenge (i.e. DUC task 2) indicates that these NewsBlaster examples are consistent with the types of repetitive information that were missed by sentence clustering strategies employed by other top performing summarisation systems at the workshop.

⁴ The MDS corpus can be downloaded from: http://www.cs.mu.oz.au/~nstokes/TE/MDS_corpus_1.0.xml
⁵ The NewsBlaster summarisation system: http://newsblaster.cs.columbia.edu

In line with the task-specific subsets in the RTE collection, the MDS dataset consists of 100 sentence pairs: 50 positive and 50 negative instances of informational equivalence. Figure 4 shows an example of each classification type. In the previous section it was explained that in order for a sentence pair to be tagged as a positive instance of informational equivalence it had to share an information unit; however, no formal definition of what constitutes such a unit was provided. The formulation of such a definition is a challenge in itself, and is currently receiving significant attention from the Text Summarisation community in the context of summarisation evaluation (Nenkova and Passonneau, 2004; Amigo, 2004). In the context of this task, an information unit is defined as a unit of text that contains at least one subject-verb relationship (i.e. a noun phrase like “Air France Flight 358” is not a large enough information unit but “Air France Flight 358 crashed” is). In addition, when choosing these examples annotators were asked to be mindful of the underlying classification task in the context of a summarisation application, i.e. would the inclusion of both sentences result in unnecessary repetition in a summary. Any disagreement between annotators regarding the classification of certain pairs was discussed and resolved before experimentation on the corpus began.

From the MDS examples in Figure 4 it can also be seen that these sentences often make reference to vague temporal expressions such as “deadline...set for Monday” and “Monday deadline”. In order to ground these temporal references to points in time the full text of the original source document would need to be analysed. However, temporal resolution is not necessary in this classification task since examples were carefully chosen to ensure that if an event (such as a “suicide bomb attack”) is mentioned in both sentences, then the system can assume that this information unit is referring to the
same instance of the event in time.

Task=MDS; Pair Id=4; Judgement=TRUE;
Text A: The United States ratcheted up its pressure Saturday on Iraqi negotiators who are trying to meet a deadline for writing a draft constitution set for Monday.
Text B: With Iraq’s parliament facing a Monday deadline to approve a new constitution, President Bush said Saturday that the document “is a critical step on the path to Iraqi self-reliance”.

Task=MDS; Pair Id=62; Judgement=FALSE;
Text A: Discovery was loaded with nearly 7,000 pounds of garbage that had accumulated in the space station since it was last visited by a shuttle in December 2002.
Text B: The Discovery crew spent nine of their first 13 days in orbit transferring supplies to the space station.

Figure 4: Pair 4 and Pair 62 are examples of positive and negative informational equivalence in the MDS dataset.

With regard to the negative examples of information overlap in the MDS corpus, sentence pairs were picked from summaries that contained some word overlap, but which would still be considered unique information contributors to a summary. This helped to ensure that these negative sentence pairs were non–trivial.

During the creation of this corpus a number of examples of “contradiction” (i.e. conflicting news reports on the details of a specific event) between potential informationally equivalent sentence pairs were found. Although these examples represent another important problem in MDS, they were not included in the final version of the corpus because they frequently occur in the RTE challenge datasets in the form of negative entailment examples as shown in Figure 5.

Task=Comparable Documents; Judgement=FALSE;
Text A: Jennifer Hawkins is the 21-year-old beauty queen from Australia.
Text B: Jennifer Hawkins is Australia’s 20-year-old beauty queen.

Figure 5: An example of contradiction in the RTE data collection.

In the following sections we describe the UCD RTE system, and compare its performance on the MDS dataset to its performance on the RTE test set. As already stated, this experiment is used to investigate our claim that the CD task data in the RTE challenge is unrepresentative of language variability in MDS.

5 The UCD Textual Entailment Recognition System

In this section, we present an overview of the UCD Textual Entailment Recognition system, which was originally presented at the PASCAL RTE workshop (Newman et al., 2005). This system uses a decision tree classifier to detect an entailment relationship between pairs of sentences that are represented using a number of different features such as lexical, semantic and grammatical attributes of nouns, verbs and adjectives. This entailment classifier was generated from the RTE training data using the C5.0 machine learning algorithm (Quinlan, 1993). The features used to train and test the classifier were calculated using the following similarity measures:

• The ROUGE (Recall–Oriented Understudy for Gisting Evaluation) (Lin and Hovy, 2004) n-gram overlap metrics, which have been used as a means of evaluating summary quality at the DUC summarisation workshop. The ROUGE package provides measurement options such as uni-gram, bi-gram, tri-gram and 4-gram term overlap, and a weighted and unweighted longest common subsequence overlap measure.

• The Cosine Similarity metric calculates the cosine of the angle between the respective term vectors of the sentence pair.

• The Hirst–St-Onge WordNet–based measure (Millar, 1995) is an edge counting metric that estimates the semantic distance between words by counting the number of relational links between them in the WordNet taxonomy (Budanitsky and Hirst, 2001). This metric also defines constraints on the length of the path and the types of transitive relationships that are allowed between concepts (nodes) in the taxonomy. These constraints are important because unlike other WordNet–based
semantic relatedness measures (which only consider IS–A relationships) the Hirst–St-Onge metric searches for paths that traverse the IS–A and HAS–A hierarchies in the noun taxonomy. Hence, this metric provides better coverage at an increased risk of detecting spurious relationships if unrestricted paths were allowed between concepts. This feature was implemented using the Perl WordNet Similarity modules developed by (Patwardhan et al., 2003).

• A verb–specific semantic overlap metric that uses the VerbOcean semantic network (Chklovski and Pantel, 2004b; Chklovski and Pantel, 2004a) to identify instances of antonymy and near-synonymy between verbs. The relationships between verb–pairs in VerbOcean were gleaned from the web using lexico–syntactic patterns. Although WordNet provides a verb taxonomy, the VerbOcean data was used because it appears to provide better coverage of the types of relationships needed for detecting entailment.

• A Latent Semantic Indexing (Deerwester et al., 1990) measure that, like the WordNet measure, attempts to calculate similarity beyond vocabulary overlap by identifying latent relationships between words through the analysis of cooccurrence statistics in an auxiliary news corpus.

• The final similarity measure is based on a more thorough examination of verb semantics. This measure finds the longest common subsequence in the sentence–pair, and then detects evidence of contradiction or entailment in the subsequence (such as verb negation, synonymy, near-synonymy, and antonymy) using the VerbOcean taxonomy. An example is shown in Figure 6.

A more detailed description of the UCD system can be found in (Newman et al., 2005).

Task=Paraphrase Acquisition; Judgement=FALSE
Text A: France on Saturday flew a planeload of United Nations aid into eastern Chad where French soldiers prepared to deploy from their base in Abeche towards the border with Sudan’s Darfur region.
Text B: France on Saturday crashed a planeload of United Nations aid into eastern Chad

Figure 6: The Longest Common Subsequence is highlighted in italics.

6 Language Variability Recognition Experiments and Results

This section of the paper reports on the performance of the UCD RTE system on the RTE and MDS datasets. The RTE challenge defined two evaluation metrics:

• An accuracy score which is calculated as the number of correctly classified sentence pairs (positive and negative) returned by the system divided by the number of sentence pairs in the dataset.

• A confidence-weighted score (CWS) that ranges between 0 (no correct judgements at all) and 1 (perfect score), and rewards the system when it assigns a higher confidence score to correct judgements rather than to incorrect ones.

The UCD RTE and MDS results are shown in Table 1. The entailment classifier in the MDS and RTE experiments was trained using the RTE corpus training sets (dev1 and dev2). The average accuracy and CWS scores (0.565 and 0.6 respectively), and the task results listed below this row in the table represent the official UCD results reported at the RTE 2005 workshop. A manual analysis of these results showed that many of the misclassifications made by the UCD system could be attributed to the occurrence of equivalent phrasal and compositional paraphrases, e.g. “X invented Y” = “Y was incubated in the mind of X”. As explained in Section 5 the system can only identify word–level, atomic paraphrase units (e.g., child = kid; eat = devour) that are defined in the VerbOcean and WordNet lexical resources. A more detailed discussion of system misclassifications is provided in (Newman et al., 2005).

Out of 16 groups UCD’s average accuracy and CWS scores were ranked 4th and 5th respectively, where system accuracy results ranged from 0.586 to 0.495 and CWS scores from 0.686 to 0.507. In general, systems performed significantly better on the CD entailment examples, and for many it was this score that added some respectability to their average accuracy score. The most plausible explanation for these high CD scores (as high as 87% accuracy), according to (Dagan et al., 2005b), is that vocabulary overlap metrics performed very well on this task because sentence pairs containing common terms were more likely to have the same meanings than in the other tasks. This implies that MDS systems need nothing more than vocabulary overlap metrics, and that the negative effect of errors from this component of an MDS system is minimal. However, a comparison of the UCD system results on the CD and MDS language variability examples suggests that redundant information detection is as difficult as the other tasks investigated, and that further research effort is also required in this area.

Task          Accuracy   CWS
MDS           0.5400     0.6006
RTE Average   0.5650     0.6000
CD            0.7400     0.7764
IE            0.4917     0.5260
IR            0.5444     0.6130
PP            0.5600     0.5006
MT            0.5083     0.5130
QA            0.5385     0.5006
RC            0.5286     0.5685

Table 1: RTE and MDS Accuracy and CWS results for the UCD entailment classifier.

7 Conclusions

This paper evaluates the RTE challenge as a potential evaluation framework for comparing the performance of redundant information recognition strategies used in multi–document summarisation (MDS) to detect informational equivalence across documents. Most MDS systems use simple word counts to identify repetitive information. The problem with this approach is that many sentences that convey the same information show little surface resemblance due to linguistic phenomena such as paraphrase and synonymy. The RTE challenge provides an opportunity for summarisation researchers to evaluate more sophisticated redundancy identification techniques independent of the summarisation task. However, an analysis of the RTE development and test sets shows that

tails the question/query) this is not the case for Comparable Documents-style tasks. The results of an experiment on a complementary dataset of MDS informational equivalence examples using a competitive RTE system showed that identifying redundancy in MDS is more challenging than the results on the Comparable Documents portion of the RTE test set would suggest. Consequently, if the ultimate aim of the PASCAL RTE challenge is to build “generic semantic engines” then future evaluations will also have to consider the identification of embedded (semantic and syntactic) paraphrases across sentences.

An obvious extension of this work would be to incorporate the UCD RTE system into an MDS system, and compare its effect on summary performance against a baseline semantic equivalence measure such as cosine similarity. It would also be interesting to further investigate how well the RTE evaluation framework simulates the process of identifying repetitive information in MDS and other applications. In a paper by Barzilay and Elhadad (2003), on sentence alignment for monolingual comparable corpora, it was shown that the effectiveness of the alignment process increased when the context surrounding sentences was also considered. This conclusion suggests that future RTE evaluations should also consider evaluating the role of context in the entailment detection process, where additional context is provided by the document in which the sentence occurred.

8 Acknowledgements

The support of Enterprise Ireland and NICTA (National ICT Australia) is gratefully acknowledged.

References

E. Amigo. 2004. An empirical study of information synthesis tasks. In Association for Computational Linguistics (ACL’04).
R. Barzilay and N. Elhadad. 2003. Sentence alignment for monolingual comparable cor-
this data is not representative of the types of in- pora. In Empirical Methods in Natural Lan-
formational equivalence that require detection guage Processing (EMNLP’03).
during the MDS process. More specifically, al- R. Barzilay and M. Lapata. 2005. Modeling
though subsumption relationships are a natu- local coherence: An entity-based approach.
ral occurrence in applications such as Question In Association for Computational Linguistics
Answering and Information Retrieval (where (ACL’05).
the answer/relevant document will always en- R. Barzilay and K. McKeown. 2005a. Sentence
222
fusion for multidocument news summariza- ing content selection in summarization: The
tion. Computational Linguistics, 31(3). Pyramid Method. In HLT–NAACL’04.
A. Budanitsky and G. Hirst. 2001. Seman- E. Newman, N. Stokes, J. Dunnion, and
tic distance in WordNet: An experimental, J. Carthy. 2005. UCD IIRG approach to the
application-oriented evaluation of five mea- Textual Entailment Challenge. In the PAS-
sures. In the Workshop on WordNet and CAL Recognising Textual Entailment Chal-
Other Lexical Resources, NAACL’01. lenge Workshop, pages 53–56.
T. Chklovski and P. Pantel. 2004a. Global S. Patwardhan, J. Michelizzi, S. Banerjee, and
path-based refinement of noisy graphs applied T. Pedersen. 2003. WordNet::Similarity
to verb semantics. In the International Joint Perl Module http://search.cpan.org/
Conference on NLP (IJCNLP-05), pages 11– dist/wordnet-similarity/lib/wordnet/
13. similarity.%pm.
T. Chklovski and P. Pantel. 2004b. VerbO- J.R. Quinlan. 1993. C5.0 machine learning al-
cean: Mining the web for fine–grained seman- gorithm. http://www.rulequest.com.
tic verb relations. In Empirical Methods in
Natural Language Processing (EMNLP-04).
I. Dagan, O. Glickman, and B. Magnini (eds).
2005a. In the PASCAL Recognising Textual
Entailment Challenge Workshop, April 11th-
13th 2005, Southampton, UK.
I. Dagan, O. Glickman, and B. Magnini. 2005b.
The PASCAL recognising textual entailment
challenge. In the PASCAL Recognising Tex-
tual Entailment Challenge Workshop 2005,
pages 1–8.
S. Deerwester, S. Dumais, G. Furnas, T. Lan-
dauer, and R. Harshman. 1990. Indexing
by Latent Semantic Analysis. Journal of the
American Society for Information Science.
V. Hatzivassiloglou, J. Klavans, M. Holcombe,
R. Barzilay, Min-Yen Kan, and K. McKeown.
2001. SimFinder: A flexible clustering tool
for summarization. In the Workshop on Au-
tomatic Summarization, NAACL-01.
C.-Y. Lin and E. Hovy. 2004. Automatic
evaluation of summaries using n-gram co–
occurence statistics. In the Document Under-
standing Conference (DUC’04), National In-
stitute of Standards and Technology.
I Mani. 2001. Automatic Summarization. John
Benjamins (Natural language processing se-
ries, edited by Ruslan Mitkov, volume 3),
Amsterdam.
K. McKeown, R. Barzilay, D. Evans, V. Hatzi-
vassiloglou, J. Klavans, A. Nenkova, C. Sable,
B. Schiffman, and S. Sigelman. 2002. Track-
ing and summarizing news on a daily basis
with Columbia’s Newsblaster. In the Human
Language Technology Conference (HLT’02).
G. Millar. 1995. WordNet: a lexical database
for english. Communications of the ACM,
38(11):39–41.
A. Nenkova and R. Passonneau. 2004. Evaluat-
223
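As an illustration of the baseline semantic equivalence measure mentioned in the conclusions, the following is a minimal sketch of bag-of-words cosine similarity applied to the two sentences of Figure 1. The tokenisation (lower-cased alphabetic runs, raw term frequencies, no stemming or stopword removal) and the function names are assumptions made for this example; they are not taken from the UCD system or from any particular MDS system.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lower-case the sentence and count its alphabetic tokens (assumed tokenisation)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the raw term-frequency vectors of a and b."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only terms shared by both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty sentence has a zero vector; define its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The paraphrase pair from Figure 1.
text_a = "Agassi's dream run is ended by world's number one player."
text_b = "Federer beats Agassi."

similarity = cosine_similarity(text_a, text_b)
print(f"cosine(A, B) = {similarity:.3f}")
```

Because the two sentences share only the term "agassi", the score stays close to zero even though the sentences are semantically equivalent, which is the kind of case where a word-overlap baseline would be expected to miss redundant content.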