Multi-document Summarisation and the PASCAL Textual Entailment Challenge

Nicola Stokes
NICTA Victoria Laboratory,
Department of Computer Science and Software Engineering,
University of Melbourne.
nicola.stokes@nicta.com.au

Eamonn Newman
School of Computer Science and Informatics,
University College Dublin, Ireland.
eamonn.newman@ucd.ie

Proceedings of the Australasian Language Technology Workshop 2005, pages 215–223,
Sydney, Australia, December 2005.

Abstract

A fundamental problem for systems that require natural language understanding
capabilities is the identification of instances of semantic equivalence and
paraphrase in text. The PASCAL Recognising Textual Entailment (RTE) challenge
is a recently proposed research initiative that addresses this problem by
providing an evaluation framework for the development of generic “semantic
engines” that can be used to identify language variability in a variety of
applications such as Information Retrieval, Machine Translation and Question
Answering. This paper discusses the suitability of the RTE evaluation datasets
as a framework for evaluating the problem of redundancy recognition in
multi-document summarisation, i.e. the identification of repetitive
information across documents. This paper also reports on the development of an
additional dataset containing examples of informationally equivalent sentence
pairs that are typically found in machine-generated summaries. The performance
of a competitive entailment recognition system on this dataset is also
reported.

1   Introduction

The aim of multi-document summarisation (MDS) is to generate a concise and
coherent summary given a cluster of related documents. Although this process
is a natural extension of single-document summarisation, MDS poses a number of
unique challenges such as how to manage contradictory and repetitive
information in the cluster, and how to order extracted information in the
resultant summary. A popular approach to the MDS problem is to first identify
and cluster repetitive information units across documents, then select
representative sentences from the “dominant” clusters, and finally generate an
extractive summary from these sentences. This approach assumes, like many
others in text summarisation, that the repetition of information is an
indication of information importance and consequently summary relevancy. The
simplest method for determining commonality across documents is to group text
units (e.g. sentences, paragraphs) that exhibit a high concentration of word
overlap. However, this approximate method for recognising similar semantic
content is often insufficient due to instances of language variability such as
paraphrase and synonymy. Figure 1 shows two sentences (A and B) that are
semantically equivalent but syntactically different.

Text A: Agassi’s dream run is ended by world’s number one player.
Text B: Federer beats Agassi.

Figure 1: Paraphrases with minimal word overlap

In this paper we discuss the suitability of the recently proposed PASCAL
Recognising Textual Entailment (RTE) challenge (Dagan et al., 2005a) as an
evaluation methodology for determining the performance of redundant
information identification techniques in the context of MDS. The aim of the
RTE challenge is to aid the development of generic “semantic engines” that can
be used in a number of applications such as Information Retrieval, Information
Extraction, Text Summarisation and Machine Translation. Two types of language
variability were investigated in this year’s challenge: exact paraphrases and
textual entailment or subsumption. The evaluation was defined as a binary
classification problem where participating systems were required to identify
entailment relationships between sentence pairs, i.e. a sentence A entails
another sentence B if the meaning of B can be inferred from the meaning of A
(Dagan et al., 2005b). During the data collection effort for the challenge,
annotators were asked to limit the number of “difficult” cases of entailment
that they included in the dataset. The entailment pairs shown in Figure 2 are
representative of the level of difficulty of the subsumption relationships
found in the data, where entailment, in the majority of cases, requires
syntactic matching and synonym/paraphrase recognition rather than complex
logical inference. In this way, techniques that recognise redundant
information in MDS and entailment in the RTE challenge have a lot in common.
This point will be discussed in more detail in Section 2 of the paper.

Task=Comparable Documents; Paraphrase Example; Judgement=TRUE
Text A: Satomi Mitarai died of blood loss.
Text B: Satomi Mitarai bled to death.

Task=Comparable Documents; Textual Entailment Example; Judgement=TRUE
Text A: The Kota (Fort), or Old City, for example, sometimes called the
downtown section, is the central business district and Indonesia’s financial
capital.
Text B: The Kota is the country’s business center.

Task=Comparable Documents; Syntactic Variation Example; Judgement=TRUE
Text A: Jakarta floods easily during the rainy season.
Text B: Jakarta is easily flooded during the rainy season.

Figure 2: Examples of syntactic variation, paraphrase and information
subsumption in the RTE dataset

The RTE development and test sets are composed of entailment examples taken
from seven distinct application settings. The “Comparable Documents” portion
of the collection is intended to be representative of the types of entailment
and semantic equivalence found in multi-document summarisation. In general,
participating systems at the RTE workshop performed significantly better
(achieving as high as 87% accuracy) on this portion of the corpus. This result
suggests that the types of entailment and semantic equivalence found in MDS
are significantly less challenging than entailment found in other application
settings. In this paper we will show that this result is misleading, and that
the difficulty of identifying language variability in MDS is comparable with
the level of difficulty observed in the other application domains explored by
the RTE initiative.

The following section motivates the need for evaluating sub-tasks in MDS such
as redundant information identification, and provides a brief overview of the
techniques that have been used to identify language variability in MDS and at
the RTE challenge. Section 3 discusses the RTE framework in the context of MDS
and argues for the inclusion of additional examples of language variability
that frequently require identification in MDS but are not represented in the
current RTE evaluation dataset. Section 5 describes the University College
Dublin (UCD) RTE system, which detects entailment between sentence pairs using
linguistic and statistical language analysis techniques [1]. Section 6
discusses the performance of this system at the RTE workshop. In addition, the
results of some initial experiments are provided that support the assertion
that the performance of a competitive RTE system in an MDS application is
comparable with its performance in other RTE application settings. Finally,
Section 7 discusses some conclusions and future directions for this work.

[1] The author was involved in the development of this system before moving to
the NICTA Victoria Research Laboratory.

2   MDS and RTE

So why is the RTE challenge an attractive sub-task evaluation methodology for
MDS? Firstly, identifying semantic relationships and correctly clustering
informationally equivalent sentences is a critical analysis component in many
MDS systems for the following reasons: if sentences are incorrectly clustered
then the commonality between the documents is harder to determine, and
redundant (i.e. repetitive) information will be included in the summary, an
outcome that summarisation systems want to avoid at all costs. Secondly, there
are inherent limitations with the current summarisation evaluation standard
provided by the Document Understanding Conference (DUC) [2], where both
automatic and manual evaluation strategies are used to measure summary quality
in terms of coverage, information redundancy, readability, coherence and
grammaticality. Since its creation in 2001, the DUC initiative has helped to
ensure that real and transparent progress is being made in summarisation
research; however, because the DUC evaluation methodology is determining the
performance of many difficult natural language processing (NLP) components
concurrently (i.e. semantic analysis, content selection, sentence ordering and
natural language generation), it is often difficult to establish which
techniques employed by a particular high-performing summarisation system have
contributed most to its overall success. As such, summarisation researchers
are recognising the need for distinct evaluation frameworks for each of these
sub-components. For example, researchers at Columbia University have
separately evaluated their sentence clustering algorithm, SimFinder, which is
employed in their NewsBlaster summarisation system (McKeown et al., 2002).
More recently, Barzilay and Lapata (2005) describe an evaluation methodology
for text coherence techniques, which are commonly used by summarisation
systems to improve text readability. The following subsection provides a
flavour of the Entailment and Semantic Equivalence techniques presented at the
PASCAL RTE-2005 challenge, followed by a description of two important
contributions made by Text Summarisation researchers in this area.

[2] DUC is an annual NIST sponsored workshop that provides participants with
summarisation tasks and a corresponding evaluation framework, i.e. corpora,
gold standard summaries and evaluation metrics. http://duc.nist.gov

2.1   Language Variability Recognition Techniques

The 2005 PASCAL RTE challenge is described by the organisers as “an initial
attempt to form a generic empirical task that captures major semantic
inferences across applications” (Dagan et al., 2005b). Sixteen groups
submitted their RTE system results to the workshop. The systems used a broad
range of linguistic knowledge resources, statistical association metrics and
logical inference mechanisms. As already stated, the simplest type of semantic
equivalence measure that can be used to identify entailment is a measure of
vocabulary overlap. Consequently, nearly all of the systems at the workshop
considered uni-gram or n-gram overlap metrics when classifying entailment. A
number of more sophisticated methods were also proposed. These measures used
either statistical co-occurrence metrics (e.g. latent semantic indexing),
lexical resources for detecting semantic relationships between verbs, nouns,
and adjectives (e.g. WordNet (Miller, 1995)), or a combination of both.
Syntactic-based overlap measures, which involve calculating the degree of
match between parse tree representations of the sentence pair, were also
popular. A few groups also incorporated a logical prover with some additional
world knowledge resource such as a geospatial ontology or a semantic taxonomy.
Many of the submitted systems, such as the UCD submission described in
Section 5, considered more than just one of these measures during the
entailment recognition process. More specifically, these lexical, syntactic,
semantic or logical-based inference measures were used as partial (rather than
conclusive) evidence of the presence or absence of an entailment relationship
between two sentences.

Overall, the entailment recognition accuracy (see Section 6 for definition) of
the participating systems at the workshop ranged from 50% to 60%, where
accuracy values greater than 0.535 and 0.546 are better than chance at the
0.05 and 0.01 significance levels, respectively (Dagan et al., 2005b). The
general conclusion of the workshop was that relatively simple metrics used in
combination performed better than more complex, “deeper” metrics such as
logical inference or the incorporation of world knowledge into the
classification computation. An obvious explanation for this outcome is that
deep linguistic analysis methods are more prone to errors than simple term
overlap metrics due to additional complexities such as word sense
disambiguation.

So how do RTE techniques compare to the repetitive information detection
methods used by the text summarisation community? As already stated,
summarisation researchers have tended to favour simple similarity metrics
based on the number of shared words. There are a couple of notable exceptions,
however, which have been investigated by researchers at Columbia University.

Possibly the best-known and most successful approach to similarity detection
in automatic summarisation is the SimFinder algorithm (Hatzivassiloglou et
al., 2001). This algorithm clusters sentences that share thematic content
determined by a set of similarity features based on word, stem and WordNet
concept overlap, as well as more complex features that capture match at a
syntactic level such as subject-verb and verb-object relations. The subsequent
clustering of sentences is then performed using a non-hierarchical clustering
technique. Representative sentences from these clusters are then used to
generate a summary.

Barzilay and McKeown (2005a) describe a revision strategy for improving the
readability of the summary output of the SimFinder algorithm. Their revision
system, MultiGen, searches for semantically equivalent textual units in the
dependency tree graph representations of the summary sentences. Semantically
similar words and phrases are identified using the WordNet taxonomy and a
paraphrase dictionary, automatically constructed from parallel monolingual
corpora. Once an overlapping paraphrase has been detected in the dependency
trees, this analysis facilitates “information fusion”, i.e. the generation of
a single sentence that represents the information in the overlapping
sentences. This text generation technique has been integrated into the
Columbia NewsBlaster multi-document summarisation system (McKeown et al.,
2002).

It is clear from this discussion that the Text Summarisation community has
much to gain from, and contribute to, the advancement of Entailment and
Semantic Equivalence recognition research.

3   RTE and language variability in MDS

In this section of the paper we comment on the coverage of the RTE evaluation
corpora with respect to the type of real-world examples of semantic
equivalence that require detection during multi-document summarisation. For
the RTE 2005 challenge two development collections and one test collection
were released to participants [3]. In each case, the datasets consisted of an
equal number of positive and negative examples of entailment between sentence
pairs. During the development of these datasets annotators were asked to
collect relevant examples that corresponded to typical success and failure
settings in seven different applications, i.e. Information Retrieval (IR),
Information Extraction (IE), Machine Translation (MT), Question Answering
(QA), Paraphrase Acquisition (PP), Reading Comprehension (RC) and Comparable
Documents-style tasks (CD) such as multi-document summarisation. A more
detailed discussion of the annotation process can be found in Dagan et al.
(2005b).

[3] The RTE datasets can be downloaded from:
http://www.pascal-network.org/Challenges/RTE/Datasets

As already stated, the motivation behind this paper is to establish whether or
not these examples of language variability are reflective of the types of
information redundancy found in an MDS setting, particularly in the case of
the CD sentence pairs, which are reportedly representative of the MDS task. To
answer this question we considered Mani’s analysis of this problem in his
review of MDS methods, where he defines four distinct types of redundancy
between text elements in MDS (Mani, 2001):

 1. Two text elements are string identical when they are exact repetitions,
    i.e. the same sentence is repeated in multiple articles.
 2. Two text elements are semantically equivalent when they are exact
    paraphrases of each other.
 3. Two text elements are informationally equivalent if they are judged by
    humans to contain the same information.
 4. A text element A informationally subsumes text element B if the
    information in element B is contained in A.

A manual examination of the RTE datasets shows that string identity and
informational equivalence are not represented in these collections. Figure 2
provides examples of paraphrase and informational subsumption, i.e. textual
entailment, in the RTE data. The exclusion of string identical examples is not
considered critical as the detection of exact repetition is trivial. However,
the lack of Mani’s informational equivalence type examples is more
troublesome. An example of informational equivalence is shown in Figure 3.
What differentiates this example of language variability from those in
Figure 2 is that the common information unit is an embedded paraphrase
surrounded in both sentences by additional information. More specifically,
while Text A and B share the information unit “American Airlines laid off
flight attendants”, they also contain additional non-overlapping information
units, i.e. the federal judge turned aside a union bid to block the job
losses; unions warned travellers to expect long delays due to protests. From
our analysis we can conclude that examples of exact paraphrase and entailment
are the exception rather than the rule in MDS and other CD-type applications.
More often than not these systems will be required to deal with noisier
instances of semantic equivalence where sentences repeat embedded information
units rather than exhibit complete semantic overlap (i.e. exact paraphrase) or
subsumption.

Task=MDS; Embedded Paraphrase Example; Judgement=TRUE
Text A: American Airlines began laying off hundreds of flight attendants on
Tuesday, after a federal judge turned aside a union bid to block the job
losses.
Text B: Unions have warned travellers that they can expect long delays this
weekend as protests begin after American Airlines let a large number of flight
attendants go last week.

Figure 3: An example of informational equivalence and embedded paraphrase

In MDS, if the system can successfully detect these fuzzier examples of
information redundancy it can make an informed decision on whether to:
(a) substitute one sentence for another in the summary without any critical
loss of information or (b) fuse these sentences together as proposed by
Barzilay and McKeown (2005a). Sentence fusion would probably be the most
appropriate option in the case of the embedded paraphrase example shown in
Figure 3. With this type of natural language generation application in mind,
it would be beneficial if the RTE classification task also required systems to
explicitly identify and return the common information unit(s) between each
sentence pair, i.e. the system must justify its classification decision.

4   An MDS-based Informational Equivalence Dataset

This section describes the development of a complementary RTE-style corpus of
sentence pairs that are more reflective of the types of information redundancy
observed during multi-document summarisation [4]. Annotators were asked to use
Columbia’s online NewsBlaster summarisation system [5] (a consistent
top-performer at the annual DUC summarisation evaluation workshop) to acquire
relevant sentence pairs. This curation strategy was employed to ensure that
the MDS dataset was representative of the types of informational equivalence
that are problematic in MDS. A subsequent analysis of the official DUC summary
submissions to the multi-document summarisation task defined for the 2004
challenge (i.e. DUC task 2) indicates that these NewsBlaster examples are
consistent with the types of repetitive information that were missed by
sentence clustering strategies employed by other top-performing summarisation
systems at the workshop.

[4] The MDS corpus can be downloaded from:
http://www.cs.mu.oz.au/~nstokes/TE/MDS_corpus_1.0.xml
[5] The NewsBlaster summarisation system: http://newsblaster.cs.columbia.edu

In line with the task-specific subsets in the RTE collection, the MDS dataset
consists of 100 sentence pairs: 50 positive and 50 negative instances of
informational equivalence. Figure 4 shows an example of each classification
type. In the previous section it was explained that in order for a sentence
pair to be tagged as a positive instance of informational equivalence it had
to share an information unit; however, no formal definition of what
constitutes such a unit was provided. The formulation of such a definition is
a challenge in itself, and is currently receiving significant attention from
the Text Summarisation community in the context of summarisation evaluation
(Nenkova and Passonneau, 2004; Amigo, 2004). In the context of this task, an
information unit is defined as a unit of text that contains at least one
subject-verb relationship (i.e. a noun phrase like “Air France Flight 358” is
not a large enough information unit but “Air France Flight 358 crashed” is).
In addition, when choosing these examples annotators were asked to be mindful
of the underlying classification task in the context of a summarisation
application, i.e. would the inclusion of both sentences result in unnecessary
repetition in a summary. Any disagreement between annotators regarding the
classification of certain pairs was discussed and resolved before
experimentation on the corpus began.

From the MDS examples in Figure 4 it can also be seen that these sentences
often make reference to vague temporal expressions such as “deadline...set for
Monday” and “Monday deadline”. In order to ground these temporal references to
points in time the full text of the original source document would need to be
analysed. However, temporal resolution is not necessary in this classification
task since examples were carefully chosen to ensure that if an event (such as
a “suicide bomb attack”) is mentioned in both sentences, then the system can
assume that this information unit is referring to the
same instance of the event in time.                    Task=Comparable Documents; Judge-
                                                       ment=False;
                                                       Text A: Jennifer Hawkins is the 21-year-old
Task=MDS;           Pair     Id=4;          Judge-
                                                       beauty queen from Australia.
ment=TRUE;
                                                       Text B: Jennifer Hawkins is Australia’s 20-
Text A: The United States ratcheted up its
                                                       year-old beauty queen.
pressure Saturday on Iraqi negotiators who are
trying to meet a deadline for writing a draft
constitution set for Monday.                           Figure 5: An example of contradiction in the
Text B: With Iraq’s parliament facing a Monday         RTE data collection.
deadline to approve a new constitution, Presi-
dent Bush said Saturday that the document “is          5    The UCD Textual Entailment
a critical step on the path to Iraqi self-reliance”.        Recognition System
Task=MDS;          Pair    Id=62;        Judge-        In this section, we present an overview of the
ment=FALSE;                                            UCD Textual Entailment Recognition system,
Text A: Discovery was loaded with nearly 7,000         which was originally presented at the PASCAL
pounds of garbage that had accumulated in              RTE workshop (Newman et al., 2005). This
the space station since it was last visited by a       system uses a decision tree classifier to detect
shuttle in December 2002.                              an entailment relationship between pairs of sen-
Text B: The Discovery crew spent nine of their         tences that are represented using a number of
first 13 days in orbit transferring supplies to         difference features such as lexical, semantic and
the space station.                                     grammatical attributes of nouns, verbs and ad-
                                                       jectives. This entailment classifier was gener-
                                                       ated from the RTE training data using the C5.0
Figure 4: Pair 4 and Pair 62 are examples of           machine learning algorithm (Quinlan, 1993).
positive and negative informational equivalence        The features used to train and test the classi-
in the MDS dataset.                                    fier were calculated using the following similar-
  With regard to the negative examples of information overlap in the MDS
corpus, the sentence pairs picked from summaries contained some word
overlap, but would still be considered unique information contributors to
a summary. This helped to ensure that these negative sentence pairs were
non–trivial.

   During the creation of this corpus a number of examples of
“contradiction” (i.e. conflicting news reports on the details of a specific
event) between potentially informationally equivalent sentence pairs were
found. Although these examples represent another important problem in MDS,
they were not included in the final version of the corpus because they
frequently occur in the RTE challenge datasets in the form of negative
entailment examples as shown in Figure 5.

   In the following sections we describe the UCD RTE system, and compare
its performance on the MDS dataset to its performance on the RTE test set.
As already stated, this experiment is used to investigate our claim that
the CD task data in the RTE challenge is unrepresentative of language
variability in MDS.

ity measures:
    • The ROUGE (Recall–Oriented Understudy for Gisting Evaluation) (Lin
      and Hovy, 2004) n-gram overlap metrics, which have been used as a
      means of evaluating summary quality at the DUC summarisation
      workshop. The ROUGE package provides measurement options such as
      uni-gram, bi-gram, tri-gram and 4-gram term overlap, and a weighted
      and unweighted longest common subsequence overlap measure.
    • The Cosine Similarity metric calculates the cosine of the angle
      between the respective term vectors of the sentence pair.
    • The Hirst–St-Onge WordNet–based measure (Miller, 1995) is an edge
      counting metric that estimates the semantic distance between words
      by counting the number of relational links between them in the
      WordNet taxonomy (Budanitsky and Hirst, 2001). This metric also
      defines constraints on the length of the path and the types of
      transitive relationships that are allowed between concepts (nodes)
      in the taxonomy. These constraints are important because unlike
      other WordNet–based
      semantic relatedness measures (which only consider IS–A
      relationships) the Hirst–St-Onge metric searches for paths that
      traverse the IS–A and HAS–A hierarchies in the noun taxonomy. Hence,
      this metric provides better coverage, but it would run an increased
      risk of detecting spurious relationships if unrestricted paths were
      allowed between concepts. This feature was implemented using the
      Perl WordNet::Similarity modules developed by Patwardhan et al.
      (2003).
    • A verb–specific semantic overlap metric that uses the VerbOcean
      semantic network (Chklovski and Pantel, 2004b; Chklovski and Pantel,
      2004a) to identify instances of antonymy and near-synonymy between
      verbs. The relationships between verb–pairs in VerbOcean were
      gleaned from the web using lexico–syntactic patterns. Although
      WordNet provides a verb taxonomy, the VerbOcean data was used
      because it appears to provide better coverage of the types of
      relationships needed for detecting entailment.
    • A Latent Semantic Indexing (Deerwester et al., 1990) measure which,
      like the WordNet measure, attempts to calculate similarity beyond
      vocabulary overlap by identifying latent relationships between words
      through the analysis of co-occurrence statistics in an auxiliary
      news corpus.
    • The final similarity measure is based on a more thorough examination
      of verb semantics. This measure finds the longest common subsequence
      in the sentence–pair, and then detects evidence of contradiction or
      entailment in the subsequence (such as verb negation, synonymy,
      near-synonymy, and antonymy) using the VerbOcean taxonomy. An
      example is shown in Figure 6.
  A more detailed description of the UCD system can be found in (Newman et
al., 2005).

6    Language Variability Recognition Experiments and Results
This section of the paper reports on the performance of the UCD RTE system
on the RTE and MDS datasets. The RTE challenge defined two evaluation
metrics:

    • An accuracy score which is calculated as the number of correctly
      classified sentence pairs (positive and negative) returned by the
      system divided by the number of sentence pairs in the dataset.
    • A confidence-weighted score (CWS) that ranges between 0 (no correct
      judgements at all) and 1 (perfect score), and rewards the system
      when it assigns a higher confidence score to correct judgements
      rather than to incorrect ones.

   Task=Paraphrase Acquisition; Judgement=FALSE
   Text A: France on Saturday flew a planeload of United Nations aid into
   eastern Chad where French soldiers prepared to deploy from their base
   in Abeche towards the border with Sudan’s Darfur region.
   Text B: France on Saturday crashed a planeload of United Nations aid
   into eastern Chad

Figure 6: The Longest Common Subsequence is highlighted in italics.

   The UCD RTE and MDS results are shown in Table 1. The entailment
classifier in the MDS and RTE experiments was trained using the RTE corpus
training sets (dev1 and dev2). The average accuracy and CWS scores (0.565
and 0.6 respectively), and the task results listed below this row in the
table represent the official UCD results reported at the RTE 2005
workshop. A manual analysis of these results showed that many of the
misclassifications made by the UCD system could be attributed to the
occurrence of equivalent phrasal and compositional paraphrases, e.g. “X
invented Y” = “Y was incubated in the mind of X”. As explained in Section 5
the system can only identify word–level, atomic paraphrase units (e.g.,
child = kid; eat = devour) that are defined in the VerbOcean and WordNet
lexical resources. A more detailed discussion of system misclassifications
is provided in (Newman et al., 2005).
   Out of 16 groups UCD’s average accuracy and CWS scores were ranked 4th
and 5th respectively, where system accuracy results ranged from 0.586 to
0.495 and CWS scores from 0.686 to 0.507. In general, systems performed
significantly better on the CD entailment examples, and for many it was
this score that added some respectability to their average accuracy score.
The most plausible explanation for these high CD scores (as high as 87%
accuracy), according to Dagan et al. (2005b), is that vocabulary overlap
metrics performed very well on this task because sentence pairs containing
common terms were more likely to have the same meanings than in the other
tasks. This would imply that MDS systems need nothing more than vocabulary
overlap metrics, and that the negative effect of errors from this
component of an MDS system is minimal. However, a comparison of the UCD
system results on the CD and MDS language variability examples suggests
that redundant information detection is as difficult as the other tasks
investigated, and that further research effort is also required in this
area.

     Task              Accuracy      CWS
     MDS               0.5400        0.6006
     RTE Average       0.5650        0.6000
     CD                0.7400        0.7764
     IE                0.4917        0.5260
     IR                0.5444        0.6130
     PP                0.5600        0.5006
     MT                0.5083        0.5130
     QA                0.5385        0.5006
     RC                0.5286        0.5685

Table 1: RTE and MDS Accuracy and CWS results for the UCD entailment
classifier.

7   Conclusions

This paper evaluates the RTE challenge as a potential evaluation framework
for comparing the performance of redundant information recognition
strategies used in multi–document summarisation (MDS) to detect
informational equivalence across documents. Most MDS systems use simple
word counts to identify repetitive information. The problem with this
approach is that many sentences that convey the same information show
little surface resemblance due to linguistic phenomena such as paraphrase
and synonymy. The RTE challenge provides an opportunity for summarisation
researchers to evaluate more sophisticated redundancy identification
techniques independent of the summarisation task. However, an analysis of
the RTE development and test sets shows that this data is not
representative of the types of informational equivalence that require
detection during the MDS process. More specifically, although subsumption
relationships are a natural occurrence in applications such as Question
Answering and Information Retrieval (where the answer/relevant document
will always entail the question/query) this is not the case for
Comparable Documents-style tasks. The results of an experiment on a
complementary dataset of MDS informational equivalence examples using a
competitive RTE system showed that identifying redundancy in MDS is more
challenging than the results on the Comparable Documents portion of the
RTE test set would suggest. Consequently, if the ultimate aim of the
PASCAL RTE challenge is to build “generic semantic engines” then future
evaluations will also have to consider the identification of embedded
(semantic and syntactic) paraphrases across sentences.
   An obvious extension of this work would be to incorporate the UCD RTE
system into an MDS system, and compare its effect on summary performance
against a baseline semantic equivalence measure such as cosine similarity.
It would also be interesting to further investigate how well the RTE
evaluation framework simulates the process of identifying repetitive
information in MDS and other applications. In a paper by Barzilay and
Elhadad (2003), on sentence alignment for monolingual comparable corpora,
it was shown that the effectiveness of the alignment process increased
when the context surrounding sentences was also considered. This
conclusion suggests that future RTE evaluations should also consider
evaluating the role of context in the entailment detection process, where
additional context is provided by the document in which the sentence
occurred.

8   Acknowledgements
The support of Enterprise Ireland and NICTA (National ICT Australia) is
gratefully acknowledged.

References

E. Amigo. 2004. An empirical study of information synthesis tasks. In
   Association for Computational Linguistics (ACL’04).
R. Barzilay and N. Elhadad. 2003. Sentence alignment for monolingual
   comparable corpora. In Empirical Methods in Natural Language Processing
   (EMNLP’03).
R. Barzilay and M. Lapata. 2005. Modeling local coherence: An entity-based
   approach. In Association for Computational Linguistics (ACL’05).
R. Barzilay and K. McKeown. 2005a. Sentence
   fusion for multidocument news summarization. Computational Linguistics,
   31(3).
A. Budanitsky and G. Hirst. 2001. Semantic distance in WordNet: An
   experimental, application-oriented evaluation of five measures. In the
   Workshop on WordNet and Other Lexical Resources, NAACL’01.
T. Chklovski and P. Pantel. 2004a. Global path-based refinement of noisy
   graphs applied to verb semantics. In the International Joint Conference
   on NLP (IJCNLP-05), pages 11–13.
T. Chklovski and P. Pantel. 2004b. VerbOcean: Mining the web for
   fine–grained semantic verb relations. In Empirical Methods in Natural
   Language Processing (EMNLP-04).
I. Dagan, O. Glickman, and B. Magnini (eds). 2005a. In the PASCAL
   Recognising Textual Entailment Challenge Workshop, April 11th-13th 2005,
   Southampton, UK.
I. Dagan, O. Glickman, and B. Magnini. 2005b. The PASCAL recognising
   textual entailment challenge. In the PASCAL Recognising Textual
   Entailment Challenge Workshop 2005, pages 1–8.
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990.
   Indexing by Latent Semantic Analysis. Journal of the American Society
   for Information Science.
V. Hatzivassiloglou, J. Klavans, M. Holcombe, R. Barzilay, Min-Yen Kan, and
   K. McKeown. 2001. SimFinder: A flexible clustering tool for
   summarization. In the Workshop on Automatic Summarization, NAACL-01.
C.-Y. Lin and E. Hovy. 2004. Automatic evaluation of summaries using n-gram
   co-occurrence statistics. In the Document Understanding Conference
   (DUC’04), National Institute of Standards and Technology.
I. Mani. 2001. Automatic Summarization. John Benjamins (Natural language
   processing series, edited by Ruslan Mitkov, volume 3), Amsterdam.
K. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. Klavans,
   A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. 2002. Tracking and
   summarizing news on a daily basis with Columbia’s Newsblaster. In the
   Human Language Technology Conference (HLT’02).
G. Miller. 1995. WordNet: a lexical database for English. Communications
   of the ACM, 38(11):39–41.
A. Nenkova and R. Passonneau. 2004. Evaluating content selection in
   summarization: The Pyramid Method. In HLT–NAACL’04.
E. Newman, N. Stokes, J. Dunnion, and J. Carthy. 2005. UCD IIRG approach to
   the Textual Entailment Challenge. In the PASCAL Recognising Textual
   Entailment Challenge Workshop, pages 53–56.
S. Patwardhan, J. Michelizzi, S. Banerjee, and T. Pedersen. 2003.
   WordNet::Similarity Perl Module
   http://search.cpan.org/dist/wordnet-similarity/lib/wordnet/similarity.%pm.
J.R. Quinlan. 1993. C5.0 machine learning algorithm.
   http://www.rulequest.com.



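As an illustration of two of the surface similarity measures listed in Section 5, the following is a minimal sketch of a cosine similarity over term vectors and of a word-level longest common subsequence, applied to the sentence pair of Figure 6. It is not the UCD implementation: the lowercased whitespace tokenisation, the raw term-frequency weighting, and the function names are assumptions made for this example, and the VerbOcean lookup that the paper applies to the subsequence is not reproduced.

```python
import math
from collections import Counter


def tokens(sentence):
    """Assumed tokenisation: lowercase, split on whitespace."""
    return sentence.lower().split()


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between raw term-frequency vectors."""
    vec_a, vec_b = Counter(tokens(sentence_a)), Counter(tokens(sentence_b))
    # Counter returns 0 for terms that are absent from the other sentence.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def longest_common_subsequence(x, y):
    """Word-level LCS of two token lists via dynamic programming."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Sentence pair from Figure 6.
TEXT_A = ("France on Saturday flew a planeload of United Nations aid into "
          "eastern Chad where French soldiers prepared to deploy from their "
          "base in Abeche towards the border with Sudan's Darfur region.")
TEXT_B = ("France on Saturday crashed a planeload of United Nations aid "
          "into eastern Chad")

if __name__ == "__main__":
    print(cosine_similarity(TEXT_A, TEXT_B))
    print(" ".join(longest_common_subsequence(tokens(TEXT_A), tokens(TEXT_B))))
```

On the Figure 6 pair the subsequence covers every word of Text B except "crashed", which is why a purely surface-based measure cannot separate this negative example from a true paraphrase without the additional verb-level check described in Section 5.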

						
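The two evaluation metrics described in Section 6 can be sketched as follows. The accuracy computation follows the definition given in the text. The paper describes the confidence-weighted score only informally, so the formula used here is an assumption taken from the RTE challenge definition: judgements are ranked by decreasing confidence and the precision at each rank is averaged. The tuple layout and function names are hypothetical.

```python
def accuracy(results):
    """Fraction of sentence pairs whose predicted judgement matches the gold one.

    `results` is a list of (predicted, gold, confidence) tuples.
    """
    correct = sum(1 for predicted, gold, _ in results if predicted == gold)
    return correct / len(results)


def confidence_weighted_score(results):
    """Average, over ranks i, of (number correct among the top i) / i.

    Judgements are ranked from most to least confident, so correct
    judgements made with high confidence raise the score the most.
    """
    ranked = sorted(results, key=lambda r: r[2], reverse=True)
    correct_so_far = 0
    total = 0.0
    for rank, (predicted, gold, _) in enumerate(ranked, start=1):
        if predicted == gold:
            correct_so_far += 1
        total += correct_so_far / rank
    return total / len(ranked)


# Toy judgements (invented for illustration): one error at high confidence.
error_ranked_high = [
    (True, True, 0.9),
    (False, True, 0.8),   # incorrect judgement
    (True, True, 0.6),
    (False, False, 0.3),
]
# The same number of errors, but the error now has the lowest confidence.
error_ranked_low = [
    (True, True, 0.9),
    (True, True, 0.8),
    (False, False, 0.6),
    (False, True, 0.3),   # incorrect judgement
]

if __name__ == "__main__":
    for name, data in [("high", error_ranked_high), ("low", error_ranked_low)]:
        print(name, accuracy(data), confidence_weighted_score(data))
```

Both toy systems have the same accuracy, but the second obtains the higher CWS because its only error carries the lowest confidence. This is the behaviour the text describes, and it may help explain how the CWS can diverge from accuracy in Table 1.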