Baselines for Recognizing Textual Entailment

Ling 541 Final Project
Terrence Szymanski

What is Textual Entailment?
- Informally: a text T entails a hypothesis H if the meaning of H can be inferred from the meaning of T.
- Example:
  - T: Profits nearly doubled to nearly $1.8 billion.
  - H: Profits grew to nearly $1.8 billion.
  - Entailment holds (is TRUE).

Types of Entailment
 

- For many entailments, H is simply a paraphrase of all or part of T.
- Other entailments are less obvious:
  - T: Jorma Ollila joined Nokia in 1985 and held a variety of key management positions before taking the helm in 1992.
  - H: Jorma Ollila is the CEO of Nokia.
- There is roughly 95% human agreement on entailment judgments.

The PASCAL RTE Challenge


- First challenge held in 2005 (RTE1):
  - 16 entries.
  - System performances ranged from 50% to 59% accuracy.
  - Wide array of approaches, using word overlap, synonymy/word distance, statistical lexical relations, dependency tree matching, and more.
- Second challenge is underway (RTE2).

What is BLEU?
- BLEU was designed as a metric to measure the accuracy of machine-generated translations by comparing them to human-generated gold standards.
- Scores are based on n-gram overlap (typically for n = 1, 2, 3, and 4), with a penalty for overly brief translations.
- Application to RTE?
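To make the n-gram overlap concrete, here is a minimal Python sketch (not from the original slides) of clipped n-gram precision, applied to the earlier profits example; whitespace tokenization is an assumption.

```python
from collections import Counter

def ngram_precision(test, ref, n):
    """Fraction of n-grams in `test` that also occur in `ref`, with clipped counts."""
    grams = lambda seq: Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    test_counts, ref_counts = grams(test), grams(ref)
    overlap = sum(min(c, ref_counts[g]) for g, c in test_counts.items())
    return overlap / max(sum(test_counts.values()), 1)

T = "Profits nearly doubled to nearly $1.8 billion .".split()
H = "Profits grew to nearly $1.8 billion .".split()
print(ngram_precision(H, T, 2))  # 4 of H's 6 bigrams occur in T -> 0.667
```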


Using the BLEU Algorithm for RTE


- Proposed by Pérez & Alfonseca in RTE1.
- Use the traditional BLEU algorithm to capture n-gram overlap between T-H pairs.
- Find a cutoff score such that a BLEU score above the cutoff implies a TRUE entailment (otherwise FALSE).
- Roughly 50% accuracy: a simple baseline.
- However, intuitively, the BLEU algorithm is not ideal for RTE:
  - BLEU was designed for evaluating MT systems.
  - BLEU could be adjusted to better suit the RTE task.

Modifying the BLEU Algorithm
 

- Entailments are normally short; thus it does not make sense to penalize them for being short.
- BLEU uses a geometric mean to average the n-gram overlap for n = 1, 2, 3, and 4.
  - If any value of n produces a zero score, the entire score is nullified.
- Therefore: modify the algorithm to not penalize for brevity, and use a linearly weighted average:

s_N = \sum_{i=1}^{N} w_i s_i

Modifying the BLEU Algorithm


Original BLEU:

    s_i = \frac{c_{test,ref}}{c_{test}}
    s_N = \prod_{i=1}^{N} s_i^{w_i}
    s_{BLEU} = b \cdot s_N

Modified BLEU:

    s_i = \frac{c_{test,ref}}{c_{test}}
    s_N = \sum_{i=1}^{N} w_i s_i
    s_{BLEU} = s_N

Here w_i is the weighting factor (universally set to 1/N), b is the brevity factor (see the BLEU paper for details), c_{test,ref} is the count of n-grams appearing in both test and ref, and c_{test} is the total count of n-grams appearing in test.
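The two schemes differ only in the averaging step and the brevity factor, which translates almost directly into code. A minimal sketch under the same whitespace-token assumption, with uniform weights w_i = 1/N (the slides do not specify how the original system tokenized):

```python
from collections import Counter
from math import exp, log

def precision(test, ref, n):
    """s_i: clipped fraction of the n-grams of `test` that also occur in `ref`."""
    grams = lambda seq: Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    t, r = grams(test), grams(ref)
    return sum(min(c, r[g]) for g, c in t.items()) / max(sum(t.values()), 1)

def original_bleu(test, ref, N=4):
    """b times the geometric mean of s_1..s_N; any zero s_i nullifies the score."""
    s = [precision(test, ref, n) for n in range(1, N + 1)]
    if min(s) == 0.0:
        return 0.0
    # Brevity factor: penalize a `test` string shorter than its reference.
    b = 1.0 if len(test) > len(ref) else exp(1 - len(ref) / len(test))
    return b * exp(sum(log(p) for p in s) / N)  # = b * prod(s_i ** (1/N))

def modified_bleu(test, ref, N=4):
    """Linear average with w_i = 1/N and no brevity factor."""
    return sum(precision(test, ref, n) for n in range(1, N + 1)) / N
```

Because the modified score is a plain average, a zero precision at some n (common for 3- and 4-grams on short hypotheses) no longer wipes out the whole score; this is what produces the continuum of scores noted below.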

Performance Comparison
- Ran both the unmodified and the modified BLEU algorithm on the RTE1 data sets.
  - Used the development set to obtain the cutoff score.
  - Used the test set as the evaluation data.
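The slides do not show the search procedure; a hypothetical sketch of how the development set could yield the cutoff is to sweep the observed dev scores as candidate thresholds and keep the most accurate one, predicting TRUE whenever the score exceeds the cutoff:

```python
def choose_cutoff(dev_scores, dev_labels):
    """Return (cutoff, accuracy) maximizing dev accuracy for the rule
    `predict TRUE iff score > cutoff`."""
    best_cut, best_acc = 0.0, 0.0
    for cut in sorted(set(dev_scores)):
        acc = sum((s > cut) == y for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc
```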


Cutoff Score for BLEU
- The unmodified algorithm produces a high percentage of zero scores (67%).
- Not surprisingly, the resulting cutoff score is zero.


Cutoff Score for BLEU

[Chart: "Determining Cutoff Score (Original BLEU Algorithm)"; accuracy plotted against candidate BLEU cutoff scores.]

- Two equivalent cutoff scores: 0 and 0.13. Both offer 53.8% accuracy, but the zero cutoff was used because it is a natural candidate for a cutoff.

Cutoff Score for Modified BLEU
- Modified BLEU produces a continuum of scores, unlike the original BLEU.
- Need to find the optimal cutoff score, i.e. the one that maximizes accuracy.


Cutoff Score for Modified BLEU

[Chart: "Determining Cutoff Score (Modified BLEU Algorithm)"; accuracy plotted against candidate modified BLEU cutoff scores.]

- The optimal cutoff score is found to be 0.221.

Validity of cutoff scores?

[Charts: "Verifying Cutoff Score (Original BLEU Algorithm)" and "Verifying Cutoff Score (Modified BLEU Algorithm)"; accuracy plotted against candidate cutoff scores for each algorithm.]

- The original BLEU seems to have a good natural cutoff score of zero.
- The modified BLEU optimal cutoff varies depending on the data set, although 0.221 is an acceptable value. (Future data may be needed for optimization; the cutoff may also be task-specific.)

Results on RTE1 Data
 

- Original BLEU:
  - Development set: cutoff score = 0; accuracy = 53.8%
  - Test set: accuracy = 52.0%
- Modified BLEU:
  - Development set: cutoff score = 0.221; accuracy = 57.8%
  - Test set: accuracy = 53.8%

Results on RTE2 Data
 

- Original BLEU:
  - Development set: cutoff score = 0; accuracy = 56.0%
  - Test set: ???
- Modified BLEU:
  - Development set: cutoff score = 0.221; accuracy = 60.4%
  - Development set: cutoff score = 0.25; accuracy = 61.4%
  - Test set: ???
- The RTE2 test set will be released in January.

Comparison of Results
System               Dev Set (RTE1)   Test Set (RTE1)   Dev Set (RTE2)
BLEU                 53.8             52.0              56.0
Modified BLEU        57.8             53.8              60.4
Pérez & Alfonseca    54.0             49.5              n/a
RTE1 Best            n/a              58.6              n/a

Accuracy scores (%) for four systems: original BLEU, modified BLEU, Pérez & Alfonseca's implementation of BLEU, and the best submission to the RTE1 Challenge. Modified BLEU is better than the other versions of BLEU, but nowhere near the best system's performance.

End Results


- The modified BLEU algorithm outperforms the original BLEU algorithm for RTE.
  - Consistent 2-4% increase in accuracy.
- Does this mean that modified BLEU is a candidate system for RTE applications?

NO: BLEU is a baseline algorithm
 



- "Don't climb a tree to get to the moon."
- BLEU (and other n-gram based methods) are good baselines, but lack the potential for future improvement. Example:
  - T: It is not the case that John likes ice cream.
  - H: John likes ice cream.
  - Perfect n-gram overlap, but the entailment is FALSE.
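Running the earlier sketch on this pair shows the failure concretely (reusing original_bleu and modified_bleu from the sketch above; the printed values are what that sketch computes, not figures from the slides):

```python
T = "It is not the case that John likes ice cream .".split()
H = "John likes ice cream .".split()

# Every n-gram of H occurs in T, so both scores clear their cutoffs:
print(modified_bleu(H, T))  # 1.0   -> TRUE at the 0.221 cutoff
print(original_bleu(H, T))  # ~0.30 (brevity factor only) -> TRUE at the 0 cutoff
# ...yet the gold entailment judgment is FALSE.
```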

Future Improvements
 

 

- Potential exists to add word-similarity enhancements, such as synonym substitution.
- Rather than thinking of these as enhancements to the BLEU algorithm, we should think of the BLEU algorithm as a baseline for measuring the benefit offered by such improvements: i.e., compare the performance of BLEU against the performance of BLEU after synonym substitution.
- This evaluates the benefit synonym substitution can have within a larger RTE system (a sketch follows below).
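As one hedged sketch of such an enhancement (the helper below is hypothetical, not part of the original work): rewrite each hypothesis token as a WordNet synonym that actually occurs in the text, using NLTK's wordnet corpus reader, then score the rewritten hypothesis with the same BLEU functions.

```python
from nltk.corpus import wordnet  # assumes nltk.download("wordnet") has been run

def substitute_synonyms(h_tokens, t_tokens):
    """Map H tokens onto WordNet synonyms that occur in T, so n-grams can match."""
    t_vocab = set(t_tokens)
    rewritten = []
    for tok in h_tokens:
        if tok in t_vocab:
            rewritten.append(tok)
            continue
        # Lemma names across all WordNet senses of this token.
        lemmas = {lem.name() for syn in wordnet.synsets(tok) for lem in syn.lemmas()}
        rewritten.append(next((lem for lem in lemmas if lem in t_vocab), tok))
    return rewritten

# Benefit of the enhancement = score difference against the plain baseline:
# modified_bleu(substitute_synonyms(H, T), T) vs. modified_bleu(H, T)
```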

Conclusions


- The BLEU algorithm can be modified to better suit the RTE task.
  - The modifications are theory-motivated: eliminate the brevity penalty and use a linear rather than a geometric mean.
  - Performance benefits: modified BLEU consistently achieves 2-4% higher accuracy.
- Still, BLEU is only a baseline algorithm.
  - It lacks the capacity to incorporate future developments.
  - It can be used to measure the performance benefits of various enhancements.