Recognizing Textual Entailment Challenge PASCAL

  Suleiman BaniHani
Textual Entailment
 Textual entailment recognition is the
  task of deciding, given two text
  fragments, whether the meaning of
  one text is entailed by (can be inferred
  from) the other.
Task Definition
 Given pairs of small text snippets, referred to as
  Text-Hypothesis (T-H) pairs, build a system that decides
  for each T-H pair whether T entails H or not.
  Results are compared to the manual gold standard
  generated by annotators.

 Example:
    T: Kurdistan Regional Government Prime
     Minister Dr. Barham Salih was unharmed
     after an assassination attempt.
    H: Prime minister targeted for assassination
Dataset Collection and Application Settings


 The dataset of Text-Hypothesis pairs was
  collected by human annotators. It consists
  of seven subsets:
     Information Retrieval (IR)
     Comparable Documents (CD)
     Reading Comprehension (RC)
     Question Answering (QA)
     Information Extraction (IE)
     Machine Translation (MT)
     Paraphrase Acquisition (PP)
Approaching textual entailment recognition

 Solution approaches can be categorized
  as follows:
  1. Deep analysis or “understanding”
     – Using types of linguistic knowledge and
       resources to accurately recognize textual
       entailment
        Patterns of entailment (e.g. lexical relations,
         syntactic alternations)
        Processing technology (word co-occurrence
         statistics, thesaurus, parsing, etc.)
  2. Shallow approach
Baseline
 Given that half the pairs are FALSE, the
  simplest baseline is to label all pairs
  FALSE. This will achieve 50% accuracy.
Application of the BLEU (BiLingual
Evaluation Understudy) algorithm
  A shallow approach that works at the lexical level.
   It computes the percentage of n-grams in a given
      translation that also occur in the human reference
      translation; typical values of N are 1, 2, 3 and 4.
   It limits each n-gram's count to a maximum
      frequency.
      The results for each n-gram size are combined, and a
      penalty is applied to short texts.
 Scored 54% on the development set and 50% on the
   test set.
 Good results in CD, bad results in IE and IR.
 Problem: it does not recognize syntactic or semantic
   phenomena, such as synonyms and antonyms.
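
 The scoring steps above (clipped n-gram counts, combination over n, penalty
 for short texts) can be sketched roughly as follows. This is a minimal
 illustration and not the submitted system: the function names, the
 whitespace tokenization, the geometric-mean combination and the choice of
 which snippet plays the "translation" and which the "reference" are all
 assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of size n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """BLEU-style score of a candidate token list against one reference."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each n-gram count to its frequency in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total)
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_sum / max_n)


# Assumed usage: the hypothesis is scored against the text.
text = "the cat eats the mouse in the garden".split()
hypothesis = "the cat eats the mouse".split()
print(bleu(hypothesis, text))
```

 A pair would then be labeled TRUE when the score passes a threshold tuned
 on the development set; the threshold itself is not given in the slides.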
Syntactic similarities
 Human annotators were asked to divide the
  data set into:
     True by syntax
     False by syntax
     Not syntax
     Cannot decide
 A robust parser was then used to establish the
  results.
 Only a partial submission was provided, and
  humans were used for the test.
Tree edit distance
 The text and the hypothesis are each transformed into a
  tree, using a sentence splitter and a parser to create
  the syntactic representation.
 A matching module finds the best sequence of editing
  operations to obtain H from T.
 Each editing operation (deletion, insertion and
  substitution) is given a relative score.
 Finally, the total score is evaluated; if it exceeds a
  certain limit, then the pair is labeled as true.
 High accuracy for CD, but 55% overall accuracy.
 Should be enriched with resources such as WordNet
  and other libraries.
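
 The matching step can be illustrated with a small ordered-tree edit
 distance. This is only a sketch under stated assumptions: the tuple
 encoding of trees, the name tree_edit_distance and the unit operation costs
 are hypothetical, and the conversion of the cost into the thresholded score
 is left out because the slides do not specify it.

```python
from functools import lru_cache


def tree_edit_distance(t1, t2, ins=1.0, dele=1.0, sub=1.0):
    """Cheapest cost of turning tree t1 into tree t2.

    A tree is (label, children), where children is a tuple of trees.
    """

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        # f1 and f2 are forests (tuples of trees); work on the rightmost tree.
        if not f1 and not f2:
            return 0.0
        if not f1:
            _, kids = f2[-1]
            return dist(f1, f2[:-1] + kids) + ins
        if not f2:
            _, kids = f1[-1]
            return dist(f1[:-1] + kids, f2) + dele
        (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
        return min(
            # Delete the rightmost root of f1; its children move up.
            dist(f1[:-1] + kids1, f2) + dele,
            # Insert the rightmost root of f2.
            dist(f1, f2[:-1] + kids2) + ins,
            # Match the two roots, substituting the label if it differs.
            dist(kids1, kids2) + dist(f1[:-1], f2[:-1])
            + (0.0 if label1 == label2 else sub),
        )

    return dist((t1,), (t2,))


# T: "the cat eats the mouse in the garden" (simplified dependency tree)
t = ("eats", (("cat", ()), ("mouse", ()), ("in", (("garden", ()),))))
# H: "the cat eats the mouse"
h = ("eats", (("cat", ()), ("mouse", ())))
print(tree_edit_distance(t, h))  # 2.0: delete "in" and "garden"
```

 Making the substitution cost depend on a lexical resource (cheap for
 synonyms, expensive for unrelated words) is one way the enrichment
 mentioned in the last bullet could enter.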
Dependency Analysis and WordNet
   A dependency parser is used to normalize the data into an
    appropriate tree representation.
   A lexical entailment module is then applied, where the
    sub-branches of T and H can be entailed from each other using:
       Synonymy and similarity
       Hyponymy and WordNet entailment, e.g. death entails kill.
       Multiwords, e.g. melanoma entails skin-cancer.
       Negation and antonymy, where negation is propagated
        through the tree leaves.
   A matching algorithm then searches for matching branches
    between the dependency trees of T and H.
   Results show a high score in CD and between 42% and 55% in
    the other tasks.
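
 The lexical entailment step can be sketched with a toy lexicon standing in
 for WordNet. Everything named here is hypothetical (the SYNONYMS and
 BROADER_TERMS tables, both function names), and the sketch compares flat
 lists of nodes instead of matching dependency-tree branches, so it
 illustrates only the lexical test, not the full matching algorithm.

```python
# Hypothetical toy lexicon; a real system would query WordNet instead.
SYNONYMS = {frozenset({"car", "automobile"})}
BROADER_TERMS = {
    "melanoma": {"skin-cancer", "cancer"},
    "devour": {"eat"},
}


def lexically_entails(t_word, h_word):
    """True if t_word equals, is a synonym of, or is narrower than h_word."""
    return (
        t_word == h_word
        or frozenset({t_word, h_word}) in SYNONYMS
        or h_word in BROADER_TERMS.get(t_word, set())
    )


def matched_fraction(t_nodes, h_nodes):
    """Fraction of hypothesis nodes entailed by at least one text node."""
    if not h_nodes:
        return 1.0
    matched = sum(
        any(lexically_entails(t, h) for t in t_nodes) for h in h_nodes
    )
    return matched / len(h_nodes)


text_nodes = ["patient", "has", "melanoma"]
hyp_nodes = ["patient", "has", "skin-cancer"]
print(matched_fraction(text_nodes, hyp_nodes))  # 1.0
```

 The relation is directional: swapping the two lists lowers the score,
 because skin-cancer does not entail melanoma in this lexicon.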
Syntactic Graph Distance: a rule-based
and an SVM-based approach
 Uses graph distance theory, where a graph
  is used to represent the H and T pair.
 Uses similarity measures to determine
  entailment:
    T semantically subsumes H (e.g., H: [The cat
     eats the mouse] and T: [the cat devours the
     mouse]; eat generalizes devour).
    T syntactically subsumes H (e.g., H: [The cat
     eats the mouse] and T: [the cat eats the mouse
     in the garden]; T contains a specializing
     prepositional phrase).
    T directly implies H (e.g., H: [The cat killed the
     mouse], T: [the cat devours the mouse]).
Cont.
 A rule-based system computes the following:
    Node similarity
    Syntactic similarity
    Semantic similarity
 A machine learning technique is applied to
  estimate the parameters and make the final
  decision.
 Results: high for CD (0.76) and 0.44-0.59 for
  the other tasks.
Hierarchical knowledge representation
 A hierarchical, logic-based representation of the T-H
  pairs, using a description-logic-inspired language,
  Extended Feature Description Logic (EFDL), which
  is similar to concept graphs.
 Nodes in the graph represent words or phrases.
 Manually generated rewriting rules are used for
  semantic and syntactic representations.
 A sentence in the text can have several alternative
  representations.
 Entailment is decided by whether any of the sentence
  representations can infer H.
 Results: 64.8% on the system set and 56.1% on the test set;
  CD is highest, QA is lowest (50%).
Logic like formula representation
   A parser is used to convert T and H
    into a graph of logical
    phrases, where the nodes are the
    words and the links are the
    relations.
   A matching score is given for each
    pair of terms.
   A theorem prover is used to find
    the proof with the lowest cost.
   The final cost is evaluated; if it is
    less than a threshold, then the
    entailment is considered proved.
   Highest results in CD (79%), lowest
    in MT (47%); average 55%.
Atomic Propositions

   Finds the entailment relation by
    comparing the atomic propositions
    contained in T and H.
   The comparison of the atomic
    propositions is done using the
    deduction system OTTER.
   The atomic propositions are
    extracted from the text using a
    parser.
   WordNet is used for word
    relations.
   A semantic analyzer is used to
    transform the output of the parser
    into first-order logic.
   Low accuracy (0.5188), especially for
    QA (47%).
Combining a shallow overlap
technique with deep theorem
proving
   In the shallow stage, a simple frequency test of overlapping
    words is used.
   In the deep stage, a CCG parser is used to generate discourse
    representation structures (DRSs, from Discourse Representation
    Theory), which are then translated to first-order logic.
   The Vampire theorem prover and the Paradox model builder were
    used for the entailment proof.
   A knowledge base was used to validate results against real-world
    knowledge:
       WordNet
       Geographical knowledge from the CIA
       Generic axioms covering, for instance, the semantics of
        possessives, active-passives, and locations.
   The combined system has an accuracy of 0.562, while the
    shallow approach has an accuracy of 0.55.
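
 The shallow stage can be approximated by a word-overlap ratio. This is a
 minimal sketch, not the system's actual test: the function name
 word_overlap, the lowercasing and whitespace tokenization, and any decision
 threshold applied afterwards are assumptions.

```python
def word_overlap(text, hypothesis):
    """Fraction of hypothesis words that also appear in the text."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0
    return sum(word in text_words for word in hyp_words) / len(hyp_words)


t = ("Kurdistan Regional Government Prime Minister Dr. Barham Salih "
     "was unharmed after an assassination attempt.")
h = "Prime minister targeted for assassination"
print(word_overlap(t, h))  # 0.6: prime, minister, assassination
```

 On this pair from the task definition, three of the five hypothesis words
 occur in the text; the missing words ("targeted", "for") are the kind of
 gap the deep stage is meant to bridge.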
Applying COGEX logic prover
   First, a parser is used to convert the text into logic form.
   Then COGEX, a modified version of OTTER, is applied.
   The prover requires a set of clauses called the “set of
    support”, which is used to initiate the search for inferences.
   The set of support is loaded with the negated form of the
    hypothesis as well as the predicates that make up the text
    passage.
   A second list, called the usable list, is also required; it
    contains clauses used by OTTER to generate inferences.
   The usable list consists of all the axioms that have been
    generated either automatically or by hand:
       World Knowledge Axioms (manual)
       NLP Axioms (SS and SM)
       WordNet Lexical Chains
   Overall accuracy 0.551; many errors arise in the parsing stage.
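
 The set-of-support search can be illustrated with a tiny propositional
 resolution loop. This is a deliberately simplified sketch and not COGEX or
 OTTER: real provers work on first-order clauses with unification, while
 here every literal is a plain string (a leading "-" marks negation), and
 the function names and the example clauses are hypothetical.

```python
def negate(literal):
    """Flip the sign of a literal written as 'p' or '-p'."""
    return literal[1:] if literal.startswith("-") else "-" + literal


def resolve(c1, c2):
    """Yield every resolvent of two clauses (frozensets of literals)."""
    for literal in c1:
        if negate(literal) in c2:
            yield frozenset((c1 - {literal}) | (c2 - {negate(literal)}))


def refute(usable, set_of_support, max_steps=1000):
    """Return True if the empty clause is derived (entailment proved)."""
    usable = set(usable)
    sos = list(set_of_support)
    seen = usable | set(sos)
    steps = 0
    while sos and steps < max_steps:
        given = sos.pop(0)  # every inference involves a set-of-support clause
        steps += 1
        for other in list(usable) + [given]:
            for resolvent in resolve(given, other):
                if not resolvent:
                    return True  # contradiction found
                if resolvent not in seen:
                    seen.add(resolvent)
                    sos.append(resolvent)
        usable.add(given)
    return False


# Usable list: one axiom, "devour implies eat".
axioms = [frozenset({"-devour(cat,mouse)", "eat(cat,mouse)"})]
# Set of support: the text predicates plus the negated hypothesis.
support = [frozenset({"devour(cat,mouse)"}), frozenset({"-eat(cat,mouse)"})]
print(refute(axioms, support))  # True
```

 Restricting inferences to clauses descended from the set of support keeps
 the prover from deriving consequences of the axioms alone, which matters
 when the usable list holds many world-knowledge axioms.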
Comparing task accuracy

 Average accuracy by task (%):

   Task                          Accuracy
   CD  Comparable Documents      73.3
   QA  Question Answering        47.7
   MT  Machine Translation       48.3
   PP  Paraphrasing              50.7
   IE  Information Extraction    51.9
   IR  Information Retrieval     51.2
   RC  Reading Comprehension     50.5
Result comparison
Future work
 Search for a candidate parser to
  transform natural language into first-order logic.
 Use the largest available set of knowledge bases
  to capture similarity.
 Search for a robust theorem prover.
References
 The first PASCAL Recognising Textual
  Entailment Challenge (RTE I)
 Ido Dagan, Oren Glickman and Bernardo
  Magnini. The PASCAL Recognising Textual
  Entailment Challenge.
  In Proceedings of the PASCAL Challenges
  Workshop on Recognising Textual
  Entailment, 2005.
 http://www.cs.biu.ac.il/~glikmao/rte05/index.html