Discovering Temporal and Causal Relations in Biomedical Texts

    Rutu Mulkar-Mehta, Jerry R. Hobbs,
   Chun-Chi Liu, Xianghong Jasmine Zhou
      University of Southern California
Approach
   Use our Natural Language Pipeline:
      Input text → Parser (Charniak Parser) → Logical Form (LFToolkit)
                 → Abductive Engine (Mini-TACITUS) → Output interpretation
   Using manually created axioms
   To discover temporal and causal relations in biomedical text
              Index
•   Problem
•   Data
•   Method
•   Results
Introduction

Previous Work: 3 Learning by Reading Projects
1. Chemistry Texts: (Early 2006)
     • Domain: High School chemistry textbook
     • Learned: Descriptions of chemical reactions and chemical entities
     • Result: System answered “true/false” and “what” questions
     • Inference Engine: PowerLoom
     • Participants: USC-ISI
2. Biology Texts: (2006-2007)
     • Domain: Paragraphs from Wikipedia related to the heart
     • Learned: Information about parts of the heart and events related to it
     • Result: Complex textual models were created; “what” and “how” questions
       were answered
     • Backend Knowledgebase: Knowledge Machine
     • Participants: USC-ISI, University of Texas at Austin, SRI, Noah Friedland
3. Engine Texts: (2007-2008)
     • The complexity of the domain made the problem novel
     • Participants: USC-ISI, University of Texas at Austin, SRI, Noah Friedland, BBN
Limitations of Previous Work


• All 3 systems learned about entities:
    – heart, blood, valve, fuel
• All 3 systems learned about the relationships between the entities
    – heart pumps blood
    – exhaust valve drains fuel
• All 3 systems learned about states and properties of entities:
    – blood is oxygenated when it reaches various body parts
    – blood is not oxygenated when it returns to the heart

BUT

None of the systems learned the temporal and causal relations
  between these events.
Causal and Temporal Information in Biomedical Texts


Biomedical texts are rich in causal and temporal information conveyed
   via various overt temporal markers

   Example sentence (events labelled A to E):
        XPD appears to be degraded in wild-type embryos [A]
        between prophase [B] and metaphase of the first cell division [C]
        after the onset of zygotic gene expression [D],
        which coincides with a redistribution of CDK7 from the
        cytoplasm to the nucleus [E].

   Schematically: A between B and C, after D, which coincides with E.

   Timeline (figure):   B     A     C
                        D=E
Previous Work
• (Filatova and Hovy 2001)
   – Use timestamps from newspaper articles
• (Lapata and Lascarides 2004)
   – Only consider sentences with a main and subordinate
     clause
• (Bethard et al. 2008)
   – Only consider sentences where events are conjoined with
     “and”
• (Pustejovsky, Mani and Hobbs 2007; Verhagen 2004): TARSQI
   – Temporal Awareness and Reasoning Systems for Question Interpretation
   – Temporal Ordering Toolkit
TARSQI

• Current state of the art for event recognition and event ordering
• (Pustejovsky, Mani and Hobbs 2007; Verhagen 2004)
• Use succession of tenses for analyzing temporal relations

Event Recognition for our domain using TARSQI
                                      Recall     Precision
        Non-biological events         70.60%     97.80%

Event Ordering for our domain using TARSQI
                                      Recall     Precision
        Non-biological events         47.00%     47.60%
        Including biological events   26.52%     27.03%
TARSQI
Shortcomings with respect to scientific texts
• Trained on newspaper articles
• Event order is often determined by the succession of tenses

e.g.
The progression into G2 phase depends mainly on another
   wave of cyclin production.

TARSQI returns an incorrect result:
  progression → production

We want to use temporal markers, e.g. “depends on”
Signals that indicate temporal precedence



• Words conveying causality:
   E.g. In Drosophila, inactivation of the CDK7 mutant causes embryonic and
      larval lethality and a block to mitosis in the germline.
      Mutations in the gene encoding XPD lead to dysregulation of nuclear-
      receptor-dependent gene expression.
• Words conveying aspect:
   E.g. In S. pombe, the Wee1 kinase allows entry into mitosis when cells reach
      a minimum threshold size.
• Words conveying dependence:
   E.g. Regulation of cell cycle progression is another general stress response
      critical for cell survival.
• Words conveying control:
   E.g. Dephosphorylation of these residues by the Cdc25 phosphatase is the key
      event governing the initiation of mitosis.
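
The four signal classes above can be pictured as a small cue lexicon that is scanned against each sentence. The sketch below is only an illustration under stated assumptions: the class names come from the slide, but the cue phrases in `CUES` were read off the example sentences and are not the authors' list, and `find_cues` is a hypothetical helper, not part of the pipeline.

```python
import re

# Hypothetical cue lexicon. The classes are from the slide; the phrases are
# assumptions taken from the example sentences, not the authors' word list.
CUES = {
    "causality": ["causes", "lead to"],
    "aspect": ["entry into"],
    "dependence": ["critical for"],
    "control": ["governing"],
}


def find_cues(sentence):
    """Return (class, cue) pairs whose cue phrase occurs in the sentence."""
    lowered = sentence.lower()
    hits = []
    for cue_class, phrases in CUES.items():
        for phrase in phrases:
            # Whole-word match so that "causes" does not fire inside a longer token.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                hits.append((cue_class, phrase))
    return hits
```

Plain string matching like this ignores syntax entirely; the approach in these slides instead matches cues against the logical form produced by the parser.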
Annotation Effort


• What we want to annotate:
    – Events in the text and the causal and temporal relation between those events
      per sentence.
• Step 1: Event Identification:
    – TARSQI captured non-biology specific verbs and event nouns
    – Biology specific events were manually annotated (e.g. transcription, regulation)
• Step 2: Identify relation between events
    – Annotation Tool:
        • RSTTool (Version 3.0) by Michael O’Donnell
        • http://www.wagsoft.com/RSTTool/
        • Tool allowed: text segmentation, text structuring, relation specification, and maintenance
          of statistics for relations.
    – How to Annotate:
        • Fragment sentence into individual fragments that convey a temporal relation
        • Create a temporal link connecting the head of one fragment to the head of the other
• Corpus
    – 190 sentences from Biomedical literature
    – Relations: before, simultaneous
Inter-Annotator Agreement
2 Annotators
40 Random sentences
215 events in those sentences

Annotator A (BioMed Background): 90 Relations
Annotator B (CS Background): 73 relations

Agreement
   Annotator A was selected as the gold standard.
   Annotator B achieved a precision of 69.8% and a recall of 53.3%.
   The annotations of each annotator were judged against the transitive
   closure of the other annotator:
   RB ∈ RA (if A mentions RB or RB can be derived from all relations by A)
   RA ∈ RB (if B mentions RA or RA can be derived from all relations by B)
Gold Standard
   Created by combining the annotations of both annotators after discussion
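
Judging one annotator against the transitive closure of the other can be sketched as follows. This is a minimal illustration that assumes only "before" links, each written as a pair `(x, y)` meaning "x before y"; handling of "simultaneous" links is left out, and the function names are hypothetical rather than taken from the annotation tooling.

```python
from itertools import product


def transitive_closure(before_pairs):
    """Close a set of (x, y) 'x before y' pairs under transitivity."""
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def score(candidate, reference):
    """Precision and recall, each side judged against the other's closure.

    Assumes both relation sets are non-empty.
    """
    cand, ref = set(candidate), set(reference)
    precision = len(cand & transitive_closure(ref)) / len(cand)
    recall = len(ref & transitive_closure(cand)) / len(ref)
    return precision, recall
```

With this scoring, a link that one annotator states explicitly counts as matched when the other annotator only implies it through a chain of links.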
Baseline Calculation
• Baseline 1: (Random)
    – System was given all the events involved in a temporal relation
    – System randomly picked 2 events within a sentence and gave them a random
      relation (Relations: before, during)
• Baseline 2: (Simple but bad heuristic: Text order reflects event order)
    – System was given all the events involved in a temporal relation
    – System gave a relation to all events in a sentence based on the order of
      occurrence. (X … Y ⇒ X before Y)
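
The two baselines can be sketched in a few lines. This is an illustration under assumptions: events are given as a list in text order, the function names are hypothetical, and Baseline 2 is read here as linking every earlier event to every later one, which the slide does not state explicitly.

```python
import random

RELATIONS = ("before", "during")


def random_baseline(events, rng):
    """Baseline 1: pick two distinct events and assign a random relation."""
    first, second = rng.sample(events, 2)
    return (first, rng.choice(RELATIONS), second)


def text_order_baseline(events):
    """Baseline 2: every earlier-mentioned event is 'before' every later one.

    Assumption: all ordered pairs are linked, not only adjacent events.
    """
    return [
        (events[i], "before", events[j])
        for i in range(len(events))
        for j in range(i + 1, len(events))
    ]
```

For example, `text_order_baseline(["exposure", "activation"])` yields a single link stating that the exposure precedes the activation.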
Approach: Natural Language Pipeline

 Input text → Parser (Charniak Parser) → Logical Form (LFToolkit)
            → Abductive Engine (Mini-TACITUS) → Output interpretation
Natural Language Pipeline


E.g. Exposure of cells to stress results in rapid activation of MAPKs.
• Parser: Charniak Parser (Charniak 1999)
        (S1 (S (NP (NP (NN Exposure))
                   (PP (IN of) (NP (NNS cells)))
                   (PP (TO to) (NP (NN stress))))
               (VP (VBZ results)
                   (PP (IN in)
                       (NP (NP (JJ rapid) (NN activation))
                           (PP (IN of) (NP (NNS MAPKs))))))
               (. .)))

• Logical Form: LFToolkit (Rathod and Hobbs 2005)
   exposure-nn(x2) & of-in(x2,x10) & cell-nn(x10) & to-in(x2,x6) & stress-nn(x6) &
      result-vb’(e0,x2) & in-in(e0,x8) & rapid-adj(x8) & activation-nn(x8) &
      of-in(x8,x12) & mapks-nn(x12)
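
To work with a logical form like the one above programmatically, it helps to split it into predicate/argument literals. The following is a minimal sketch, not LFToolkit's own output format or API: `parse_logical_form` is a hypothetical helper, and it assumes literals are flat (no nested parentheses) and drops the prime mark on eventuality predicates such as `result-vb’`.

```python
import re

# A literal is a predicate name, an optional prime mark, and a flat
# comma-separated argument list, e.g. result-vb’(e0,x2).
LITERAL = re.compile(r"([A-Za-z][\w-]*)[’']?\(([^()]*)\)")


def parse_logical_form(text):
    """Split 'p(a,b) & q(c)' into [('p', ('a', 'b')), ('q', ('c',))]."""
    literals = []
    for name, args in LITERAL.findall(text):
        literals.append((name, tuple(a.strip() for a in args.split(","))))
    return literals
```

Shared variables such as `x2` and `x8` are what tie the literals together, so keeping the arguments as tuples makes later pattern matching straightforward.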
Natural Language Pipeline


• Abductive Inference Engine: Mini-TACITUS

 exposure-nn(x2) & …& result-vb’(e0,x2) & in-in(e0,x8) & …& activation-nn(x8) & …


                      BEFORE(x2,x8) & CAUSES(x2,x8)
     Final Logical Form: exposure-nn(x2) & …& BEFORE(x2,x8) &
                         CAUSES(x2,x8) & …& activation-nn(x8) & …
Axiom Creation Process: Manual



• 66 Axioms (Causal and aspectual words)
• 190 sentences from 2 biomedical articles on the cell cycle
    Causal Axioms
1   CAUSES(x3,e4) & BEFORE(x3,e4) ← in-in(e4,x2) & response-nn(x2) & to-in(x2,x3)
2   CAUSES(x0,x8) & BEFORE(x0,x8) ← result-vb’(e0,x0) & in-in(e0,x8)
3   CAUSES(x1,x3) & BEFORE(x1,x3) ← lead-vb’(e1,x1) & to-in(e1,x3)
4   CAUSES(e4,x3) & BEFORE(e4,x3) ← in-in(e4,x2) & order(x2) & to-in(x2,x3)
5   CAUSES(x15,e14) & BEFORE(x15,e14) ← upon(e14,x15)
    Aspectual Axioms
6   BEGINS(x1,e2) ⇔ start-vb’(e1,x1,e2)
7   BEGINS(x1,x2) ⇔ progression-nn(x1) & into-in(x1,x2)
8   BEGINS(x1,x2) ⇔ onset-nn(x1) & of-in(x1,x2)
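
How an axiom such as causal axiom 2 licenses `CAUSES` and `BEFORE` for the example sentence can be sketched with plain pattern matching over literals. This is a deliberate simplification and an assumption on our part: Mini-TACITUS performs abductive inference, whereas the hypothetical `match` and `apply_axiom` below only bind the variables of the textual pattern against the logical form and instantiate the relation literals.

```python
def match(body, facts, binding=None):
    """Yield every variable binding under which all body literals occur in facts."""
    binding = binding or {}
    if not body:
        yield binding
        return
    (pred, args), rest = body[0], body[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        new = dict(binding)
        # setdefault binds a fresh variable or returns its existing value,
        # so a variable already bound must agree with this fact.
        if all(new.setdefault(var, val) == val for var, val in zip(args, fact_args)):
            yield from match(rest, facts, new)


def apply_axiom(head, body, facts):
    """Instantiate the head literals once per successful match of the body."""
    derived = []
    for binding in match(body, facts):
        for pred, args in head:
            derived.append((pred, tuple(binding[a] for a in args)))
    return derived
```

Applied to the literals of "Exposure of cells to stress results in rapid activation of MAPKs", the pattern `result-vb(e0,x0) & in-in(e0,x8)` binds `x0` to the exposure and `x8` to the activation, giving the `CAUSES` and `BEFORE` literals shown on the previous slide.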
Results of Axiom Implementation on Test Set

                    All                Causal               Aspect
                All      NBP        All      NBP        All      NBP
 Precision     67.7%    100%       69.2%    100%       33.3%    100%
 Recall        29.4%    47.05%     64.3%    90%        12.5%    16.7%

 NBP = No Bad Parses
 • 100% precision with No Bad Parses
 • 90% of the error is due to bad parses
 • Aspect: many more ways to represent aspectual words
Causal and Temporal Information in Biomedical Texts

   A between B and C, after D, which coincides with E.

   Good Parse, timeline (figure):   B     A     C
                                    D=E

   Bad Parse, timeline (figure):    B     A=E     C
                                    D
Completeness

• Indicators of Completeness
   – Precision and recall with no bad parses (Precision: 100%, recall: 47.05%)
   – Graph of Axioms generated every 10 sentences
        • Manually created axioms incrementally 10 sentences at a time for 190 sentences
        • 38 axioms were created in the first 50 sentences
        • 7 axioms created in the last 50 sentences
   – Semi-Automatically extract patterns conveying temporal information
Semi-Automatic Axiom Creation


• Question: Can we semi-automatically collect all axioms from the
  text?
   – Paraphrasing Algorithm: (Bhagat and Ravichandran 2008), ACL
• Method:
   – Input
       • Causal relations
             – 12 seed words
             – 49 total patterns including verb variations
       • Aspectual Relations
             – 15 seed words
             – 31 total patterns including verb variations
   – Algorithm
       • Patterns of the form X seed Y were extracted from different releases of the
         LDC corpus
       • Find other seeds which occur in the same contexts
• Corpus: LDC corpus containing over 2 billion words
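
The idea of finding other phrases that occur in the same X ... Y contexts as a seed can be illustrated with a toy sketch. It is not the Bhagat and Ravichandran algorithm and does not touch the LDC corpus: the function names are hypothetical, and the segmentation (first word is X, last word is Y, everything in between is the connecting phrase) is a strong simplifying assumption.

```python
import re
from collections import defaultdict


def collect_contexts(sentences):
    """Map each connecting phrase to the set of (X, Y) word pairs it links."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # Toy assumption: X is the first word, Y the last, the rest is the phrase.
        if len(words) >= 3:
            contexts[" ".join(words[1:-1])].add((words[0], words[-1]))
    return contexts


def expand_seeds(seeds, sentences):
    """Return non-seed phrases that share at least one (X, Y) pair with a seed."""
    contexts = collect_contexts(sentences)
    seed_pairs = set()
    for seed in seeds:
        seed_pairs |= contexts.get(seed, set())
    return sorted(
        phrase
        for phrase, pairs in contexts.items()
        if phrase not in seeds and pairs & seed_pairs
    )
```

For instance, with the seed "causes" and the sentences "stress causes activation" and "stress leads to activation", the phrase "leads to" is returned because it links the same pair of words as the seed.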
Semi-Automatic Axiom Creation
• Results for Causal Relations
   – 308 phrases were returned
   – 182 were syntactic variations of the original seed
   – Remaining 126 patterns could be collapsed into 100 pattern classes
   – Only 9 pattern classes of the 100 were good, and these had low frequency
   – 91 patterns were bad (e.g. “Y 's ambassador said X”, “X -- held northern Y”)
• Results for Aspectual Relations
   – 320 phrases were returned
   – These were classified into 106 pattern classes
   – 12 new patterns were discovered
   – 15 classes were variations of the seeds
   – 79 were bad (e.g. “X described below moved monday Y”)

 Semi-Automatic Axiom Creation:
    Very few axioms were new over the manual effort
Completeness

• Indicators of Completeness
   – Precision and recall with no bad parses (Precision: 100%, recall: 47.05%)
   – Graph of Axioms generated every 10 sentences
        • Manually created axioms incrementally 10 sentences at a time for 190 sentences
        • 38 axioms were created in the first 50 sentences
        • 7 axioms created in the last 50 sentences
• Semi-Automatic Axiom Creation
   – Very few axioms were new over the manual effort
Future Work



• Improvement of the Natural Language
  Pipeline
• Improvement of annotation guidelines,
  increase inter-annotator agreement
• Explore other classes of words conveying
  causality and temporal relation
Conclusions



• We can achieve high precision and recall for automatic
  recognition of causal and temporal relations
  (figures for sentences with no bad parses)
                             Precision   Recall
            Causal words      100%       90%
            Aspectual words   100%       16.7%

• A total of 66 axioms were created
• Studies suggest convergence, but we would like to continue to
  a point where we see convergence