Semantic Relation Detection in Bioscience Text

                 Marti Hearst
               SIMS, UC Berkeley
         http://biotext.berkeley.edu
  Supported by NSF DBI-0317510 and a gift from Genentech
BioText Project Goals
   Provide flexible, intelligent access to
    information for use in biosciences
    applications.
   Focus on
       Textual Information from Journal Articles
       Tightly integrated with other resources
            Ontologies
            Record-based databases
                          Project Team
   Project Leaders:
       PI: Marti Hearst
       Co-PI: Adam Arkin
   Computational Linguistics:
       Barbara Rosario
       Preslav Nakov
   User Interface / IR:
       Adam Newberger
       Dr. Emilia Stoica
   Database Research:
       Ariel Schwartz
       Gaurav Bhalotia (graduated)
   Bioscience:
       Dr. TingTing Zhang
       Janice Hamerja

   Supported primarily by NSF DBI-0317510 and a gift from Genentech
      BioText Architecture

   Sophisticated Text Analysis → Annotations in Database → Improved Search Interface
The Nature of Bioscience Text

   Claim:
   Bioscience semantics are simultaneously
   easier and harder than general text.

        Easier:                         Harder:
          Fewer subtleties                Enormous terminology
          Fewer ambiguities               Complex sentence structure
          “Systematic” meanings
Sample Sentence

 “Recent research, in proliferating cells, has
  demonstrated that interaction of E2F1 with
  the p53 pathway could involve
  transcriptional up-regulation of E2F1 target
  genes such as p14/p19ARF, which affect
  p53 accumulation [67,68], E2F1-induced
  phosphorylation of p53 [69], or direct
  E2F1-p53 complex formation [70].”
     BioScience Researchers

   Read A LOT!
   Cite A LOT!
   Curate A LOT!
   Are interested in specific relations, e.g.:
       What is the role of this protein in that
        pathway?
       Show me articles in which a comparison
        between two values is significant.
This Talk
   Discovering semantic relations
       Between nouns in noun compounds
       Between entities in sentences
   Acquiring labeled data:
       Idea: use text surrounding citations to
        documents to identify paraphrases
            A new direction; preliminary work only
  Noun Compound
Relation Recognition
    Noun Compounds (NCs)
   Technical text is rich with NCs

    Open-labeled long-term study of the subcutaneous
    sumatriptan efficacy and tolerability in acute migraine
    treatment.

   NC is any sequence of nouns that itself
    functions as a noun
       asthma hospitalizations
       health care personnel hand wash
NCs: 3 computational tasks
   Identification
   Syntactic analysis (attachments)
          [Baseline [headache frequency]]
          [[Tension headache] patient]

   Our Goal: Semantic analysis
          Headache treatment        → treatment for headache
          Corticosteroid treatment  → treatment that uses corticosteroid
Descent of Hierarchy

   Idea:
       Use the top levels of a lexical hierarchy to
        identify semantic relations
   Hypothesis:
       A particular semantic relation holds
        between all 2-word NCs that can be
        categorized by a lexical category pair.
     Related work (Semantic analysis of NCs)
   Rule-based
       Finin (1980)
            Detailed AI analysis, hand-coded
       Vanderwende (1994)
            Automatically extracts semantic information from an on-line
             dictionary, manipulates a set of handwritten rules.
             13 classes, 52% accuracy
   Probabilistic
       Lauer (1995)
            Probabilistic model, 8 classes, 47% accuracy
       Lapata (2000)
            Classifies nominalizations into subject/object.
             2 classes, 80% accuracy
Related work (Semantic analysis of NCs)
   Lexical Hierarchy
       Barrett et al. (2001)
            WordNet, heuristics to classify an NC given its similarity
             to a known NC
       Rosario and Hearst (2001)
            Relations pre-defined
            MeSH, Neural Network. 18 classes, 60% accuracy
 Linguistic Motivation
An NC can be cast into a head-modifier relation,
assuming the head noun has an argument
structure and qualia structure.

   (used-in): kitchen knife
   (made-of): steel knife
   (instrument-for): carving knife
   (used-on): putty knife
   (used-by): butcher's knife
The Lexical Hierarchy: MeSH
 1. Anatomy [A]
 2. Organisms [B]
 3. Diseases [C]
 4. Chemicals and Drugs [D]
 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
 6. Psychiatry and Psychology [F]
 7. Biological Sciences [G]
 8. Physical Sciences [H]
 9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
The Lexical Hierarchy: MeSH
 1. Anatomy [A]          Body Regions [A01]
                         Musculoskeletal System [A02]
                         Digestive System [A03]
                         Respiratory System [A04]
                         Urogenital System [A05]
                         ……

 Descending the Hierarchy
 1. Anatomy [A]   (homogeneous subhierarchy)
        Body Regions [A01]
             Abdomen [A01.047]
             Back [A01.176]
             Breast [A01.236]
             Extremities [A01.378]
             Head [A01.456]
             Neck [A01.598]
             ….
 8. Physical Sciences [H]   (heterogeneous subhierarchy)
        Electronics
             Amplifiers
             Electronics, Medical
             Transducers
        Astronomy
        Nature
        Time
        Weights and Measures
             Calibration
             Metric System
             Reference Standard
        ….
Mapping Nouns to MeSH Concepts

   headache              recurrence
    C23.888.592.612.441   C23.550.291.937


   headache              pain
    C23.888.592.612.441   G11.561.796.444
     Levels of Description

                 headache             pain

   Level 0:    C23                  G11
   Level 1:    C23.888              G11.561
   Level 2:    C23.888.592          G11.561.796
   …
   Original:   C23.888.592.612.441  G11.561.796.444

   (truncation sketched below)
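A minimal sketch of this truncation: a MeSH tree number is just
dot-separated fields, so cutting it at a level keeps the leading
fields (the function name is illustrative):

    def mesh_level(code: str, level: int) -> str:
        """Truncate a MeSH tree number to the given depth.
        Level 0 keeps only the top category (e.g. "C23"); each
        further level keeps one more dot-separated field."""
        return ".".join(code.split(".")[: level + 1])

    # The descriptor for "headache":
    print(mesh_level("C23.888.592.612.441", 0))  # C23
    print(mesh_level("C23.888.592.612.441", 1))  # C23.888
    print(mesh_level("C23.888.592.612.441", 2))  # C23.888.592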
Descent of Hierarchy

   Idea:
       Words falling in homogeneous MeSH
        subhierarchies behave “similarly” with
        respect to relation assignment
   Hypothesis:
       A particular semantic relation holds
        between all 2-word NCs that can be
         categorized by a MeSH category pair
Grouping the NCs
   CP: A02 C04 (Musculoskeletal System, Neoplasms)
       skull tumors, bone cysts, bone metastases, skull
        osteosarcoma…
   CP: C04 M01 (Neoplasms, Person)
       leukemia survivor, lymphoma patients, cancer physician,
        cancer nurses…
Distribution of
Category Pairs
        Collection
   ~70,000 NCs extracted from titles and abstracts of
    Medline
   2,627 CPs at level 0 (with at least 10 unique NCs)
       We analyzed
            250 CPs with Anatomy (A)
            21 CPs with Natural Science (H01)
            3 CPs with Neoplasm (C04)
       This represents 10% of total CPs and 20% of total NCs
Classification Method
   For each CP
       Divide its NCs into “training” and “testing” sets
       “Training”: inspect NCs by hand
            Start from level 0
            While the NCs are not all similar,
             descend one level of the hierarchy
            Stop when all NCs for that CP are similar
             (sketch below)
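A minimal sketch of the descent, assuming each NC is represented as a
(code1, code2, relation) triple of MeSH tree numbers plus a
hand-assigned relation label, and treating "all similar" as agreement
of those labels (in the talk this judgment is made by hand):

    from itertools import groupby

    def truncate(code, level):
        # Keep a MeSH tree number up to the given depth (level 0 = "C23").
        return ".".join(code.split(".")[: level + 1])

    def all_similar(ncs):
        # Stand-in for the manual inspection step: NCs count as
        # "similar" when their hand-assigned relation labels agree.
        return len({rel for _, _, rel in ncs}) <= 1

    def descend(ncs, level=0, max_level=3):
        # One classification decision per homogeneous group,
        # descending on the second noun's code one level at a time.
        if all_similar(ncs) or level >= max_level:
            return [(truncate(ncs[0][0], 0), truncate(ncs[0][1], level))]
        key = lambda nc: truncate(nc[1], level + 1)
        decisions = []
        for _, group in groupby(sorted(ncs, key=key), key=key):
            decisions += descend(list(group), level + 1, max_level)
        return decisions

    # Toy CP C04 M01 (the codes here are illustrative):
    ncs = [("C04.557", "M01.643", "afflicted-by"),
           ("C04.588", "M01.526", "treats")]
    print(descend(ncs))  # [('C04', 'M01.526'), ('C04', 'M01.643')]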
     Classification Decisions + Relations
   A02 C04  → Location of Disease
   B06 B06  → Kind of Plants
   C04 M01
      C04 M01.643  → Person afflicted by Disease
      C04 M01.526  → Person who treats Disease
   A01 H01
      A01 H01.770
      A01 H01.671
         A01 H01.671.538
         A01 H01.671.868
   A01 M01
      A01 M01.643  → Person afflicted by Disease
      A01 M01.526
      A01 M01.898
Classification Decision Levels
     Anatomy: 250 CPs
         187 (75%) remain first level
          56 (22%) descend one level
           7 (3%) descend two levels

     Natural Science (H01): 21 CPs
          1 ( 4%) remain first level
          8 (39%) descend one level
         12 (57%) descend two levels

     Neoplasms (C04) 3 CPs:
         3 (100%) descend one level
Evaluation
   Test the decisions on the “testing” set
   Count how many NCs that fall in the groups
    defined in the classification decisions are
    similar to each other
   Accuracy (for 2nd noun):
       Anatomy:         91%
       Natural Science: 79%
       Neoplasm:       100%
   Total Accuracy : 90.8%
   Generalization: our 415 classification
    decisions cover ~ 46,000 possible CP pairs
Ambiguity – Two Types

   Lexical ambiguity:
       mortality
           state of being mortal
           death rate
   Relationship ambiguity:
       bacteria mortality
           death of bacteria
           death caused by bacteria
                    Four Cases

                                   Single MeSH senses          Multiple MeSH senses

 Only one possible relationship:   abdomen radiography,        alcoholism treatment
                                   aciclovir treatment

 Multiple relationships:           hospital databases,         bacteria mortality
 (ambiguity of relationship)       education efforts,          (most problematic
                                   kidney metabolism            cases … but rare!)
Conclusions on
NN Relation Classification

   Very simple method for assigning semantic
    relations to two-word technical NCs
       90.8% accuracy
   Lexical resource (MeSH) useful for this task
   Probably works because of the relative lack of
    ambiguity in this kind of technical text.
   Entity-Entity
Relation Recognition
  Problem: Which relations
  hold between 2 entities?

      Treatment ↔ Disease:   Cure?   Prevent?   Side Effect?
        Hepatitis Examples

   Cure
       These results suggest that con A-induced hepatitis
        was ameliorated by pretreatment with TJ-135.
   Prevent
       A two-dose combined hepatitis A and B vaccine
        would facilitate immunization programs
   Vague
       Effect of interferon on hepatitis B
               Two tasks

   Relationship Extraction:
       Identify the several semantic relations that
        can occur between the entities disease and
        treatment in bioscience text


   Entity extraction:
       Related problem: identify such entities
            The Approach

   Data: MEDLINE abstracts and titles
   Graphical models
       Combine in one framework both relation
        and entity extraction
       Both static and dynamic models
   Simple discriminative approach:
       Neural network
   Lexical, syntactic and semantic features
Related Work
    We allow several DIFFERENT relations
     between the same entities
        Thus differs from the problem statement of other
         work on relations
    Many find one relation which holds between
     two entities (many based on ACE)
        Agichtein and Gravano (2000): lexical patterns for the
         location-of relation
        Zelenko et al. (2002): SVM for person-affiliation and
         organization-location
        Hasegawa et al. (ACL 2004): Person-Organization ->
         President “relation”
        Craven (1999, 2001): HMM for subcellular-location and
         disorder-association
             Doesn't identify the actual relation
Related work: Bioscience

   Many hand-built rules
       Feldman et al. (2002),
       Friedman et al. (2001)
       Pustejovsky et al. (2002)
       Saric et al. (this conference)
        Data and Relations
   MEDLINE, abstracts and titles
   3662 sentences labeled
       Relevant: 1724
       Irrelevant: 1771
            e.g., “Patients were followed up for 6 months”
   2 types of Entities, many instances
       treatment and disease
   7 Relationships between these entities
Semantic Relationships
   810: Cure
       Intravenous immune globulin for recurrent
        spontaneous abortion
   616: Only Disease
       Social ties and susceptibility to the common cold
   166: Only Treatment
       Fluticasone propionate is safe in recommended
        doses
   63: Prevent
       Statins for prevention of stroke
    Semantic Relationships
   36: Vague
       Phenylbutazone and leukemia
   29: Side Effect
       Malignant mesodermal mixed tumor of the
        uterus following irradiation
   4: Does NOT cure
       Evidence for double resistance to
        permethrin and malathion in head lice
                     Features
   Word
   Part of speech
   Phrase constituent
   Orthographic features
       'is number', 'all letters are capitalized', 'first letter
        is capitalized', … (sketch below)
   MeSH (semantic features)
       Replace words, or sequences of words, with
        generalizations via MeSH categories
            Peritoneum -> Abdomen
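A minimal sketch of the orthographic features; the exact feature
inventory in the system is not spelled out here, so these names and
tests are illustrative:

    def orthographic_features(token: str) -> dict:
        # A few surface features of the kind listed above.
        return {
            "is_number": token.replace(".", "", 1).isdigit(),
            "all_caps": token.isupper(),
            "init_cap": token[:1].isupper(),
            "has_hyphen": "-" in token,
        }

    print(orthographic_features("NGF"))
    # {'is_number': False, 'all_caps': True, 'init_cap': True, 'has_hyphen': False}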
               Models

   2 static generative models
   3 dynamic generative models
   1 discriminative model (neural network)
     Static Graphical Models
   S1: observations dependent on Role but
    independent of Relation given the roles
   S2: observations dependent on both
    Relation and Role

     (diagrams of models S1 and S2)
         Dynamic Graphical Models
   D1, D2 as in S1, S2
   D3: only one observation per state is
    dependent on both the relation and the role

     (diagrams of models D1, D2, and D3)
          Graphical Models
   Relation node:
       Semantic relation (cure, prevent, none, …)
        expressed in the sentence
   Role nodes:
       3 choices: treatment, disease, or none
   Feature nodes (observed):
       word, POS, MeSH, …
               Graphical Models

   For Dynamic Model D1:
       Joint probability distribution over the relation, role, and
        feature nodes:

        $P(Rela, Role_0, \ldots, Role_T, f_{10}, \ldots, f_{nT}) =
          P(Rela)\, P(Role_0 \mid Rela)
          \prod_{t=1}^{T} \Big[ P(Role_t \mid Role_{t-1}, Rela)
          \prod_{j=1}^{n} P(f_{jt} \mid Role_t) \Big]$

       Parameters estimated with maximum likelihood and
        absolute discounting smoothing (sketch below)
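A sketch of how this factorization can be evaluated in log space. The
conditional probability tables are passed in as plain dicts (the
argument names are illustrative); in the talk they are ML estimates
smoothed by absolute discounting:

    import math

    def log_joint_d1(rela, roles, feats, p_rela, p_role0, p_trans, p_feat):
        # roles[t]: role at position t; feats[t]: the n feature values there.
        logp = math.log(p_rela[rela]) + math.log(p_role0[(roles[0], rela)])
        for t in range(1, len(roles)):
            logp += math.log(p_trans[(roles[t], roles[t - 1], rela)])
        for t, fs in enumerate(feats):          # features at every position,
            for f in fs:                        # each conditioned on its role
                logp += math.log(p_feat[(f, roles[t])])
        return logp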
          Neural Network

   Feed-forward network (MATLAB)
       Training with conjugate gradient descent
       One hidden layer (hyperbolic tangent
        function)
       Logistic sigmoid function for the output
        layer representing the relationships
   Same features
   Discriminative approach (rough sketch below)
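As a rough stand-in for that MATLAB setup, scikit-learn's
MLPClassifier can approximate the configuration (the hidden-layer
size here is an arbitrary choice, and scikit-learn offers no
conjugate-gradient solver, so lbfgs is substituted):

    from sklearn.neural_network import MLPClassifier

    # One tanh hidden layer; the multi-class output layer plays the
    # role of the logistic output units described in the talk.
    clf = MLPClassifier(hidden_layer_sizes=(50,), activation="tanh",
                        solver="lbfgs", max_iter=500)
    # clf.fit(X_train, y_train)     # X: feature vectors, y: relation labels
    # y_pred = clf.predict(X_test)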
                 Role extraction
   Results in terms of F-measure
   Graphical models
       Junction tree algorithm (BNT)
       Relation hidden and marginalized over
   Neural Net
       Couldn't run it (feature vectors too large)

   (Graphical models can do role extraction and
    relationship classification simultaneously)
  Role Extraction: Results
F-measures
D1 best when no smoothing
 Role Extraction: Results
F-measures
D2 best with smoothing, but doesn't boost
  scores as much as in relation classification
     Role Extraction: Results
Static models better than dynamic models for role extraction

Note: No Neural Networks
       Relation classification:
               Results
With Smoothing and Roles, D1 best GM
Features impact: Role Extraction

       Most important features: 1) Word, 2) MeSH

       Model          D1               D2
       All features   0.67             0.71
       No Word        0.58 (-13.4%)    0.61 (-14.1%)
       No MeSH        0.63 (-5.9%)     0.65 (-8.4%)

                                       (rel. + irrel.)
                         Features impact:
                       Relation classification
    Most important features: Roles

    Accuracy                   D1              D2             NN
    All feat. + roles          91.6            82.0           96.9
    All feat. - roles          68.9 (-24.7%)   74.9 (-8.7%)   79.6 (-17.8%)
    All feat. + roles - Word   91.6 (0%)       79.8 (-2.8%)   96.4 (-0.5%)
    All feat. + roles - MeSH   91.6 (0%)       84.6 (+3.1%)   97.3 (+0.4%)

     (rel. + irrel.)
              Relation extraction
   Results in terms of classification accuracy (with
    and without irrelevant sentences)
   2 cases:
        Roles hidden
        Roles given
   Graphical models:

        $\hat{Rela} = \arg\max_{Rela_k} P(Rela_k, Role_0, \ldots, Role_T, f_{10}, \ldots, f_{nT})$



   NN: simple classification problem (decoding sketch below)
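A brute-force sketch of that argmax, reusing a log-joint function such
as the D1 sketch earlier (partially applied to its probability
tables). The real system marginalizes hidden roles with the junction
tree algorithm rather than enumerating role sequences:

    from itertools import product

    ROLE_VALUES = ["treatment", "disease", "none"]

    def classify_relation(feats, relations, log_joint):
        # argmax over Rela_k of the joint probability; when roles are
        # hidden, enumerate candidate role sequences (feasible only
        # for short sentences).
        scored = ((log_joint(rela, list(roles), feats), rela)
                  for rela in relations
                  for roles in product(ROLE_VALUES, repeat=len(feats)))
        return max(scored)[1]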
     Relation classification:
             Results
Neural Net always best
        Relation classification:
                Results
With Smoothing and No Roles, D2 best GM
        Relation classification:
                Results
Dynamic models always outperform Static
       Relation classification:
               Results
With no smoothing, D1 best Graphical Model
       Relation classification:
         Confusion Matrix




Computed for the model D2, “rel + irrel.”, “only features”
                          Features impact:
                        Relation classification
    Most realistic case: Roles not known
    Most important features: 1) MeSH, 2) Word for D1 and
     NN (but vice versa for D2)

    Accuracy                   D1             D2              NN
    All feat. - roles          68.9           74.9            79.6
    All feat. - roles - Word   66.7 (-3.3%)   66.1 (-11.8%)   76.2 (-4.3%)
    All feat. - roles - MeSH   62.7 (-9.1%)   72.5 (-3.2%)    74.1 (-6.9%)

      (rel. + irrel.)
             Relation Recognition:
                  Conclusions
   Classification of subtle semantic relations in
    bioscience text
       Discriminative model (neural network) achieves high
        classification accuracy
       Graphical models for the simultaneous extraction of
        entities and relationships
       Importance of lexical hierarchy

   Next Step:
       Different entities/relations
       Semi-supervised learning to discover relation types
Acquiring Labeled Data
    using Citances
A discovery is made …

     A paper is written …

          That paper is cited … and cited … and cited …

               … as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) …

     … until it is the case that all important
      facts in the field can be found in citation
      sentences alone!
    Citances
   Nearly every statement in a bioscience journal article is
    backed up with a cite.
   It is quite common for papers to be cited 30-100 times.
   The text around the citation tends to state biological
    facts. (Call these citances.)

   Different citances will state the same facts in different
    ways …
    … so can we use these for creating models of language
    expressing semantic relations?
    Using Citances
   Potential uses of citation sentences (citances)
       creation of training and testing data for semantic
        analysis,
       synonym set creation,
       database curation,
       document summarization,
       and information retrieval generally.

   Some preliminary results:
       Citances to a document align well with a hand-built
        curation.
       Citances are good candidates for paraphrase creation.
Citances for Acquiring
Examples of Semantic Relations
   A relationship type R between entities of type
    A and B can be expressed in many ways.
   Use citances to build a model of the different
    ways to express the relationship:

       Seed learning algorithms with examples that
        mention A and B, for which relation R holds.
       Train a model to recognize R when the relation is
        not known.
       Results may extend to sentences that are not
        citances as well.
Issues for Processing Citances
   Text span
       Identification of the appropriate phrase, clause, or
        sentence that constitutes the citance.
       Correct mapping of citations when shown as lists
        or groups (e.g., “[22-25]”).
   Grouping citances by topic
       Citances that cite the same document should be
        grouped by the facts they state.
   Normalizing or paraphrasing citances
       For IR, summarization, learning synonyms,
        relation extraction, question answering, and
        machine translation.
Related Work
   Traditional citation analysis dates back to the
    1960s (Garfield). Includes:
       Citation categorization,
       Context analysis,
       Citer motivation.
   Citation indexing systems, such as ISI's SCI,
    and CiteSeer.
       Mercer and Di Marco (2004) propose to improve
        citation indexing using citation types.
       Bradshaw (2003) introduces Reference Directed
        Indexing (RDI), which indexes documents using
        the terms in the citances citing them.
    Related Work (cont.)
   Teufel and Moens (2002) identify citances to
     improve summarization of the citing paper.
   Nanba et al. (2000) use citances as features
    for classifying papers into topics.
   Related field to citation indexing is the use of
    link structure and anchor text of Web pages.
       Applications include: IR, classification, Web
        crawlers, and summarization.
Example: protein-protein
Early results:
Paraphrase Creation from Citances
Sample Sentences
   NGF withdrawal from sympathetic neurons induces
    Bim, which then contributes to death.

   Nerve growth factor withdrawal induces the
    expression of Bim and mediates Bax dependent
    cytochrome c release and apoptosis.

   The proapoptotic Bcl-2 family member Bim is strongly
    induced in sympathetic neurons in response to NGF
    withdrawal.

   In neurons, the BH3 only Bcl2 member, Bim, and JNK
    are both implicated in apoptosis caused by nerve
    growth factor deprivation.
Their Paraphrases
   NGF withdrawal induces Bim.
   Nerve growth factor withdrawal induces the
    expression of Bim.
   Bim has been shown to be upregulated
    following nerve growth factor withdrawal.
   Bim implicated in apoptosis caused by nerve
    growth factor deprivation.

They all paraphrase:
     Bim is induced after NGF withdrawal.
Paraphrase Creation Algorithm
1. Extract the sentences that cite the target.
2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
3. Dependency parse (MiniPar).
4. For each parse, for each pair of NEs of interest
   (steps i-ii sketched below):
       i. Extract the path between them.
      ii. Create a paraphrase from the path.
5. Rank the candidates for a given pair of NEs.
6. Select only the ones above a threshold.
7. Generalize.
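A toy sketch of steps 3 and 4: the dependency edges below are
hand-made stand-ins for MiniPar output, and the path between the two
NEs is found with networkx:

    import networkx as nx

    # Toy dependency parse of "NGF withdrawal induces Bim."
    # Nodes are (position, word) pairs so word order can be restored.
    edges = [((2, "induces"), (1, "withdrawal")),   # subject
             ((1, "withdrawal"), (0, "NGF")),       # noun modifier
             ((2, "induces"), (3, "Bim"))]          # object
    g = nx.Graph(edges)

    def path_paraphrase(graph, ne1, ne2):
        # Step 4i: extract the dependency path between the NEs;
        # step 4ii: restore surface word order to form a candidate.
        path = nx.shortest_path(graph, ne1, ne2)
        return " ".join(word for _, word in sorted(path))

    print(path_paraphrase(g, (0, "NGF"), (3, "Bim")))
    # -> NGF withdrawal induces Bim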
Creating
a Paraphrase
  Given the path from the dependency parse:
    Restore the original word order.
    Add words to improve grammaticality.

      •   Bim … shown … be … following nerve growth
          factor withdrawal.

      •   Bim [has] [been] shown [to] be [upregulated]
          following nerve growth factor withdrawal.
2-word Heuristic
Demonstration

   NGF withdrawal induces Bim.
   Nerve growth factor withdrawal induces [the]
    expression of Bim.
   Bim [has] [been] shown [to] be
    [upregulated] following nerve growth factor
    withdrawal.
   Bim [is] induced in [sympathetic] neurons in
    response to NGF withdrawal.
   member Bim implicated in apoptosis caused
    by nerve growth factor deprivation.
    Evaluation (1)
   An influential journal paper from Neuron:
       J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham.
        Dominant-negative c-jun promotes neuronal survival by
        reducing bim expression and inhibiting mitochondrial
        cytochrome c release. Neuron, 29:629–643, 2001.
   99 journal papers citing it
   203 citances in total
   36 different types of important biological
    factoids
       But we concentrated on one model sentence:
         “Bim is induced after NGF withdrawal.”
    Evaluation (2)
   Set 1: 67 citances pointing to the target paper
    and manually found to contain a good or
    acceptable paraphrase (do not necessarily
    contain Bim or NGF);
       (Ideal conditions)

   Set 2: 65 citances pointing to the target paper
    and containing both Bim and NGF;
   Set 3: 102 sentences from the 99 texts,
    containing both Bim and NGF
       (Do citances do better than arbitrarily chosen
        sentences?)
Correctness (Judgments)
   Bad (0.0), if:
       different relation (often phosphorylation aspect);
       opposite meaning;
       vagueness (wording not clear enough).
   Acceptable (0.5), if it was not Bad and:
       contains additional terms (e.g., DP5 protein) or
        topics (e.g., PPs like in sympathetic neurons);
       the relation was suggested but not stated definitively.
   Else Good (1.0)
Results
   Obtained 55, 65 and 102 paraphrases for sets
    1, 2 and 3
   Only one paraphrase from each sentence
      (chosen by comparing its dependency path to that of the
       model sentence)

   (chart: % good (1.0) or acceptable (0.5) for each set)
Correctness (Recall)
   Calculated on Set 1
   60 paraphrases (out of 67 citances)
   5 citances produced 2 paraphrases
   system recall: 55/67, i.e. 82.09%
   10 of the 67 relevant citances in Set 1 were
    initially missed by the human annotator
       8 good,
       2 acceptable.
   human recall is 57/67, i.e. 85.07%
    Misses
   Sample system miss (no NGF):
       Growth factor withdrawal was shown to cause
        increased Bim expression in various populations of
        neuronal cell types.
   Sample human miss:
       The precise targets of c-Jun necessary for the
        induction of apoptosis have been the subject of
        intense interest and recently, Bim and Dp5, both
        “BH3-domain only” family members, have been
        identified as pro-apoptotic genes induced in a c-
        Jun-dependent manner in both sympathetic
        neurons subjected to NGF withdrawal and in
        cerebellar granule cells deprived of KCl.
Grammaticality
   Missing coordinating “and”:
       “Hrk/DP5 Bim [have] [been] found [to] be
        upregulated after NGF withdrawal”
   Verb subcategorization
       “caused by NGF role for Bim”
   Extra subject words
       member Bim implicated in apoptosis caused by
        NGF deprivation
       sentence: “In neurons, the BH3-only Bcl2
        member, Bim, and JNK are both implicated in
        apoptosis caused by NGF deprivation.”
Related Work
   Word-level paraphrases. Grefenstette uses a
    semantic parser to compare the distributional
    similarity of local contexts for synonym extraction.
   Phrase-level paraphrases. Barzilay & McKeown use
    POS information from the local context and co-training.
   Template paraphrases. Lin & Pantel apply the idea
    of Grefenstette to dependency tree paths. Later
    refined by Shinyama et al.
   Sentence-level paraphrases. Barzilay & Lee use
    multiple sequence alignment. Pang et al. merge parse
    trees into a transducer.
    Relevant Papers
   Citances: Citation Sentences for Semantic Analysis of
    Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti
    Hearst, in the SIGIR'04 workshop on Search and Discovery in
    Bioinformatics.

   Classifying Semantic Relations in Bioscience Text,
    Barbara Rosario and Marti Hearst, in ACL 2004.

   The Descent of Hierarchy, and Selection in Relational
    Semantics, Barbara Rosario, Marti Hearst, and Charles
    Fillmore, in ACL 2002.
     Thank you!


      Marti Hearst
    SIMS, UC Berkeley
http://biotext.berkeley.edu
Additional slides
       Thompson et al. 2003
    Frame classification and role labeling for FrameNet sentences
   Target word must be observed
   More relations and roles

     (diagram: comparison with our D1)
        Smoothing: absolute discounting
   Lower the probability of seen events by
    subtracting a constant $\delta$ from their count;
    the ML estimate is $P_{ML}(e) = c(e) \big/ \sum_{e'} c(e')$
   The remaining probability mass is evenly
    divided among the unseen events:

        $P_{ad}(e) = \big(c(e) - \delta\big) \big/ \sum_{e'} c(e')$   if $P_{ML}(e) > 0$
        $P_{ad}(e) = \delta \cdot |\text{seen}| \big/ \big( \sum_{e'} c(e') \cdot |\text{unseen}| \big)$   if $P_{ML}(e) = 0$

    (sketch below)
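A minimal sketch of this smoothing scheme (the value of delta and the
event inventory are illustrative):

    def absolute_discounting(counts, vocab, delta=0.5):
        # Subtract delta from every seen count and share the freed
        # mass evenly among the unseen events.
        n = sum(counts.values())
        seen = {e for e, c in counts.items() if c > 0}
        unseen = [e for e in vocab if e not in seen]
        probs = {e: (counts[e] - delta) / n for e in seen}
        freed = delta * len(seen) / n
        for e in unseen:
            probs[e] = freed / len(unseen)
        return probs

    p = absolute_discounting({"a": 3, "b": 1}, vocab=["a", "b", "c"])
    # p == {'a': 0.625, 'b': 0.125, 'c': 0.25}; probabilities sum to 1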
F-measures for role extraction as a function of the smoothing factor
Relation accuracies as a function of the smoothing factor