        Information Extraction in Biology
               Junichi Tsujii
              GENIA Project
(http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
             Computer Science
            University of Tokyo
   Overview of GENIA Project
① A researcher with a question
② query
③ GENIA
   • Information Extraction: Pre-processing, Named entity, Template element, Scenario template
   • Supporting components: Learning, Terminology, Databases, Corpora, Ontology, Thesaurus, WWW Links, Information Retrieval
④ information extracted
⑤ answer to the question
   Overview of GENIA System
Corpus Module
   • Markup generation / compilation
   • Annotated corpus construction
   Corpus: Raw (OCR), Text Structure, Annotated
Information Extraction Module
   • Identify & classify terms
   • Identify events
Retrieval Module (MEDLINE)
   • Request enhancement
   • Spawn request
   • Classify documents
Interface Module
   • GUI
   • HTML conversion
   • System integration
   User: IR Request, Abstract, Full Paper
Database Module
   • DB design / access / management
   • DB construction
   Database: Document, Named-Entity, Event, Security
Concept Module
   • BK design / construction / compilation
   Background Knowledge: Ontology, Markup language, Data model
                  Objectives
• What should be extracted ?
   – Ontology for Fact Data bases and Ontology for NLP
   – Linking texts with Fact Data Bases
• Information extraction from texts
   – Named Entity Recognition
   – Event Recognition
• Resource building
   – Knowledge of the Domain
   – Representation Language: Lattice-based Types
Target Definition of Information Extraction

 Examples of Existing Data Base

   Lack of Explicit Ontology
   Flat, non-structured collection of Data

Ontology for Data Bases and Knowledge Bases
                 CSNDB
 (National Institute of Health Sciences)

• A data- and knowledge-base for signaling
  pathways of human cells.
   – It compiles information on biological molecules,
     sequences, structures, functions, and the biological
     reactions that transfer cellular signals.
   – Signaling pathways are compiled as binary
     relationships between biomolecules and represented by
     automatically drawn graphs.
   – CSNDB is built on ACEDB and the inference
     engine CLIPS, and is linked to TRANSFAC.
   – The final goal is to build a computerized model of various
     biological phenomena.
                     Example. 1

• A Standard Reaction             Excerpted @[Takai98]


Signal_Reaction:
    “EGF receptor → Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction
    “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]
                        Example. 3
• A Polymerization Reaction          Excerpted @[Takai98]

Signal_Reaction:
    “Ah receptor + HSP90 ”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction
    “PAS domain of Ah receptor”
Activity
    “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
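
The two entries above share one record shape: a reaction name followed by field/value lines. A minimal Python sketch, assuming a simplified in-memory form of those entries, shows how such records reduce to the binary relationships that CSNDB draws as graphs. The dictionary keys mirror the field names on the slides; `to_edges` is a hypothetical helper and not part of CSNDB, and linking every pair of components in a multi-component entry is an assumption made here for illustration.

```python
# Simplified stand-ins for the two CSNDB entries shown on the slides.
reactions = [
    {"From_molecule": "EGF receptor", "To_molecule": "Grb2",
     "Effect": "activation", "Reference": "Yamauchi_1997"},
    {"Component": ["Ah receptor", "HSP90"],
     "Effect": "activation dissociation", "Reference": "Powell-Coffman_1998"},
]


def to_edges(reaction):
    """Return (source, target, effect) triples for one reaction record."""
    if "From_molecule" in reaction:
        return [(reaction["From_molecule"], reaction["To_molecule"], reaction["Effect"])]
    # Assumption: a multi-component entry links each pair of its components.
    parts = reaction.get("Component", [])
    return [(a, b, reaction["Effect"])
            for i, a in enumerate(parts) for b in parts[i + 1:]]


edges = [edge for r in reactions for edge in to_edges(r)]
```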
       Characteristics of Signal Pathway (1)

• Granularity of Knowledge Units
   – Different types of entities which are interrelated
     with each other
      • Cells, sub-locations of cells
      • Proteins, substructures of proteins, subclasses of proteins
      • Ions, other chemical substances
      • Genes, RNA, DNA

[Figure: G-protein coupled receptor pathway model, from TRANSPATH]
      Characteristics of Signal Pathways (2)

• Incomplete Knowledge of Interactions
• Interpretations depend on background knowledge and contexts

http://www.mips.biochem.mpg.de/proj/yeast/pathways/pheromone.html
               Structured Representation
    Fukuda (CBRC, AIST) and Takagi (IMS, University of Tokyo)
    CBRC:Computational Biology Research Center
    AIST: National Institute of Advanced Industrial Science and Technology


• Compound Graph
   – interaction graph G=(V,EG)
   – decomposition tree T=(V,ET)

[Figure: the same nodes A to J shown as a compound graph, as an interaction graph, and as a decomposition tree under a root]
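
A compound graph pairs an interaction graph G=(V,EG) with a decomposition tree T=(V,ET) over the same node set. The sketch below is one possible Python representation, not the data model used by Fukuda and Takagi. The class name, method names, and the specific edges added at the end are hypothetical, because the edges in the slide figure cannot be recovered from the text.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundGraph:
    """Interaction graph G=(V, EG) plus decomposition tree T=(V, ET) on shared nodes."""
    nodes: set = field(default_factory=set)
    interactions: set = field(default_factory=set)  # EG: undirected node pairs
    parent: dict = field(default_factory=dict)      # ET: child -> parent

    def add_child(self, parent, child):
        """Record that `child` is nested inside `parent` in the decomposition tree."""
        self.nodes.update((parent, child))
        self.parent[child] = parent

    def add_interaction(self, a, b):
        """Record an undirected interaction edge between two nodes."""
        self.nodes.update((a, b))
        self.interactions.add(frozenset((a, b)))

    def ancestors(self, node):
        """Nodes on the path from `node` up to the root of the decomposition tree."""
        path = []
        while node in self.parent:
            node = self.parent[node]
            path.append(node)
        return path


g = CompoundGraph()
g.add_child("root", "H")     # hypothetical nesting, not read off the figure
g.add_child("H", "B")
g.add_child("B", "A")
g.add_interaction("A", "C")  # hypothetical interaction edge
```

Keeping the two edge sets separate lets a query ask either "what does A interact with" or "which complex contains A" without mixing the two relations.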
Compound Graph for Pheromone Signal Transduction Pathway
[Figure labels: G-protein coupled receptor complex (ligand, GPCR, G protein subunits Alpha GPA1, Beta STE4, Gamma STE18); STE5 MAPK scaffold structure with non-phosphorylated and phosphorylated forms of STE11 (STE11/MKKK), STE7 (STE7/MKK), FUS3 (MAPK) and KSS1 (MAPK); CDC42 (GTP and GDP forms), STE24, BEM1, STE20/MKKKK; outputs: Cell polarization, Transcription of Mating-specific genes (STE12), Cell cycle arrest (FAR1)]
     Signal Ontology, GEST / Signal XML, Compound Graph

Signal Ontology (Controlled Vocabulary)
• Molecular Function
   – Receptor
      • G-protein coupled receptor
      • Receptor S/T kinase
      • …..
   – Enzyme
   – …….
• Cellular Function
   – Stress Response
      • Heat shock response
      • Oxidative stress response

GEST / Signal XML (Schema Definition)
   <interaction edge _id='8'>
     <interact_edge_type>covalent_modification</interact_edge_type>
     <reac_ontl_ID>S/T phosphorylation</reac_ontl_ID>
     <source>G-kinase</source>
     <target>Ras</target>
     <experiment_db_ID>3324</experiment_db_ID>
     <graphics>
      <arrow> …

Compound Graph (Inference)
   [Figure: compound graph over nodes A to J]

Template of the Entries
   Gene/Gene Product, Signal DB, XML Database, Controlled Presentation

Signal Diagram
   [Figure labels: ANP, ANP receptor, guanylyl cyclase, exterior, membrane, cytosol, GTP, cGMP, G-kinase, Ras, PS/T]
              SIGNAL-ONTOLOGY
               ontology for cell signaling
 SIGNAL MODULE a unit of signal processing common to the
  model species
 MOLECULAR FUNCTION biochemical properties of a molecule
 CELLULAR FUNCTION a biological response performed by a set
  of molecules
 REACTION biochemical properties of a signaling reaction
 MOLECULE, TISSUE, CELL, SPECIES general in genome ontologies
 ~500 Terms
 Terms are linked to Gene Ontology
                Interface Representation

[Figure labels: Texts; Data and Knowledge bases; Linguistic interface; Knowledge Interface; from/Extracted; Defined by; Thesaurus; Ontology for Knowledge]
                  Objectives
• What should be extracted ?
   – Ontology for Fact Data bases and Ontology for NLP
   – Linking texts with Fact Data Bases
• Information extraction from texts
   – Named Entity Recognition
   – Event Recognition
• Resource building
   – Knowledge of the Domain
   – Representation Language: Lattice-based Types
        Difficulties in IE in Biology

From the linguistic processing point of view
  (1)Problem: Syntactic Variations
              ACTIVATOR activate ACTIVATEE


RAF6 activates NF-kappaB.
Lck is activated by autophosphorylation at Tyr 394.
Anandamide induces vasodilation by activating vanilloid
  receptors.
the activation of Rap1 by C3G
the GTPase-activating protein rhoGAP
the stress-activated group of MAP kinases
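
The variations above all express the relation ACTIVATOR activate ACTIVATEE. A rough Python sketch of the surface-pattern approach follows. It assumes that entity names start with a capital letter, which is a crude stand-in for a real term recognizer; the pattern list and the `extract` helper are hypothetical and cover only three of the six forms.

```python
import re

NAME = r"[A-Z][\w-]*"  # assumption: a name starts with a capital letter

PATTERNS = [
    # "the activation of Y by X" (nominalization)
    (re.compile(rf"activation of (?P<target>{NAME}) by (?P<actor>{NAME})"), "nominal"),
    # "Y is activated by X" (passive)
    (re.compile(rf"(?P<target>{NAME}) is activated by (?P<actor>{NAME})"), "passive"),
    # "X activates Y" (active)
    (re.compile(rf"(?P<actor>{NAME}) activates (?P<target>{NAME})"), "active"),
]


def extract(sentence):
    """Return the first matching activation relation, or None if no pattern fits."""
    for pattern, form in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return {"activator": match.group("actor"),
                    "activatee": match.group("target"),
                    "form": form}
    return None
```

With these patterns, "Lck is activated by autophosphorylation at Tyr 394." yields no relation, because the activator is a process rather than a capitalized name. Each new variation needs another hand-written pattern, which is the difficulty this slide points at.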
 (2)Embedded Relations between Events

An active phorbol ester must therefore, presumably
by activation of protein kinase C, cause dissociation
of a cytoplasmic complex of NF-kappa B and I kappa B
by modifying I kappa B.


E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B
   and I kappa B.
                                    Part-Whole
             (3) Uncertain and Negative Information
Example ①
      Activation_Event:
               EID:=95080245.2
               Protein_1:=“IFN alpha”
                        Domain_1:= φ
               Protein_2:=“STAT1”,“STAT2”,“STAT3”
                        Domain_2:= φ
               Location:=“T cells”
               Definiteness:=definite
               Finding:=new
               Reaction_Type:= φ
               Reaction_Path:=direct
               Mode:=affirmative
               SynSet:=<“T cells”,“human peripheral blood-derived T cells”>


      Example: “IFN alpha activated STAT1,STAT2 and STAT3 in T cells, but
      no detectable activation of these STATs was induced by IL-2.”
              (3) Uncertain and Negative Information
Example ②
      Activation_Event:
               EID:=95080245.4
               Protein_1:=“IL-2”
                        Domain_1:= φ
               Protein_2:=“these STATs”
                        Domain_2:= φ
               Location:=“T cells”
               Definiteness:=tentative
               Finding:=new
               Reaction_Type:= φ
               Reaction_Path:=direct
               Mode:=negative
               SynSet:=<“T cells”,“human peripheral blood-derived T cells”>,
                       <“these STATs”,“STAT1,STAT2,STAT3”>


    Example: “IFN alpha activated STAT1,STAT2 and STAT3 in T cells, but
    no detectable activation of these STATs was induced by IL-2.”
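
The two frames above differ mainly in their Definiteness and Mode slots. As an illustration only, the sketch below models the frame as a Python dataclass and filters out events that are negated or tentative. The class and function names are hypothetical, only a subset of the slots is modelled, and the empty slots are represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ActivationEvent:
    eid: str
    protein_1: Tuple[str, ...]
    protein_2: Tuple[str, ...]
    location: str
    definiteness: str                    # "definite" or "tentative"
    finding: str
    reaction_path: str
    mode: str                            # "affirmative" or "negative"
    reaction_type: Optional[str] = None  # unfilled slot


events = [
    ActivationEvent("95080245.2", ("IFN alpha",), ("STAT1", "STAT2", "STAT3"),
                    "T cells", "definite", "new", "direct", "affirmative"),
    ActivationEvent("95080245.4", ("IL-2",), ("these STATs",),
                    "T cells", "tentative", "new", "direct", "negative"),
]


def asserted_facts(frames):
    """Keep only events stated affirmatively and definitely."""
    return [e for e in frames
            if e.mode == "affirmative" and e.definiteness == "definite"]
```

Storing the negative event rather than dropping it keeps the information that IL-2 was tested and found not to activate these STATs.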
      IE System Using a Full Parser
 • Extraction of argument structures by applying a
   domain-independent (HPSG) parser plus a small
   number of domain-dependent patterns on the
   structures
 Document → HPSG Parser → Pattern on Argument Structure → Information
          Result of Experiments
       Argument Frame Extractor
(PSB 2001, A. Yakushiji et al.: GENIA Project)
   133 argument structures, marked by a domain specialist
      in 97 sentences among the 180 sentences

   Outcome                                           Count
   Extracted uniquely                                   31
   Extracted with ambiguity                             32      68%
   Parsing failures: extractable from pp’s              26
   Parsing failures: not extractable                    27
   Parsing failures: memory limitation, etc.            17
 Actual System Configuration
 document → A) Term recognizer, B) Shallow parser → Lexical Entry Generator → Lexical Entries (for parser)
   a. Chunking of domain-dependent terms
   b. Reducing the number of lexical entries from
      information given by the shallow parser

   STRING : “proteolytic enzymes”
   LEX_WORD : “ENZYMEs”
   INFOS : POS : “N”
           TEMPLATES : [ “3pl” ]
Difficulties in NE
Task difficulties in molecular-biology

•Inconsistent naming conventions
     e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2

•Wide-spread synonymy
     Many synonyms in wide usage, e.g. PKB and Akt

•Open, growing vocabulary for many classes

•Cross-over of names between classes depending on context

•Frequent uses of coordination inside term formations
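
The first difficulty, inconsistent naming, can be partly reduced by mapping spelling variants to one key. The following is a minimal sketch, assuming a tiny hand-written table of long forms (`LONG_FORMS` is hypothetical; a real system would need a curated lexicon). It does nothing for true synonyms such as PKB and Akt, which require a thesaurus.

```python
import re

# Assumption: a hand-written expansion table, for illustration only.
LONG_FORMS = {"interleukin": "il"}


def normalize(name):
    """Map spelling variants of a name to a single lowercase key."""
    key = name.lower()
    for long_form, short in LONG_FORMS.items():
        key = key.replace(long_form, short)
    # Drop whitespace and hyphens so "IL-2", "IL 2" and "IL2" coincide.
    return re.sub(r"[\s-]+", "", key)


variants = ["IL-2", "IL2", "Interleukin 2", "Interleukin-2", "Il-2"]
keys = {normalize(v) for v in variants}
```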
                   Coordination in term formations

In this report we present evidence that the cell line NK3.3 derived from human NK cells,
responds to both IL-2 and IL-12, as measured by increases in IFN-gamma and
granulocyte-macrophage colony stimulating factor (GM-CSF) cytoplasmic mRNA
and protein expression.

IFN-gamma cytoplasmic mRNA expression
Granulocyte-macrophage colony stimulating factor (GM-CSF) cytoplasmic mRNA expression

IFN-gamma protein expression
Granulocyte-macrophage colony stimulating factor (GM-CSF) protein expression
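
The four terms listed above are the cross product of the two coordinated modifiers and the two coordinated heads. A short Python sketch of that expansion is given below. It assumes the coordinated parts have already been identified, which is the hard part in practice; `expand` is a hypothetical helper.

```python
from itertools import product


def expand(modifiers, heads, tail):
    """Combine every coordinated modifier with every coordinated head."""
    return [f"{m} {h} {tail}" for m, h in product(modifiers, heads)]


terms = expand(
    ["IFN-gamma", "granulocyte-macrophage colony stimulating factor (GM-CSF)"],
    ["cytoplasmic mRNA", "protein"],
    "expression",
)
```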
        Task difficulties in molecular-biology

•Differences between traditional NE and term extraction
     •Orthographic form

           e.g. a mixture of upper-case letters and numerals is a

           strong indication of a protein or DNA entity.

     •Terms have internal structure:

           “IL-2” is a protein

           “IL-2 receptor alpha chain promoter ” is a DNA.

     •Term meaning is more contextually dependent:

           “IL-2” is a protein, but in some contexts,

           “..IL-2 promoter elements <DNA>IL-2A</DNA>”..

           Role names like “Ah Receptor” can be used as protein names.
              Model’s intuition

[Figure: an HMM whose class states (protein, DNA, Source.ct, UNK) lie between Start of sentence and End of sentence]

  Example:

  Activation of JAK kinases and STAT proteins in human T lymphocytes .

  UNK UNK PROTEIN PROTEIN UNK PROTEIN PROTEIN UNK SOURCE.ct SOURCE.ct SOURCE.ct UNK
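
The tagger assigns one class per token, so named entities have to be read off as runs of identical tags. The sketch below, using the example sentence and tag sequence above, shows one simple way to do that in Python. The `entities` helper is hypothetical, and it would wrongly merge two adjacent entities of the same class.

```python
from itertools import groupby

tokens = "Activation of JAK kinases and STAT proteins in human T lymphocytes .".split()
tags = ("UNK UNK PROTEIN PROTEIN UNK PROTEIN PROTEIN UNK "
        "SOURCE.ct SOURCE.ct SOURCE.ct UNK").split()


def entities(tokens, tags):
    """Merge runs of identical non-UNK tags into (class, text) pairs."""
    result = []
    for tag, run in groupby(zip(tags, tokens), key=lambda pair: pair[0]):
        if tag != "UNK":
            result.append((tag, " ".join(tok for _, tok in run)))
    return result


found = entities(tokens, tags)
```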

  Underlying process:
         Interpolating HMM model specification

Character features:
                      Code   Feature               Example
                      dig    DigitNumber           15
                      sin    SingleCapital         M
                      grk    GreekLetter           alpha
                      cad    CapsAndDigits         I2
                      cap    AtLeastTwoCaps        RalGDS
                      lad    LettersAndDigits      il2
                      fst    FirstWord             (first word in sentence)
                      ini    InitCap               Interleukin
                      lcp    LowerCaps             kappaB
                      low    LowerCase             kinases
                      hyp    Hyphen                -
                      opp    OpenParentheses       (
                      clp    CloseParentheses      )
                      fsp    FullStop              .
                      cma    Comma                 ,
                      pct    Percent               %
                      osq    OpenSquareBrackets    [
                      csq    CloseSquareBrackets   ]
                      cln    Colon                 :
                      scn    SemiColon             ;
                      det    Determiner            the
                      con    Conjunction           and
                      oth    Other                 *,+,#,@
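
The character feature codes can be approximated with a few string tests. The following Python sketch reproduces the example column of the table. The precedence of the tests, the partial Greek-letter list, and the single-word determiner and conjunction checks are assumptions made here; the slides do not specify how the original system resolves overlaps.

```python
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}  # assumed, partial list
PUNCT = {"-": "hyp", "(": "opp", ")": "clp", ".": "fsp", ",": "cma",
         "%": "pct", "[": "osq", "]": "csq", ":": "cln", ";": "scn"}


def char_feature(token, first_in_sentence=False):
    """Return a character feature code for one token (assumed precedence)."""
    if not token:
        return "oth"
    if first_in_sentence:
        return "fst"
    if token in PUNCT:
        return PUNCT[token]
    if token == "the":
        return "det"
    if token == "and":
        return "con"
    if token.lower() in GREEK:
        return "grk"
    if token.isdigit():
        return "dig"
    uppers = sum(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if len(token) == 1 and uppers == 1:
        return "sin"
    if uppers and has_digit and all(c.isupper() or c.isdigit() for c in token):
        return "cad"
    if uppers >= 2:
        return "cap"
    if has_alpha and has_digit:
        return "lad"
    if token[0].isupper() and token[1:].islower():
        return "ini"
    if token[0].islower() and uppers:
        return "lcp"
    if token.isalpha() and token.islower():
        return "low"
    return "oth"
```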
         Overcoming data sparseness with interpolation

[Figure: four copies of the model, labelled λ0, λ1, λ2 and λ4, each with class states between Start of sentence and End of sentence, are added together (+)]
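
Interpolation counters data sparseness by mixing a specific but sparsely observed estimate with more general back-off estimates. The sketch below shows generic linear interpolation in Python. It is not the exact model of the GENIA HMM, whose sub-models and weights are not given on the slide, and the numbers are hypothetical.

```python
def interpolate(estimates, weights):
    """Weighted sum of sub-model probabilities; the weights must sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))


# Hypothetical numbers: the most specific sub-model has seen no data (0.0),
# so the more general back-off estimates supply the probability mass.
p = interpolate([0.0, 0.02, 0.10], [0.5, 0.3, 0.2])
```

Because the weights sum to 1, the result stays a valid probability whenever each sub-model estimate is one.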
        Results for HMM
        (COLING 2000, N. Collier et al.: GENIA Project)
             Class              #         Base     Base (no features)
             Protein            2125      0.759    0.670 (-11.7%)
             DNA                358       0.472    0.376 (-20.3%)
             RNA                30        0.025    0.000 (-100.0%)
             Source (all)       799       0.685    0.697 (+1.8%)

             Source.cl          93        0.478    0.503    (+5.2%)
             Source.ct          417       0.708    0.752    (+6.2%)
             Source.mo          21        0.200    0.311    (+55.5%)
             Source.mu          64        0.396    0.402    (+1.5%)
             Source.vi          90        0.676    0.713    (+5.5%)
             Source.sl          77        0.540    0.549    (+1.7%)
             Source.ti          37        0.206    0.216    (+4.9%)

             All classes        3312      0.728    0.651    (-10.6%)

             Table 1: F-score values for 5-fold cross-validation


F-score = (2 x Precision x Recall) / (Precision + Recall)
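
The F-score formula above translates directly into code. A small Python version follows; returning 0.0 when both precision and recall are zero is a convention assumed here to avoid division by zero, and is not stated on the slide.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)
```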
        Results for Decision tree
        (NLPRS 1999, C. Nobata et al.: GENIA Project)

   Class      NE task    Classification only    Identification only
   All        69.07      89.56                  64.56
   SOURCE     60.10      87.27                  -
   PROTEIN    73.16      92.26                  -
   DNA        42.65      77.03                  -
   RNA        21.62      63.64                  -

             Table 1: F-score values for 5-fold cross-validation

F-score = (2 x Precision x Recall) / (Precision + Recall)
                  Objectives
• What should be extracted ?
   – Ontology for Fact Data bases and Ontology for NLP
   – Linking texts with Fact Data Bases
• Information extraction from texts
   – Named Entity Recognition
   – Event Recognition
• Resource building
   – Knowledge of the Domain
   – Representation Language: Lattice-based Types
         Resource Building

            Annotated Corpus
            Linguistic ontology

(http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
               GENIA ontology
            (Ontology of Substances)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |           +-mono-cell organism
       |        |         |           +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       |
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                         |       |         +-protein complex
                                         |       |         +-individual protein molecule
                                         |       |         +-subunit of protein complex
                                         |       |         +-substructure of protein
                                         |       |         +-domain or region of protein
                                         |       +-peptide
                                         |       +-amino acid monomer
                                         |
                                         +-nucleic-+-DNA-+-DNA family or group
                                                   |     +-individual DNA molecule
                                                   |     +-domain or region of DNA
                                                   |
                                                   +-RNA-+-RNA family or group
                                                         +-individual RNA molecule
                                                         +-domain or region of RNA
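
The ontology above is a tree, so class membership reduces to finding a path from the root. A minimal Python sketch over part of the tree follows, with the leaf subclasses omitted for brevity. The nested-dictionary encoding and the helpers `path_to` and `is_a` are assumptions for illustration, not the GENIA project's representation language.

```python
# Part of the GENIA substance ontology, as nested dictionaries.
ONTOLOGY = {
    "name": {
        "source": {
            "natural": {"organism": {}, "tissue": {}, "cell type": {},
                        "sub-location of cells": {}},
            "artificial": {"cell line": {}},
        },
        "substance": {"compound": {"organic": {
            "amino": {"protein": {}, "peptide": {}, "amino acid monomer": {}},
            "nucleic": {"DNA": {}, "RNA": {}},
        }}},
    }
}


def path_to(term, tree=ONTOLOGY, prefix=()):
    """Return the labels from the root down to `term`, or None if absent."""
    for label, children in tree.items():
        here = prefix + (label,)
        if label == term:
            return here
        found = path_to(term, children, here)
        if found:
            return found
    return None


def is_a(term, ancestor):
    """True if `ancestor` lies on the path from the root to `term`."""
    path = path_to(term)
    return path is not None and ancestor in path
```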
  Extension of Substance Ontology
• Many terms in MEDLINE abstracts that carry
  biological knowledge are not in the substance ontology.
  We need to extend the ontology to cover a broader
  range of terms.
• [Method] Language-based ontology building: finding
  frequent verbs and examining what classes of
  arguments they take
    Expansion of GENIA Ontology
• Chemical classes of substances and their substructures
• Sources
• Reaction
    – Biological reaction
    – Pathway
    – Disease
•   Structures themselves
•   Experiments, experimental results, and researchers
•   Measure
•   Biological role, or function, of substances
Example of Entities in Expanded Ontology

• Biological role, or function of substances
   – receptor, inhibitor, …
• Biological reaction
   – activation, binding, inhibition, apoptosis, G2 arrest
   – pathway, signal
   – immune dysfunction, Ataxia telangiectasia (AT)
• Structures themselves
   – alpha-helix,
• Experiment, experimental results, researchers
   – our results, these studies, we
 Verbs Related to Biological Events
        Frequent Verbs in 100 MEDLINE Abstracts

Verb          Count    Verb          Count    Verb        Count    Verb           Count
be               255   involve           16   determine        9   explain             6
induce            56   identify          16   construct        9   exert               6
bind              50   act               15   associate        9   enhance             6
show              49   stimulate         14   reduce           8   display             6
suggest           42   provide           14   prevent          8   characterize        6
activate          42   express           13   locate           8   participate         5
factor            36   affect            13   line             8   localize            5
demonstrate       35   type              12   differ           8   investigate         5
inhibit           26   report            12   trigger          7   imply               5
have              25   form              12   synergize        7   establish           5
reveal            21   contribute        12   examine          7   conclude            5
require           21   study             11   block            7   compare             5
regulate          21   observe           11   become           7   use                 4
indicate          21   lead              11   analyze          7   transform           4
find              21   function          11   target           6   transfect           4
result            20   assay             11   signal           6   test                4
play              19   appear            11   remain           6   suppress            4
interact          18   occur             10   produce          6   support             4
mediate           17   increase          10   present          6   substitute          4
contain           17   phosphorylate      9   possess          6   share               4
    Verbs Related to Biological Events
           Verbs that take biological entities as arguments

• induce
   – noun BE INDUCED BY noun             activation of these PROTEIN was induced by PROTEIN
   – noun INDUCE noun                    PROTEIN induced the tyrosine phosphorylation
• bind
   – noun BIND TO noun                   the drugs bind to two different PROTEIN
   – noun BIND noun                      motifs previously found to bind the cellular factors
   – noun BINDING noun                   the TATA-box binding protein
   – the BINDING of noun                 the binding of PROTEIN


            semantic class: substance structure source experiment fact reaction
   Verbs Related to Biological Events
     Verbs whose arguments depend on syntactic patterns


• show
  – noun BE SHOWN to-infinitive   PROTEIN has been shown to trigger cellular PROTEIN activity
  – noun SHOW that-clause         the data show that PROTEIN stimulation is also not sufficient
  – noun SHOW noun                 SOURCE showed a dose-dependent inhibition of PROTEIN activity




           semantic class: substance source experiment fact
    Verbs Related to Biological Events
                           Verbs that take both entities

• indicate
   – noun INDICATE that-clause      the data indicate that PROTEIN is required in CELL proliferation
   – noun INDICATE noun             these findings indicate an unexpected role of DNA

   – noun INDICATE that-clause      the structure indicates that it represents a unique class of PROTEIN
   – noun INDICATE noun             the structure indicates mechanisms for allosteric effector action




             semantic class: substance structure source experiment fact reaction role
                          Example of Annotated Texts
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group"
     unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human
     mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class"
     cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV"
     unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a
     Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium"
     mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different
     concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK"
     cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-
     26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha,
     17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17
     beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8"
     class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9"
     class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single
     class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE
     ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity
     data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV"
     unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone"
     mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone"
     mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds"
     nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15"
     class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The
     results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear
     leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein"
     nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE
     ti="16"> and their regulation in humans.
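
The close tags in this annotation format repeat the ti attribute (for example </NE ti="4">), so the text is not well-formed XML and a standard XML parser would reject it. The following is a minimal sketch, assuming a regular-expression approach is acceptable for flat (non-nested) annotations like the ones above. The function name extract_entities and both patterns are illustrative assumptions, not part of any GENIA tooling.

```python
import re
from collections import Counter

# Matches <NE ti="N" ...>text</NE ti="N">. The close tag repeats the ti value,
# and may be broken across lines, so whitespace after "NE" is matched loosely.
NE_PATTERN = re.compile(
    r'<NE\s+ti="(?P<ti>\d+)"(?P<attrs>[^>]*)>(?P<text>.*?)</NE\s+ti="(?P=ti)">',
    re.DOTALL,
)
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(annotated: str) -> list[dict]:
    """Return one dict per NE element: ti, class, subclass, and surface text."""
    entities = []
    for match in NE_PATTERN.finditer(annotated):
        attrs = dict(ATTR_PATTERN.findall(match.group("attrs")))
        entities.append({
            "ti": int(match.group("ti")),
            "class": attrs.get("class"),
            "subclass": attrs.get("subclass"),  # None when the tag has no subclass
            "text": " ".join(match.group("text").split()),  # collapse line breaks
        })
    return entities


# A fragment copied from the annotated abstract above.
sample = (
    '<NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" '
    'unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of '
    '<NE ti="10" class="protein" nm="receptor" mt="SV" '
    'subclass="family_or_group" unsure="OK" cmt="">receptors</NE\n     ti="10">'
)
entities = extract_entities(sample)
class_counts = Counter(entity["class"] for entity in entities)
```

Counting the class attribute over a whole corpus in this way would give the kind of per-class frequencies summarized on the following slides.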
           Distribution of Semantic Classes of NEs
[Pie chart. Segment labels: organism, tissue, cell type, sub-location of cells, other (natural source), cell line, artificial source, protein, peptide, amino acid monomer, DNA, RNA, other polymer, nucleic acid monomer, carbohydrate, lipid, other organic compounds, inorganic, atom, other name]
          Subclass distributions in major classes
[Four pie charts, one per major class. Segment labels:
 protein:  family or group, complex, molecule, subunit, substructure, domain or region, N/A
 DNA:      family or group, molecule, substructure, domain or region, N/A
 RNA:      family or group, molecule, substructure, domain or region, N/A
 organism: multi-cell organism, mono-cell organism, virus]
TAG NAME                    sub class              Count
organism                    multi-cell organism      477
                            mono-cell organism        20
                            virus                    153
tissue                      -                        213
cell type                   -                       1478
sub-location of cells       -                         79
other (natural source)      -                          1
cell line                   -                        695
other (artificial source)   -                          7
protein                     family or group         1172
                            complex                  170
                            molecule                1181
                            subunit                   65
                            substructure              29
                            domain or region          77
                            N/A                       98
peptide                     -                         40
amino acid monomer          -                         27
DNA                         family or group           29
                            complex                    0
                            molecule                  81
                            subunit                    0
                            substructure              41
                            domain or region         770
                            N/A                       24
RNA                         family or group           13
                            complex                    0
                            molecule                  80
                            subunit                    0
                            substructure               1
                            domain or region           2
                            N/A                        4
other polymer               -                         43
nucleic acid monomer        -                         47
lipid                       -                       1113
carbohydrate                -                         10
other organic compounds     -                        829
inorganic                   -                         29
atom                        -                         29
other name                  -                       2850
                Interface Representation
[Diagram. Labels: Texts; Linguistic interface (from/Extracted); Knowledge Interface (Defined by); Data and Knowledge bases; Thesaurus; Ontology for Knowledge]
Compound Graph for Pheromone Signal Transduction Pathway
[Diagram. Node labels:
 G-protein coupled receptor complex: ligand, GPCR, G protein (Alpha GPA1, Beta STE4, Gamma STE18)
 STE5 MAPK scaffold structure: STE5, STE11/MKKK, STE7/MKK, and the MAPKs FUS3 and KSS1; each kinase is shown in a non-phosphorylated (Non phos) and a phosphorylated (Phos) form
 Other nodes: CDC42 (GTP), CDC42 (GDP), STE24, BEM1, STE20/MKKKK, STE12, FAR1
 Outcomes: Cell polarization, Transcription of Mating-specific genes, Cell cycle arrest]
              SIGNAL-ONTOLOGY
               ontology for cell signaling
 • SIGNAL MODULE: a unit of signal processing common to the
   model species
 • MOLECULAR FUNCTION: biochemical properties of a molecule
 • CELLULAR FUNCTION: a biological response performed by a set
   of molecules
 • REACTION: biochemical properties of a signaling reaction
 • MOLECULE, TISSUE, CELL, SPECIES: general in genome ontologies

 ~500 terms
 Terms are linked to Gene Ontology
             (3) Uncertain and Negative Information
Example ①
      Activation_Event:
               EID:=95080245.2
               Protein_1:="IFN alpha"
                        Domain_1:= φ
               Protein_2:="STAT1","STAT2","STAT3"
                        Domain_2:= φ
               Location:="T cells"
               Definiteness:=definite
               Finding:=new
               Reaction_Type:= φ
               Reaction_Path:=direct
               Mode:=affirmative
               SynSet:=<"T cells","human peripheral blood-derived T cells">


      Example: "IFN alpha activated STAT1, STAT2 and STAT3 in T cells, but
      no detectable activation of these STATs was induced by IL-2."
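
The event frame above can be read as a typed record. The following is a minimal sketch, assuming a Python dataclass is an acceptable stand-in for the template. The class name, field names, and defaults mirror the slots of Example ① but are illustrative assumptions. The second record, for the negated clause about IL-2, is hypothetical: its EID placeholder and the Mode value "negative" do not appear in the source.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class ActivationEvent:
    eid: str
    protein_1: list[str]
    protein_2: list[str]
    location: Optional[str] = None
    domain_1: Optional[str] = None       # None marks a slot the frame leaves unfilled
    domain_2: Optional[str] = None
    definiteness: str = "definite"
    finding: str = "new"
    reaction_type: Optional[str] = None  # unfilled in Example 1
    reaction_path: str = "direct"
    mode: str = "affirmative"
    syn_set: list[str] = field(default_factory=list)


# Slot values copied from Example 1.
affirmed = ActivationEvent(
    eid="95080245.2",
    protein_1=["IFN alpha"],
    protein_2=["STAT1", "STAT2", "STAT3"],
    location="T cells",
    syn_set=["T cells", "human peripheral blood-derived T cells"],
)

# Hypothetical frame for "no detectable activation ... was induced by IL-2".
# The EID placeholder and the value "negative" are assumptions, not from the source.
negated = replace(affirmed, eid="(hypothetical)", protein_1=["IL-2"], mode="negative")

events = [affirmed, negated]
affirmative_ids = [event.eid for event in events if event.mode == "affirmative"]
```

Keeping Mode and Definiteness as explicit slots lets a downstream query separate asserted findings from negated or uncertain ones instead of discarding the latter during extraction.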
               GENIA ontology
            (Ontology of Substances)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |           +-mono-cell organism
       |        |         |           +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       |
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                         |       |         +-protein complex
                                         |       |         +-individual protein molecule
                                         |       |         +-subunit of protein complex
                                         |       |         +-substructure of protein
                                         |       |         +-domain or region of protein
                                         |       +-peptide
                                         |       +-amino acid monomer
                                         |
                                         +-nucleic-+-DNA-+-DNA family or group
                                                   |     +-individual DNA molecule
                                                   |     +-domain or region of DNA
                                                   |
                                                   +-RNA-+-RNA family or group
                                                         +-individual RNA molecule
                                                         +-domain or region of RNA
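
The hierarchy above can be encoded directly as a nested mapping, which makes class membership checks straightforward. The following is a minimal sketch, assuming only the portion of the tree shown on this slide. The names ONTOLOGY, path_to, and is_a are illustrative assumptions, not part of the GENIA distribution.

```python
# Each key is a node label; each value is a dict of that node's children.
ONTOLOGY = {
    "name": {
        "source": {
            "natural": {
                "organism": {
                    "multi-cell organism": {},
                    "mono-cell organism": {},
                    "virus": {},
                },
                "tissue": {},
                "cell type": {},
                "sub-location of cells": {},
            },
            "artificial": {"cell line": {}},
        },
        "substance": {
            "compound": {
                "organic": {
                    "amino": {
                        "protein": {
                            "protein family or group": {},
                            "protein complex": {},
                            "individual protein molecule": {},
                            "subunit of protein complex": {},
                            "substructure of protein": {},
                            "domain or region of protein": {},
                        },
                        "peptide": {},
                        "amino acid monomer": {},
                    },
                    "nucleic": {
                        "DNA": {
                            "DNA family or group": {},
                            "individual DNA molecule": {},
                            "domain or region of DNA": {},
                        },
                        "RNA": {
                            "RNA family or group": {},
                            "individual RNA molecule": {},
                            "domain or region of RNA": {},
                        },
                    },
                },
            },
        },
    },
}


def path_to(label, tree=ONTOLOGY, prefix=()):
    """Depth-first search; returns the tuple of labels from the root to `label`."""
    for node, children in tree.items():
        here = prefix + (node,)
        if node == label:
            return here
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None


def is_a(label, ancestor):
    """True when `ancestor` lies on the path from the root to `label`."""
    path = path_to(label)
    return path is not None and ancestor in path
```

For instance, is_a("domain or region of DNA", "substance") holds, while is_a("cell line", "natural") does not, because cell line sits under the artificial branch.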

				