Molecular Biology Databases

					         Research in the Verspoor Lab

Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine




                                  Karin.Verspoor@ucdenver.edu
                                  http://compbio.ucdenver.edu/Hunter_lab/Verspoor
               Generally speaking…
• Focus on analysis of the biomedical literature
• For the purpose of:
  – Turning unstructured data (natural language text)
    into structured statements
  – Taking advantage of the wealth of information in
    the literature for biological data analysis
• Using (analyzing, building) semantic resources
  for the biomedical domain
           Today: Focus on Ontologies
• Use of the structure of ontologies to
    understand relations among protein
    annotations
•   Analysis of the term structure of ontologies
•   Particular ontology of interest:
    Gene Ontology
           Gene Ontology (GO)
•   Taxonomic controlled vocabulary
•   ~16K nodes P_GO, populated by genes, proteins
•   Two orders on P_GO: ≤_isa, ≤_has
                      Gene Ontology Consortium (2000): “Gene Ontology: Tool For the
                          Unification of Biology”, Nature Genetics, 25:25-29
             The Gene Ontology: Usage
•   33703 terms
– 20403 biological_process
– 2810 cellular_component
– 8996 molecular_function
•   Gene Annotations for 40+
    organisms
•   3504 publications in
    PubMed matching “gene
    ontology” (3/8/11)
•   ISI Web of Knowledge:
    5371 refs to GO paper
                               Graph statistics as of June 9, 2009
          Protein Function Prediction
• Verspoor, K., Cohn, J., Mniszewski, S., and
  Joslyn, C. (2006). A Categorization Approach
  to Automated Ontological Function
  Annotation. Protein Science, v. 15, pp. 1544-1549.
               Automated Protein Function
                      Annotation
• Mappings
  – From regions of sequence, structure, keyword spaces
  – Into regions of biological function space:
    • taxonomic bio-ontologies of molecular function
• Characterize formal structure of bio-ontologies:
  – Order theoretical approaches
  – Combinatorial algorithms
[Figure: Sequences, Structures, and Keywords/Literature spaces mapped into the Functions space]
           POSOLE: POSet Ontology Laboratory Environment


•   POSOLE: a general environment for ontology experimentation
     –   Graph representation of an ontology as a POSet
     –   POSet statistics analysis (e.g. depth, width, average rank)
     –   Algorithms for node categorization utilizing the structure of the ontology
•   First Deployment: Ontology categorization for automated protein
    function annotation
     –   Function: Gene Ontology node
     –   Protein: target sequence or Swiss-Prot identifier
     –   Map proteins to sets of potential Gene Ontology nodes
     –   Ontology categorization: “clustering” nodes in ontology space to identify
         the most likely node assignment
•   Dual Queries: Text and sequence neighborhoods
                   POSOLE strategy
• Function Prediction as Categorization of
 Nearest Neighbors
  • Application of POSOC categorization methodology
    utilizing the Gene Ontology structure to find the best
    covering nodes given a set of node “hits”
  • “Hits” are based on (application-dependent) mappings
    from neighbors of an input protein to Gene Ontology
    nodes
  • Covering nodes are function annotation predictions
                   POSOLE architecture
•   PosoleRun, core of each
    application
     –   Load the graph (GO)
     –   Build a query, a set of query
         items
     –   Categorize the query items




•   Each application defines its own
    QueryBuilder
                        Categorization Task: POSOC
                      “Cluster” Genes in Ontology Space

                      http://www.c3.lanl.gov/posoc/
•   Given the Gene Ontology (GO) . . . And mappings to GO nodes . . .
•   “Splatter” them over the GO . . . Where do they end up?
     –   Concentrated?           -- Dispersed?
     –   Clustered?              -- High or low?
     –   Overlapping or distinct?
•   Pseudo-distances between comparable nodes to measure vertical separation
•   POSOC traverses the structure of the GO, percolating hits upwards, and
    calculating scores for GO nodes.
•   Scores to rank-order nodes with respect to gene locations, balancing:
     – Coverage: Covering as many genes as possible
     – Specificity: But at the “lowest level” possible
•   “Cluster” based on non-comparable high score nodes


                      Joslyn, Cliff; Mniszewski, Susan; Fulmer, Andy; and Heaton, Gary: (2004) “The Gene Ontology
                      Categorizer”, Bioinformatics, v. 20:s1, pp. 169-177
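The percolate-and-score idea can be illustrated with a minimal sketch. This is not the published POSOC scoring (which uses pseudo-distances between comparable nodes); the toy ontology, the `rank_nodes` name, and the score formula (covered weight divided by a power of the number of nodes at or below each node) are all assumptions made up for illustration of the coverage vs. specificity trade-off.

```python
from collections import defaultdict

# Toy ontology as child -> parents (a DAG). Node names are hypothetical.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}

def ancestors_or_self(node, parents=PARENTS):
    """Return the node together with every node above it."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def percolate(hits, parents=PARENTS):
    """Push each weighted hit upward so every ancestor accumulates it."""
    covered = defaultdict(float)
    for node, weight in hits.items():
        for upper in ancestors_or_self(node, parents):
            covered[upper] += weight
    return dict(covered)

def rank_nodes(hits, specificity=1.0, parents=PARENTS):
    """Rank nodes by covered weight, discounted by how general the node is.

    Illustrative score only: (fraction of hit weight covered) divided by
    (number of nodes at or below the node) ** specificity.
    """
    covered = percolate(hits, parents)
    total = sum(hits.values())
    below = defaultdict(int)  # size of each node's down-set, itself included
    for node in parents:
        for upper in ancestors_or_self(node, parents):
            below[upper] += 1
    scores = {n: (covered[n] / total) / (below[n] ** specificity) for n in covered}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

hits = {"dna_binding": 1.0, "rna_binding": 1.0, "kinase": 1.0}
```

With `specificity=0.0` the root wins because it covers every hit; raising the exponent pushes the ranking toward the lowest-level nodes.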
                       Order Theoretical
                     Categorization Method
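•   Represent GO as labeled, finite ordered set
•   Given labels (genes) c, e, i . . .
•   What node(s) A, B, C, . . . , K are best to attend to?
     –   C
     –   {H, J}
     –   {A, H, J}
[Figure: diagram of an example ordered set with top node 1 and nodes A through K; gene labels shown on nodes: A: a,b,c; D: j; E: b; F: b,d; H: e; I: f; J: g,h,i]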
POSOLE applications
                 Application: BioCreAtIvE I, Task 2
               Critical Assessment of Information Extraction in Biology

•   Automatic assignment of Gene Ontology annotations to human proteins
    based on a journal publication
     –   Given a Swiss-Prot/TrEMBL protein ID and a document, predict a GO node
         to which the protein should be annotated
     –   Also return the evidence text from the document supporting the annotation
•   Strategy: Annotation as Categorization of Document Neighborhood
     •   Application of POSOC categorization utilizing the Gene Ontology structure
         to find the best covering nodes given a set of node “hits”
     •   “Hits” in this case are based on overlaps between input terms and GO node
         terms (in labels, definitions)
               POSOC as applied to context
                        terms
•   Collect all terms in a context window of n sentences around any
    reference to the protein of interest
•   Transform an input query into a set of node hits:
     –   Morphologically normalize GO node labels
     –   Look for any overlaps between input terms and terms in the normalized node
         labels
     –   An overlap = a node hit, with strength based on the input weight of the term
         (from TFIDF)
     –   Multiple overlaps on a given node count as multiple hits
•   POSOC returns a set of GO nodes representing cluster heads for
    weighted term input set, and data on which input terms contributed to
    the selection of each cluster head: Annotation predictions
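A minimal sketch of the term-to-node-hit step follows. It assumes the TFIDF weights have already been computed, and it replaces real morphological normalization with a crude plural-stripping rule; `normalize` and `node_hits` are hypothetical names, and the node labels are ones that appear elsewhere in this deck.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, tokenize, and strip a trailing plural 's'.

    A crude stand-in for real morphological normalization.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def node_hits(weighted_terms, node_labels):
    """Turn weighted input terms into GO node hits via label-token overlap.

    Every overlap is recorded, so several overlaps on one node
    count as several hits.
    """
    hits = defaultdict(list)
    for node_id, label in node_labels.items():
        label_tokens = set(normalize(label))
        for term, weight in weighted_terms.items():
            for token in normalize(term):
                if token in label_tokens:
                    hits[node_id].append((term, weight))
    return dict(hits)

labels = {
    "GO:0003735": "structural constituent of ribosome",
    "GO:0005200": "structural constituent of cytoskeleton",
    "GO:0004962": "endothelin receptor activity",
}
# Weights are made-up stand-ins for TFIDF scores.
context_terms = {"ribosomes": 0.8, "structural": 0.3, "receptor": 0.5}
result = node_hits(context_terms, labels)
```

The resulting hit lists, with their weights, are the kind of input that would then be handed to the categorizer.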
                             BioLASER:
           Los Alamos Semantic Event Recognizer for Biology

•   Text analysis
    environment:
     –   Relation extraction
     –   Term vector analysis
•   Domain-specific and
    application-specific
    components
•   Markup workflow
    implementation
              Application: CASP-6 Function
                        Prediction
               Critical Assessment of Structure Prediction evaluation
                             Function Prediction subtask

•   Automatic assignment of Gene Ontology
    annotations to target protein sequences
•   Strategy: Annotation as Categorization of Sequence
    Neighborhood
     •   Application of POSOC categorization utilizing the
         Gene Ontology structure to find the best covering
         nodes given a set of node “hits”
     •   “Hits” in this case are based on known mappings
         from proteins in the sequence neighborhood of the
         target to Gene Ontology nodes
CASP architecture
                         CASP Evaluation
•   Test set
     – proteins with known Gene Ontology mappings
     – 4530 SwissProt protein sequences associated from PDB
     – Protein to GO Mappings derived from UniProt
•   Eliminate PSI-BLAST identity matches from mappings used in
    prediction
     – Matches to protein with the same SwissProt Accession ID
     – Matches to protein with an accession ID that maps to the same
        SwissProt Entry ID
     – Matches to protein with an e-value < 10^-130 or e-value < max
        e-value for known identity match
•   Goal: compare function predictions made by the system with known
    functions assigned to each input protein
                               CASP Evaluation runs
                     POSOC: Full Neighborhood          POSOC: Best Blast
                     Baseline: Full Neighborhood       Baseline: Best Blast

•   Baseline Best Blast: Predictions are the GO nodes associated with the non-identical protein scoring
    highest in the PSI-BLAST analysis. All predicted GO nodes are considered to be at rank 1.
•   Baseline Full Neighborhood: Predictions are the GO nodes associated with all proteins matched in
    the PSI-BLAST analysis (with e-value < 10). The predictions are ranked according to the e-value of the
    corresponding PSI-BLAST match.
•   POSOC Best Blast: Inputs to POSOC are the GO nodes associated with the non-identical protein
    scoring highest in the PSI-BLAST analysis, weighted by the e-value of the match. POSOC categorizes and
    ranks these inputs to produce the predictions.
•   POSOC Full Neighborhood: Inputs to POSOC are the GO nodes associated with all proteins matched in
    the PSI-BLAST analysis, weighted by the e-value of the match. POSOC categorizes and ranks these inputs to
    produce the predictions.
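The two baselines can be sketched as below. The `Hit` record, the function names, and the node identifiers are hypothetical. Only two of the identity filters are shown (same accession, e-value below 10^-130). How ties and repeated nodes are ranked is not stated in the slides, so keeping each node's best e-value and breaking ties by name is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    accession: str    # accession of the matched protein
    evalue: float     # PSI-BLAST e-value of the match
    go_nodes: tuple   # GO nodes the matched protein is mapped to

def drop_identity_matches(hits, target_accession, identity_evalue=1e-130):
    """Remove matches that are (or look like) the target protein itself."""
    return [h for h in hits
            if h.accession != target_accession and h.evalue >= identity_evalue]

def baseline_best_blast(hits):
    """GO nodes of the single best-scoring match, all placed at rank 1."""
    if not hits:
        return []
    best = min(hits, key=lambda h: h.evalue)
    return [(node, 1) for node in best.go_nodes]

def baseline_full_neighborhood(hits, max_evalue=10.0):
    """GO nodes of every match under the cutoff, ranked by e-value."""
    best_evalue = {}
    for h in hits:
        if h.evalue < max_evalue:
            for node in h.go_nodes:
                best_evalue[node] = min(h.evalue, best_evalue.get(node, float("inf")))
    ordered = sorted(best_evalue.items(), key=lambda kv: (kv[1], kv[0]))
    return [(node, rank) for rank, (node, _) in enumerate(ordered, start=1)]

# Placeholder data, not real PSI-BLAST output.
raw_hits = [
    Hit("P1", 0.0, ("node_a",)),
    Hit("P2", 1e-50, ("node_b", "node_c")),
    Hit("P3", 1e-5, ("node_c", "node_d")),
    Hit("P4", 50.0, ("node_e",)),
]
neighbors = drop_identity_matches(raw_hits, target_accession="P1")
```

The POSOC runs would take the same filtered neighborhood but pass the weighted nodes through the categorizer instead of ranking them directly.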
                     Evaluation analysis
• Precision/Recall
   – Precision = % of predictions that are correct
                $P = \frac{|F(x) \cap G(x)|}{|G(x)|}$
   – Recall = % of known predictions that are recovered
                $R = \frac{|F(x) \cap G(x)|}{|F(x)|}$

• Extension to ranked list of predictions
   – Consider precision/recall at different ranks
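A small sketch of set-based precision/recall and its extension to ranks, reading F(x) as the known annotations and G(x) as the predictions (which is what the denominators imply). The function names and the `(node, rank)` representation are assumptions for illustration.

```python
def precision_recall(known, predicted):
    """Exact-match precision and recall between two sets of GO nodes."""
    known, predicted = set(known), set(predicted)
    correct = known & predicted
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(known) if known else 0.0
    return precision, recall

def precision_recall_at_rank(known, ranked_predictions, k):
    """Score only the predictions whose rank is at most k.

    ranked_predictions is a list of (node, rank) pairs; several
    nodes may share a rank.
    """
    kept = [node for node, rank in ranked_predictions if rank <= k]
    return precision_recall(known, kept)

known = {"A", "B", "C"}
ranked = [("A", 1), ("D", 1), ("B", 2), ("E", 3)]
```

For this toy input, precision at rank 1 is 1/2 and recall is 1/3; at rank 2 both are 2/3.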
               Ontological Distance Metrics
•   How “far apart” are p and q?
•   Genealogical approach:

     • Radius 0: Equals: Direct match
     • Radius 1: Nuclear family: Parents,
       children, siblings
     • Radius 2: Extended family:
       grandparents, grandchildren, cousins,
       aunts/uncles, nieces/nephews
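One way to read the genealogical radii: q is within radius r of p if q can be reached from p by climbing at most r levels and then descending at most r levels. The sketch below implements that reading on a toy hierarchy; the function names and the `FAMILY` data are hypothetical, and this is an interpretation of the slide rather than the evaluated implementation.

```python
def step(nodes, relation):
    """Follow one relation edge from every node in the set."""
    out = set()
    for n in nodes:
        out.update(relation.get(n, ()))
    return out

def relatives(node, parents, radius):
    """Nodes reachable by going up i levels, then down j levels, with i, j <= radius."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)
    found = set()
    up = {node}
    for _ in range(radius + 1):
        down = set(up)
        for _ in range(radius + 1):
            found |= down
            down = step(down, children)
        up = step(up, parents)
    return found

def family_radius(p, q, parents, max_radius=2):
    """Smallest radius at which q is a relative of p, or None if beyond max_radius."""
    for r in range(max_radius + 1):
        if q in relatives(p, parents, r):
            return r
    return None

# Toy hierarchy: child -> parents.
FAMILY = {
    "gp": [], "p": ["gp"], "aunt": ["gp"],
    "me": ["p"], "sib": ["p"], "cousin": ["aunt"],
    "kid": ["me"], "niece": ["sib"], "other": [],
}
```

Under this reading, siblings are "up 1, down 1" (radius 1), while cousins are "up 2, down 2" and aunts/uncles are "up 2, down 1" (both radius 2).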
Evaluation results: Precision
Evaluation results: Recall
                     Evaluation of Ontological
                            predictions
•   Extension to ontological predictions:
    when does a GO node p in F(x) count
    as a “match” against a q in G(x)?
    – What about siblings? Ancestors?
    – Partial credit?
        •   Based on proximity
        •   Based on specificity
•   Adapt hierarchical precision/recall
    measure from Kiritchenko et al 2005

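    $P = \sum_{q \in G(x)} \max_{p \in F(x)} \frac{|{\uparrow}p \cap {\uparrow}q|}{|{\uparrow}q|}$

    $R = \sum_{p \in F(x)} \max_{q \in G(x)} \frac{|{\uparrow}p \cap {\uparrow}q|}{|{\uparrow}p|}$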
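A sketch of the hierarchical measure on a toy ontology, taking the up-set of a node to be the node plus all of its ancestors, F(x) as the known nodes, and G(x) as the predictions. Dividing each sum by the number of terms, so the result stays between 0 and 1, is an assumption; the slide as extracted shows only the sums. All names and the toy graph are hypothetical.

```python
# Toy ontology as child -> parents. Node names are hypothetical.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}

def up_set(node, parents=PARENTS):
    """The node together with all of its ancestors."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def hierarchical_precision(known, predicted, parents=PARENTS):
    """Average over predictions q of the best shared-ancestor fraction of q."""
    if not known or not predicted:
        return 0.0
    total = 0.0
    for q in predicted:
        up_q = up_set(q, parents)
        total += max(len(up_set(p, parents) & up_q) / len(up_q) for p in known)
    return total / len(predicted)  # normalization is an assumption

def hierarchical_recall(known, predicted, parents=PARENTS):
    """Average over known nodes p of the best shared-ancestor fraction of p."""
    if not known or not predicted:
        return 0.0
    total = 0.0
    for p in known:
        up_p = up_set(p, parents)
        total += max(len(up_p & up_set(q, parents)) / len(up_p) for q in predicted)
    return total / len(known)  # normalization is an assumption
```

Predicting a parent of the correct node gets full precision but only partial recall, while predicting a sibling gets partial credit on both, which is the "partial credit based on proximity and specificity" behavior asked for above.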
                  Hierarchical Precision vs. Rank
                  (Cellular Component branch)
[Figure: Precision vs Rank (Cellular Component). x-axis: Rank (1 to 26); y-axis: Precision (0 to 1). Series: Baseline Best Blast; Baseline Full Neighborhood; POSOC (Spec 4) Best Blast; POSOC (Spec 4) Full Neighborhood.]
                  Hierarchical Precision vs. Rank
                   (Molecular Function branch)
[Figure: Precision vs Rank (Molecular Function). x-axis: Rank (1 to 26); y-axis: Precision (0 to 1). Series: Baseline Best Blast; Baseline Full Neighborhood; POSOC (Spec 4) Best Blast; POSOC (Spec 4) Full Neighborhood.]
                  Hierarchical Precision vs. Rank
                   (Biological Process branch)
[Figure: Precision vs Rank (Biological Process). x-axis: Rank (1 to 26); y-axis: Precision (0 to 1). Series: Baseline Best Blast; Baseline Full Neighborhood; POSOC (Spec 4) Best Blast; POSOC (Spec 4) Full Neighborhood.]
                         Summary:
                Protein Function Prediction
•   We have constructed the POSOLE architecture, supporting
    integration of mappings from different spaces into function space
•   We utilize the mathematical structure of function space as defined by
    the Gene Ontology to help identify commonalities and “clusters”, as
    well as in evaluation
•   We have proposed an extension to Kiritchenko et al’s hierarchical
    precision/recall measure to support comparison of sets of
    predictions and answers
•   The results on CASP function prediction show the promise of the
    POSOLE and POSOC technologies for automated annotation of
    protein sequences.
         Ontology Quality Assurance
• Verspoor, K., Dvorkin, D., Cohen, K.B.,
  Hunter, L. (2009) Ontology quality assurance
  through analysis of term transformations.
  Bioinformatics 25(12):i77-i84.
           Key quality concern: Univocality
• Univocality = one voice                 (Spinoza, 1677)
    “a shared interpretation of the nature of reality”
    (with thanks to David Hill @ Jackson Lab)

    • Consistency of expression of concepts
• Regular, compositional, linguistic structure
    – Facilitates human usability
    – Computational tools can utilize this regularity
Regulation of transcription              Transcription Regulation
Positive regulation of cell migration    Cell migration positive regulation
         Quality Assurance in the GO
• Goal: identify violations of univocality
• Problem: the GO is generally very high quality;
  how to identify the few inconsistencies?

• Hypothesis: violations of univocality will
  correspond to transformational variants

• Strategy: term transformation & clustering
           GO Term Transformation:
                Abstraction
• Substitution of embedded GO & ChEBI terms
          toluene oxidation via 3-hydroxytoluene
               CTERM oxidation via CTERM

                 regulation of coagulation
                   regulation of GTERM

     leukotriene production during acute inflammatory
                          response
            CTERM production during GTERM
         GO Term Transformations
• Stopword removal
        toluene oxidation via 3-hydroxytoluene
          toluene oxidation 3-hydroxytoluene

               regulation of coagulation
                regulation coagulation
• Alphabetic reordering
        3-hydroxytoluene oxidation toluene via
               coagulation of regulation
        Transformation combinations
• Abstraction=1, StopRemoval=1, Reordering=1
        toluene oxidation via 3-hydroxytoluene
              CTERM CTERM oxidation

               regulation of coagulation
                  GTERM regulation

  leukotriene production during acute inflammatory
                      response
              CTERM GTERM production
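The three transformations can be sketched in a few lines. The stopword list and the two mini-dictionaries of embedded GO and ChEBI terms are assumptions that cover only the examples on these slides; a real run would draw them from the full GO and ChEBI vocabularies. Function names are hypothetical.

```python
import re

# Assumed resources, just large enough for the slide examples.
STOPWORDS = {"of", "via", "during", "in", "the", "a", "an", "by", "to"}
GO_TERMS = ["acute inflammatory response", "coagulation"]
CHEBI_TERMS = ["3-hydroxytoluene", "leukotriene", "toluene"]

def abstract(term):
    """Replace embedded GO terms with GTERM and ChEBI terms with CTERM."""
    for names, tag in ((GO_TERMS, "GTERM"), (CHEBI_TERMS, "CTERM")):
        # Longest names first, so "3-hydroxytoluene" wins over "toluene".
        for name in sorted(names, key=len, reverse=True):
            pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
            term = re.sub(pattern, tag, term)
    return term

def remove_stopwords(term):
    """Drop stopwords, keeping the remaining words in order."""
    return " ".join(w for w in term.split() if w not in STOPWORDS)

def reorder(term):
    """Sort the words alphabetically (by code point)."""
    return " ".join(sorted(term.split()))

def transform(term, a=True, s=True, r=True):
    """Apply abstraction, stopword removal, and reordering as selected."""
    if a:
        term = abstract(term)
    if s:
        term = remove_stopwords(term)
    if r:
        term = reorder(term)
    return term
```

With all three switches on, `transform("regulation of coagulation")` gives `GTERM regulation`, matching the slide; the cluster keys shown later are additionally stemmed, which this sketch does not attempt.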
                                         Clustering
• Group together all terms with a common form
    after transformation
•   Perform clustering for different combinations
    of transformations

asr {GTERM constit structu}
         GO:0005201 -- extracellular matrix structural constituent
         GO:0005199 -- structural constituent of cell wall
         GO:0005213 -- structural constituent of chorion
         GO:0005200 -- structural constituent of cytoskeleton
         GO:0003735 -- structural constituent of ribosome
         GO:0017056 -- structural constituent of nuclear pore
         GO:0019911 -- structural constituent of myelin sheath
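Clustering then amounts to grouping terms by their transformed form. A minimal sketch, assuming a tiny hypothetical list of embedded GO terms and naive substring replacement for the abstraction step; the keys here are left unstemmed, unlike the stemmed key shown in the example above.

```python
from collections import defaultdict

# Assumed resources for illustration only.
STOPWORDS = {"of"}
EMBEDDED_GO_TERMS = ["extracellular matrix", "cell wall", "ribosome"]

def cluster_key(label):
    """Abstract embedded GO terms, drop stopwords, sort words alphabetically."""
    for name in sorted(EMBEDDED_GO_TERMS, key=len, reverse=True):
        label = label.replace(name, "GTERM")  # naive substring replacement
    words = [w for w in label.split() if w not in STOPWORDS]
    return " ".join(sorted(words))

def cluster(terms):
    """Group GO identifiers whose labels share a transformed form."""
    clusters = defaultdict(list)
    for go_id, label in terms.items():
        clusters[cluster_key(label)].append(go_id)
    return dict(clusters)

terms = {
    "GO:0005201": "extracellular matrix structural constituent",
    "GO:0005199": "structural constituent of cell wall",
    "GO:0003735": "structural constituent of ribosome",
    "GO:0031100": "organ regeneration",
}
clusters = cluster(terms)
```

The first three terms fall into one cluster even though one of them uses the "X Y" order and the others use "Y of X", which is exactly the kind of variation the manual review looks for.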
                  Analysis of clusters




• Heuristic search:
  – Consider only clusters with abstraction (a±±)
  – Identify terms in distinct a-- clusters, but merge
    together in a-r, as-, or asr.
• Manual assessment of 190 clusters
            Transformation Impact




• 25,539 source GO terms (12/2007 version)
• Pre-processing reduces to 23,478 (8%)
• a=Abstraction, s=StopRemoval, r=Reordering
• Abstraction has most impact: 46% reduction
[Figure: Abstraction breakdown, a-- clusters]
[Figure: Distribution of cluster size. Panels: --- transformation; asr transformation]
             True Positive clusters
• 67 clusters
• 317 GO terms
• Obsolete term filter: 7 clusters, 32 terms
• Approximately 77 term rephrasings anticipated
                 True Positive inconsistencies
• {X Y} ≈ {Y of X} | {Y in X} [45%]
{GTERM GTERM organis symbion}
   GO:0052387 -- induction by organism of symbiont apoptosis
   GO:0052351 -- induction by organism of systemic acquired resistance in symbiont
   GO:0052350 -- induction by organism of induced systemic resistance in symbiont
   GO:0052560 -- induction by organism of symbiont immune response
   GO:0052399 -- induction by organism of symbiont programmed cell death
   GO:0052396 -- induction by organism of symbiont non-apoptotic programmed cell death

{GTERM multice organis}
   GO:0010259 -- multicellular organismal aging
   GO:0022412 -- reproductive cellular process in multicellular organism
   GO:0032504 -- multicellular organism reproduction
   GO:0033057 -- reproductive behavior in a multicellular organism
   GO:0033555 -- multicellular organismal response to stress
   GO:0035264 -- multicellular organism growth
                                 True Positives (2)
• Determiners [16%]
{GTERM forebra}
   GO:0021861 -- radial glial cell differentiation in the forebrain
   GO:0021846 -- cell proliferation in forebrain
   GO:0021872 -- generation of neurons in the forebrain

{GTERM organ}
   GO:0031100 -- organ regeneration
   GO:0035265 -- organ growth
   GO:0010260 -- organ senescence
   GO:0001759 -- induction of an organ
                               True Positives (3)
• Other alternations [16%]
{GTERM selecti site}
   GO:0000282 -- cellular bud site selection
   GO:0000918 -- selection of site for barrier septum formation

• Conflicting conventions [6%]
{GTERM endothe} (partial listing)
   GO:0003100 -- regulation of systemic arterial blood pressure by endothelin
   GO:0004962 -- endothelin receptor activity

• Punctuation [3%]
   GO:0016653 -- oxidoreductase activity, acting on NADH, heme protein as acceptor
   GO:0016658 -- oxidoreductase activity, acting on NADH, flavin as acceptor
   GO:0050664 -- oxidoreductase activity, acting on NADH, with oxygen as acceptor

   GO:0043247 -- telomere maintenance in response to DNA damage
   GO:0042770 -- DNA damage response, signal transduction
                    True Positives (4)
• “Grab bag”
  – Lexical choice
     • “within” vs. “in”
     • “substrate-specific” vs. “substrate-dependent”
  – Superfluous words like “other”
False positive breakdown
              False positive cluster examples
• Semantic import of stopword [50%]
{CTERM GTERM levels modulat symbion} (partial listing)
   GO:0052430 -- modulation by host of symbiont RNA levels
   GO:0052018 -- modulation by symbiont of host RNA levels

{CTERM CTERM galacto GTERM}
   GO:0033580 -- protein amino acid galactosylation at cell surface
   GO:0033582 -- protein amino acid galactosylation in cytosol
   GO:0033579 -- protein amino acid galactosylation in endoplasmic reticulum

{callose deposit GTERM}
     GO:0052542 -- callose deposition during defense response
     GO:0052543 -- callose deposition in cell wall
                               False positives (2)
• Non-parallel structure [27%]
{CTERM CTERM}
   GO:0005204 -- chondroitin sulfate proteoglycan
   GO:0006088 -- acetate to acetyl-CoA
   GO:0015641 -- lipoprotein toxin

{GTERM GTERM GTERM} (partial listing)
   GO:0019896 -- axon transport of mitochondrion
   GO:0047496 -- vesicle transport along microtubule
   GO:0047497 -- mitochondrion transport along microtubule
   GO:0032066 -- nucleolus to nucleoplasm transport
   GO:0052067 -- negative regulation by symbiont of entry into host cell via phagocytosis

{GTERM storage}
   GO:0001506 -- neurotransmitter biosynthetic process and storage
   GO:0000322 -- storage vacuole
                                False positives (3)
• Stemming [17%]
{regulat GTERM} (partial listing)
    GO:0045066 -- regulatory T cell differentiation
    GO:0045069 -- regulation of viral genome replication
    GO:0045055 -- regulated secretory pathway
    GO:0031347 -- regulation of defense response

• Syntactic variation [5%]
{GTERM mainten}
   GO:0045216 -- intercellular junction assembly and maintenance
   GO:0045217 -- intercellular junction maintenance
   GO:0045218 -- zonula adherens maintenance

• Semantic import of word order[5%]
 {GTERM CTERM activit}    apoptosis inhibitor activity
 {CTERM GTERM activit}    gibberellin binding activity
                         Conclusions
• Used simple term transformations and
    heuristic search
•   Able to reduce set of clusters to be manually
    evaluated to 190 (for 25k terms)
•   Identified 67 TP instances of univocality
    violations covering 317 GO terms
•   Future work
    – More specific linguistic alternations
    – Improve heuristics for TP search
                    GO as a lexical semantic resource
   • The Gene Ontology represents semantic relationships (is_a, part_of)
     between biological phrases representing molecular functions/processes
   • Utilize the structure of the GO and lexical correspondences to infer
     relationships at the term level from relationships between phrases




Verspoor, C., C. Joslyn and G. Papcun (2003). "The Gene Ontology as a Source of Lexical Semantic Knowledge for a Biological
Natural Language Processing Application". In Proceedings of the SIGIR'03 Workshop on Text Analysis and Search for Bioinformatics.
                  Inferring Lexical Relations from GO
                           Parallel rule:
                           vanillin metabolism isa aldehyde metabolism ⇒
                           vanillin isa aldehyde
                           lipoprotein biosynthesis isa lipoprotein metabolism ⇒
                           biosynthesis isa metabolism

                           Modifier rule: blocking rule for modifiers
                           Positive gravitactic behavior isa gravitactic behavior ⇒ Ø
                           Larval feeding behavior (sensu insecta) isa Larval feeding
                           behavior ⇒ Ø

                           Insertion rule: right-branching heuristic
                           adult feeding behavior isa adult behavior ⇒
                           feeding behavior isa behavior
                           chemosensory jump behavior isa chemosensory behavior
                           ⇒jump behavior isa behavior
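A simplified reading of the three rules, written as a single function over a child phrase and its isa parent. This is a sketch based only on the examples above, not the implementation from Verspoor et al. (2003); the function name and the order in which the rules are tried are assumptions.

```python
def infer_relation(child, parent):
    """Infer a term-level isa pair from a phrase-level isa pair, or None."""
    c, p = child.lower().split(), parent.lower().split()

    # Modifier rule (blocking): the child only adds words before or
    # after the complete parent phrase, so nothing is inferred.
    if len(c) > len(p) and (c[-len(p):] == p or c[:len(p)] == p):
        return None

    # Parallel rule: same length, exactly one differing position.
    if len(c) == len(p):
        diffs = [(x, y) for x, y in zip(c, p) if x != y]
        return diffs[0] if len(diffs) == 1 else None

    # Insertion rule (right-branching): strip the shared left context,
    # then relate what remains on each side.
    k = 0
    while k < min(len(c), len(p)) and c[k] == p[k]:
        k += 1
    c_rest, p_rest = c[k:], p[k:]
    if k > 0 and p_rest and len(c_rest) > len(p_rest) and c_rest[-len(p_rest):] == p_rest:
        return (" ".join(c_rest), " ".join(p_rest))
    return None
```

For example, `infer_relation("vanillin metabolism", "aldehyde metabolism")` yields `("vanillin", "aldehyde")`, and `infer_relation("adult feeding behavior", "adult behavior")` yields `("feeding behavior", "behavior")`.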

Verspoor et al.
(2003)
                  Relations inferred (with counts)

     Count  Relation                             Count  Relation
     581    biosynthesis isa metabolism          14     inhibitor isa regulator
     577    catabolism isa metabolism            13     ribonucleotide isa nucleotide
     44     receptor isa binding                 11     proliferation isa activation
     38     deoxyribonucleoside isa nucleoside   11     differentiation isa activation
     35     ribonucleoside isa nucleoside        11     deoxyribonucleotide isa nucleotide
     33     permease isa transporter             10     rRNA isa RNA
     27     Saccharomyces isa Fungi              10     mRNA isa RNA
     22     porter isa transporter               9      snRNA isa RNA
     15     oxidation isa metabolism             8      modification isa metabolism
     14     tRNA isa RNA                         8      methylation isa modification



 6,364 unique relations inferred; only 70 already exist in the GO
 3,270/6,589 unique labels inferred that do not occur in the GO
 as terms



                  A portion of the induced
                          network




                            773 trees in the induced hierarchy
                            • 669 depth 2, 69 depth 3
                            • max depth 10, “biosynthesis”




              Other uses of ontological
                     structure
• Use of subsumption hierarchy to fill in
    examples for machine learning
•   Establish semantic (functional) similarity
    – Graph distance
    – Information content
• Inference, reasoning
• Constraints for information extraction (next
    time …)

				