					BK21 BT·IT Integrationist Program



      The Sixth Sino-Japan-Korea Bioinformatics Training Course
                     Shanghai, March 27-30, 2007


           Omics data integration & mining


                          2007. 3. 29
                        Sangsoo Kim &
                      KOBIC Omics Team
What is the goal of Biosciences?

• Ultimately, the complete understanding of life
  phenomena
  –   Complex organization
  –   Regulatory mechanism (homeostasis)
  –   Growth & development
  –   Energy utilization
  –   Response to the environmental stimuli
  –   Reproduction (DNA guarantees exact replication)
  –   Evolution (capacity of species to change over time)
Spider Silk: Stronger than Steel
• Life’s diversity results from
  the variety of molecules in
  cells
• A spider’s web-building
  skill depends on its DNA
  molecules
• DNA also determines the
  structure of silk proteins
  – These make a spiderweb
    strong and resilient
• The capture strand contains a single
  coiled silk fiber coated with a sticky fluid
• The coiled fiber unwinds to capture prey and then recoils rapidly
  [Figure labels: coiled fiber of silk protein; coating of capture strand]
 Evidence from flagelliform silk cDNA for the structural
  basis of elasticity and modular nature of spider silks
        J Mol Biol. 1998 Feb 6;275(5):773-84

• They report the cloning of substantial
  cDNA for flagelliform gland silk protein,
  which forms the core fiber of the
  catching spiral
• The dominant repeat of this protein is
  Gly-Pro-Gly-Gly-X, which can appear
  up to 63 times in tandem arrays
• They propose that the spring-like helix
  is the basis for the elasticity of silk
Central dogma of molecular biology




 DNA  →  RNA  →  protein
Paradigm Shift in Biosciences

• So far, biologists have focused on certain
  phenotypes and hunted the genes
  responsible, one at a time
• New trend is
  – Catalog all the parts: genes and proteins
    (Genomics & Proteomics)
  – Understand how each part works
    (Functional Genomics)
  – Model & simulate the collective behavior of
    the parts (Systems Biology)
Central dogma of molecular biology

  DNA      →  RNA            →  protein
  genome      transcriptome     proteome

 Central dogma of bioinformatics and genomics
[Figure: growth of DNA sequence data by year, 1982-2002. Axes: sequences (millions) and base pairs of DNA (billions)]
   With $1,000 genome sequencing
technologies in 10 years coupled with
 functional data, we need better IT
               solutions!
[Figure: GenBank growth. Base pairs per year on a log scale from 1.E+05 to 1.E+11, 1982-2003]
   Proliferation of Genomics
• Explosion of data
  – Human genes: 25,000
  – Human genome: 3 x 10^9 bp
  – DNA-protein or protein-protein interactions
    could increase data dramatically
• Chimpanzee, mouse, rat, dog, cow,
  chicken, insects, worms, plants, fungi,
  algae, bacteria, archaea, viruses …
Genome Projects (385 finished, as of June 4, 2006)
• Ongoing projects
  – 608 eukaryotes
  – 989 prokaryotes
Top ten challenges for bioinformatics

[1] Precise models of where and when transcription
     will occur in a genome (initiation and termination)

[2] Precise, predictive models of alternative RNA splicing

[3] Precise models of signal transduction pathways;
    ability to predict cellular responses to external stimuli

[4] Determining protein:DNA, protein:RNA, protein:protein
    recognition codes

[5] Accurate ab initio protein structure prediction
Top ten challenges for bioinformatics

[6] Rational design of small molecule inhibitors of proteins

[7] Mechanistic understanding of protein evolution

[8] Mechanistic understanding of speciation

[9] Development of effective gene ontologies:
    systematic ways to describe gene and protein function

[10] Education: development of bioinformatics curricula

                                              Source: Ewan Birney,
                                              Chris Burge, Jim Fickett
Functional Genomics & Systems Biology

• New data types:
  – Sequences
  – Structures
  – High throughput expression profiles in (10,000 x
    100) matrix forms
  – Interactions, Pathways, Networks
• Mathematical modeling & simulation of
  biological processes
  – Algorithms
  – Graphical visualization
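
As a rough illustration of the matrix form mentioned above, the sketch below stores expression profiles as a small table and compares two profiles with a Pearson correlation. It assumes rows are genes and columns are samples (the slide only gives the dimensions), and the gene names and values are invented; a real matrix would be on the order of 10,000 x 100.

```python
from math import sqrt

# Toy expression matrix: rows are genes, columns are samples (hypothetical values).
samples = ["s1", "s2", "s3", "s4"]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Compare one gene against the others across all samples.
r_ab = pearson(expression["geneA"], expression["geneB"])
r_ac = pearson(expression["geneA"], expression["geneC"])
print(f"geneA vs geneB: {r_ab:.3f}")
print(f"geneA vs geneC: {r_ac:.3f}")
```

Genes with similar profiles score close to +1 and opposite profiles close to -1, which is the usual starting point for clustering such matrices.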
[Figure: panels labeled 18C, 19C, 20C]
      K-JIST
               Terminology

  DNA              Genome               Genomics
  RNA              Transcriptome        Transcriptomics
  Protein          Proteome             Proteomics
  Metabolite       Metabolome           Metabolomics


     More than 50-omes including “Unknownome”

                  Omics data
• In the Omics era, we see proliferation of
  genome/proteome-wide high throughput data
  that are available in public archives
  –   Comparative genome sequences
  –   Sequence variation & phenotypes
  –   Epigenetics & chromatin structure
  –   Regulatory elements & gene expression
  –   Protein expression, modification & localization
  –   Protein domain, structure, interaction
  –   Metabolic, signal, regulatory pathways
  –   Drug, toxicogenomics, toxicoproteomics
Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857
                 As an example,
• Suppose you are interested in how well CDK2
  transcriptional control is conserved. You may need
   – Orthologs in various model organisms
   – Genome alignments of promoter regions among
     phylogenetic cousins
        • Among mammals or vertebrates
        • Among yeast subspecies
   –   TRANSFAC-type TF binding database
   –   ChIP-chip data for each organism
   –   Orthology map of the TFs, and so on
   –   You may add proteome and interactome data
• Only some of these are available at NCBI
• The rest are available in the public domain as
  supplementary materials or at the authors’ web sites
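
The data-gathering steps listed above could be combined roughly as follows. This is a minimal sketch on made-up data: the TF names, the ortholog tables, and the ChIP-chip hits are hypothetical placeholders, and the conservation score (the fraction of species in which the orthologous TF binds the orthologous target) is only one simple way to summarize such data.

```python
# Illustrative ortholog map: species -> name of the CDK2 ortholog in that species.
target_orthologs = {"human": "CDK2", "mouse": "Cdk2", "chicken": "CDK2"}

# Hypothetical orthology map of TFs: TF -> {species: ortholog name}.
tf_orthologs = {
    "TF_A": {"human": "TF_A", "mouse": "Tf_a", "chicken": "TF_A"},
    "TF_B": {"human": "TF_B", "mouse": "Tf_b"},  # no chicken ortholog in this toy table
}

# Hypothetical ChIP-chip results: species -> set of (TF, bound target gene) pairs.
chip_hits = {
    "human": {("TF_A", "CDK2"), ("TF_B", "CDK2")},
    "mouse": {("Tf_a", "Cdk2")},
    "chicken": {("TF_A", "CDK2")},
}

def conservation(tf):
    """Fraction of species in which the TF ortholog binds the target ortholog."""
    bound = 0
    for species, target in target_orthologs.items():
        tf_name = tf_orthologs[tf].get(species)
        if tf_name is not None and (tf_name, target) in chip_hits.get(species, set()):
            bound += 1
    return bound / len(target_orthologs)

scores = {tf: conservation(tf) for tf in tf_orthologs}
for tf, score in scores.items():
    print(f"{tf}: binding conserved in {score:.0%} of species")
```

In practice each of the three tables would come from a different source, which is why a common identifier scheme matters before any such join.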
      Integration of Omics data
•   Systematic mining
•   Cross-knowledge domain validation
•   Cross-species interpolation
•   Generation of hypotheses that can be tested

⇒Biologically very interesting queries
⇒Requires cross-functional knowledge
⇒The way to go
               Organization of data
                             human   mammal   vertebrate   animal   eukaryote
Genome sequence
Chromatin structure
Transcription & regulation
Protein expression
PTM & localization
Structure & interaction
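
One possible way to hold such a grid in software is a catalog keyed by (data type, taxonomic scope). The sketch below is an assumption about how this could be organized, not the team's actual schema; the dataset names are placeholders.

```python
from collections import defaultdict

# Row and column labels taken from the grid above.
DATA_TYPES = [
    "Genome sequence", "Chromatin structure", "Transcription & regulation",
    "Protein expression", "PTM & localization", "Structure & interaction",
]
SCOPES = ["human", "mammal", "vertebrate", "animal", "eukaryote"]

# catalog[(data_type, scope)] -> list of dataset names filed in that cell.
catalog = defaultdict(list)

def register(data_type, scope, dataset):
    """Add a dataset name to one cell of the grid, rejecting unknown labels."""
    if data_type not in DATA_TYPES or scope not in SCOPES:
        raise ValueError(f"unknown cell: {data_type!r} / {scope!r}")
    catalog[(data_type, scope)].append(dataset)

def coverage():
    """Return {data_type: [count per scope]} to show which cells are still empty."""
    return {
        dt: [len(catalog.get((dt, s), [])) for s in SCOPES]
        for dt in DATA_TYPES
    }

# Hypothetical entries, for illustration only.
register("Genome sequence", "human", "hypothetical_assembly")
register("Genome sequence", "mammal", "hypothetical_multi_alignment")
register("Transcription & regulation", "human", "hypothetical_chip_chip_set")

for data_type, counts in coverage().items():
    print(f"{data_type:<28}" + "".join(f"{c:>11}" for c in counts))
```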
            Where to look for
• Nature provides an omics section
    – www.nature.com/omics
•   Science
•   Cell
•   PLoS Biology
•   Genes & Development
•   Stem Cell
+   Relevant articles (PubMed, Google Scholar)
ENCyclopedia Of DNA Elements
 (ENCODE) funded by NHGRI
NHGRI Current Topics in Genome Analysis 2006
ENCODE genomes to sequence
        Phase 1 of ENCODE
• NHGRI’s ENCODE project generates such
  data at a pilot scale
• The data are deposited and integrated into
  the UCSC Genome Browser
  – It offers data mining capability via Table Browser
  – There are no ‘biological links’ among the 3,000+
    tables (Ensembl’s BioMart is more ‘biological’)
  – It is up to the users how to combine the tables
  – It is limited to genomic coordinates, not intended
    for proteome work
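
Because the tables share only genomic coordinates, combining two of them usually comes down to an interval-overlap join. The sketch below shows the idea on two toy tables; the records and names are invented, and half-open intervals [start, end) are assumed.

```python
# Two toy "tables" of (chrom, start, end, name) records.
# Coordinates are made up and treated as half-open intervals [start, end).
conserved = [("chr7", 100, 300, "cons1"), ("chr7", 900, 1000, "cons2")]
pol2_sites = [("chr7", 250, 400, "pol2_a"), ("chr7", 500, 600, "pol2_b")]

def overlaps(a, b):
    """True if two records are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def join_by_overlap(left, right):
    """Return (left name, right name) for every overlapping pair of records."""
    return [(a[3], b[3]) for a in left for b in right if overlaps(a, b)]

pairs = join_by_overlap(conserved, pol2_sites)
print(pairs)
```

The nested loop is quadratic and only suitable for small tables; sorted sweeps or interval trees are the usual choice for genome-scale data.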
ENCODE Data Integrated in
 UCSC Genome Browser
A ~2 kb conserved, transcribable, Ac-histone,
 pol2-binding element in the 1st intron of ST7
Turned out to be a pseudogene!
And also duplicated in other parts of the genome!
Omics Dataset Example
       Application Examples




Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857
Protein-DNA Interaction & Transcriptomics
Yeast rich medium gene modules network
• ChIP-chip location and expression data
• 106 modules containing 655 genes regulated by 68 TFs
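
One way to picture how location (ChIP-chip) and expression data can be combined into a gene module is sketched below. It is a simplified single-pass illustration, not the published algorithm behind the figure: the p-value cutoff, the correlation threshold, the gene names, and the data are all assumptions.

```python
from math import sqrt

# Hypothetical ChIP-chip binding p-values for a single TF.
binding_pvalues = {"g1": 0.0001, "g2": 0.0005, "g3": 0.0008, "g4": 0.2}

# Hypothetical expression profiles across four conditions.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [4.0, 3.2, 1.9, 1.0],   # bound, but expressed in the opposite direction
    "g4": [1.0, 2.0, 3.0, 4.1],   # co-expressed, but not bound
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def find_module(binding, expr, p_cutoff=0.001, r_cutoff=0.8):
    """Keep bound genes whose profile correlates with the mean profile of bound genes."""
    bound = [g for g, p in binding.items() if p < p_cutoff]
    n_conditions = len(expr[bound[0]])
    centroid = [sum(expr[g][i] for g in bound) / len(bound) for i in range(n_conditions)]
    return sorted(g for g in bound if pearson(expr[g], centroid) >= r_cutoff)

module = find_module(binding_pvalues, expression)
print(module)
```

Requiring both kinds of evidence drops g3 (bound but not co-expressed) and g4 (co-expressed but not bound).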
Predicting Protein-Protein Interaction
   by combining multiple datasets
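
The slides do not spell out the method, so the following is only a generic sketch of one common way to combine evidence from several datasets: a naive Bayes scheme that multiplies per-dataset likelihood ratios, under the assumption that the datasets are conditionally independent. The dataset names, likelihood ratios, and prior odds are all invented.

```python
# Hypothetical likelihood ratios:
# P(observation | interacting) / P(observation | not interacting)
likelihood_ratios = {
    "coexpression": {True: 4.0, False: 0.5},
    "same_localization": {True: 3.0, False: 0.6},
    "two_hybrid": {True: 20.0, False: 0.9},
}
PRIOR_ODDS = 1 / 100  # assumed prior odds that a random protein pair interacts

def posterior_odds(evidence):
    """Multiply the prior odds by the likelihood ratio of each observation."""
    odds = PRIOR_ODDS
    for dataset, observed in evidence.items():
        odds *= likelihood_ratios[dataset][observed]
    return odds

def predict(evidence, cutoff=1.0):
    """Call a pair interacting when the posterior odds exceed the cutoff."""
    return posterior_odds(evidence) > cutoff

strong = {"coexpression": True, "same_localization": True, "two_hybrid": True}
weak = {"coexpression": True, "same_localization": True, "two_hybrid": False}
print(posterior_odds(strong), predict(strong))
print(posterior_odds(weak), predict(weak))
```

No single dataset in this toy example is enough to overcome the low prior; only the combination pushes the odds above the cutoff.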
            How to participate
• Domain knowledge group
  – Monitor papers and websites of relevant data
  – Collect the omics data and transform into common
    formats
  – Develop hypotheses & mining strategies
• Data integration group
  –   Develop DB schema
  –   Integration with bio-matrix & bio-engine
  –   Querying biological concepts
  –   Graphic visualization
Practice Session - Cytoscape
• Installation
  – One of the most widely used and broadly
    accessible software packages designed to
    facilitate omics data integration and
    analysis
• Tutorials
  – Interaction network display
  – Expression analysis
  – Literature searching
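
For the interaction network display tutorial, a network can be loaded from a plain-text file. The sketch below writes a toy network in Cytoscape's Simple Interaction Format (SIF), where each line holds a source node, an interaction type, and a target node. The node names and the output file name are placeholders; `pp` and `pd` are used here as labels for protein-protein and protein-DNA interactions.

```python
import os
import tempfile

# Toy edges as (source, interaction type, target); names are placeholders.
interactions = [
    ("protA", "pp", "protB"),
    ("protB", "pp", "protC"),
    ("protA", "pd", "geneD"),
]

def to_sif(edges):
    """Format edges as tab-delimited SIF text, one interaction per line."""
    return "\n".join("\t".join(edge) for edge in edges) + "\n"

sif_text = to_sif(interactions)

# Write to a temporary location; the resulting file can be imported as a network.
path = os.path.join(tempfile.gettempdir(), "toy_network.sif")
with open(path, "w") as handle:
    handle.write(sif_text)

print(sif_text, end="")
```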