Bioinformatics Databases
Amandeep S. Sidhu
Data Mining in Bioinformatics (Week 2)
Outline
Biological Data
Database Growth
Challenges of Large Databases
Genes
Gene Databases
Proteome
Protein Databases
Other Data
Data Searching - ANGIS
Summary
Exercises
Biological Data
DNA and protein sequences are annotated
  Source
  Organism
  Function
  Updates
  Etc.
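To make the idea of an annotated sequence concrete, here is a minimal sketch that separates the annotations from the sequence itself. The record layout, the field names, and every value in it are invented for illustration; real databases such as GenBank use richer flat-file formats.

```python
# Hypothetical, simplified record: the identifier, function, date, and
# sequence are invented for illustration and come from no real database.
RECORD = """\
ID          EXAMPLE0001
ORGANISM    Homo sapiens
FUNCTION    hypothetical enzyme
UPDATED     2006-01-01
SEQUENCE    ATGGCCATTGTAATGGGCCGCTGA
"""


def parse_record(text):
    """Split each 'KEY value' line into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


record = parse_record(RECORD)
# Everything except the raw sequence is annotation.
annotations = {k: v for k, v in record.items() if k != "SEQUENCE"}
print(annotations)
print(len(record["SEQUENCE"]), "bases")
```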
Database Growth
Challenges of Large Databases
Storage
  Indexing, physical layout, memory management
Modeling
  Relational, hierarchical, semi-structured
Update, query, analysis
Visualization
Efficiency
Interpretation
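As one illustration of the indexing challenge, the sketch below builds a small in-memory k-mer index so that a short substring can be located without scanning every sequence. The function name `build_kmer_index` and the toy sequences are assumptions made for this example; production systems use far more compact, disk-aware structures.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


# Invented toy sequences, for illustration only.
sequences = {
    "seq1": "ATGGCCATTG",
    "seq2": "CCATTGTAAT",
}

index = build_kmer_index(sequences, k=4)
# A lookup is now a single dictionary access instead of a scan.
hits = index.get("CATT", [])
print(hits)
```

The trade-off is memory: the index stores one entry per position per sequence, which is exactly the storage and memory-management pressure the slide refers to.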
What is a Gene?
The physical and functional unit of heredity that carries information from one generation to the next
The DNA sequence necessary for the synthesis of a functional protein or RNA molecule
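The path from a DNA sequence to a protein can be sketched in a few lines. This is a simplified illustration, assuming the input is the coding strand, reading starts at position 0, and the standard genetic code applies; the function names and the example sequence are chosen for this sketch only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Short illustrative sequence, not taken from a real gene.
dna = "ATGGCCATTGTAATGGGCCGCTGA"
mrna = transcribe(dna)
protein = translate(dna)
print(mrna)
print(protein)
```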
Genome
Chromosomal DNA of an organism
Number of chromosomes and genome size vary significantly from one organism to another
Genome size and number of genes do not necessarily determine organism complexity
Genome Comparison
ORGANISM                              CHROMOSOMES   GENOME SIZE (bp)   GENES
Homo sapiens (Humans)                 23            3,200,000,000      ~30,000
Mus musculus (Mouse)                  20            2,600,000,000      ~30,000
Drosophila melanogaster (Fruit Fly)   4             180,000,000        ~18,000
Saccharomyces cerevisiae (Yeast)      16            14,000,000         ~6,000
Zea mays (Corn)                       10            2,400,000,000      ???
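The figures in the comparison can be turned into gene density (genes per million base pairs), which makes it easier to see that genome size and gene count do not track each other. This is a rough sketch using only the approximate values from the table; the corn gene count is left as unknown, as in the source.

```python
# (genome size in base pairs, approximate gene count) from the table above.
# None marks the gene count shown as "???" in the table.
genomes = {
    "Homo sapiens": (3_200_000_000, 30_000),
    "Mus musculus": (2_600_000_000, 30_000),
    "Drosophila melanogaster": (180_000_000, 18_000),
    "Saccharomyces cerevisiae": (14_000_000, 6_000),
    "Zea mays": (2_400_000_000, None),
}

density = {}
for organism, (size_bp, genes) in genomes.items():
    if genes is None:
        density[organism] = None  # cannot compute without a gene count
    else:
        density[organism] = genes / (size_bp / 1_000_000)  # genes per Mb

for organism, value in density.items():
    shown = "unknown" if value is None else f"{value:.1f}"
    print(f"{organism:28s} {shown:>8s} genes/Mb")
```

On these figures yeast packs far more genes per megabase than human or mouse, even though its genome is more than two hundred times smaller.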
Genome Databases
NCBI - http://www.ncbi.nlm.nih.gov/
GenBank - http://www.ncbi.nlm.nih.gov/Genbank/
SRS - http://srs.ebi.ac.uk/
GDB - http://gdbwww.gdb.org/
OMIM - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
…
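Besides the web pages listed above, NCBI databases can be queried programmatically. The sketch below only builds a search URL and sends no request. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned on the slide, and the search term and the helper name `build_esearch_url` are examples chosen for this sketch.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not part of the original slide).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Build a search URL for one Entrez database without sending a request."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_BASE + "?" + urlencode(params)


# Example query against OMIM; the search term is illustrative only.
url = build_esearch_url("omim", "cystic fibrosis")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the matching record identifiers; that step is omitted here so the example runs offline.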
Proteome
The complete collection of proteins that can be produced by an organism
Can be studied as either a static (sum of all proteins possible) or a dynamic (all proteins found at a specific time point) entity
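The static and dynamic views can be sketched with sets. The protein identifiers and time points below are invented. Note that the union of observed proteins is only a lower-bound estimate of the static proteome, since proteins never expressed in the sampled conditions would be missed.

```python
# Hypothetical protein identifiers observed at three time points;
# all names are invented for illustration.
observed = {
    "t0": {"P1", "P2", "P3"},
    "t1": {"P2", "P3", "P4"},
    "t2": {"P1", "P5"},
}

# Dynamic view: the proteins present at one specific time point.
dynamic_t1 = observed["t1"]

# Static view (estimate): every protein seen at any time point.
static_estimate = set().union(*observed.values())

print(sorted(dynamic_t1))
print(sorted(static_estimate))
```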
Proteome Databases
PDB & WWPDB
  http://www.rcsb.org/pdb/
  http://www.wwpdb.org/index.html
PIR, SWISS-PROT & UniProt
  http://pir.georgetown.edu/home.shtml
  http://au.expasy.org/sprot/
  http://www.expasy.uniprot.org/
InterPro
  http://www.ebi.ac.uk/interpro/
SCOP
  http://scop.mrc-lmb.cam.ac.uk/scop/
Other Data Annotations
PubMed
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
GO
  http://www.geneontology.org/
PO
  http://proteinontology.info/
  http://proteomeontology.org/
Data Searching
ANGIS
http://www.angis.org.au/
ANGIS Demo Login Details
Summary
Bioinformatics is truly interdisciplinary
  Biology (natural sciences), informatics, mathematics & statistics
Databases
  Large, semistructured, incomplete, inaccurate
Wide range of problems
  Solutions combine knowledge from the sciences with algorithms and models from informatics, mathematics, and statistics
Exercises