Introduction to Bioinformatics
Changhui (Charles) Yan Old Main 401 F http://www.cs.usu.edu/~cyan
6/15/2006
Changhui Yan
1
How Old Is The Discipline?
"The term bioinformatics is a relatively recent invention, not appearing in the literature until 1991 … However, … had been building databases, developing algorithms and making biological discoveries by sequence analysis since the 1960s---long before anyone thought to label this activity with a special term … So bioinformatics has, in fact, been in existence for more than 40 years …" (Mark S. Boguski, Trends Guide to Bioinformatics, Elsevier, Trends Supplement 1998, p. 1)
What Is Bioinformatics?
Any use of computers to handle biological information
The use of computers to characterize biological molecules or to simulate the dynamics of molecules
The use of computers to store, compare, retrieve, or analyze biological information
Computational Biology, Proteomics, Genomics, Medical Informatics …
Bioinformatic Problems
Central Dogma
Genome
Bioinformatic Problems
Genome Sequencing
Human Genome Project (HGP)
To determine the sequences of the 3 billion bases that make up human DNA
To identify the approximately 100,000 genes in human DNA (the estimate was revised to 20,000-25,000 by Oct 2004)
To store this information in databases
To develop tools for data analysis
Human Genome Project (HGP)
HGP began in October 1990 and was completed in 2003
99% of the human DNA sequence finished to 99.99% accuracy (April 2003)
15,000 full-length human genes identified (March 2003)
Finished genome sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster (April 2003)
Post-genome era
Completely Sequenced Genomes
Genome Projects
More than 60 eukaryotic genome sequencing projects are underway
Genome Sequencing
Genome Sequencing
Difficulties due to
Repeats
Uncertainty
Missing data
Huge size!
…
Gene finding
Genome Sequencing Gene Finding
15,000 human genes identified
Estimated gene count: 100,000 (1990) … 20,000-25,000 (Oct 2004)
3 billion bases make up human DNA
Gene-finders
Sequence Alignment
Genome Gene Finding Sequence alignment
Longest Common Subsequences
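The longest common subsequence (LCS) of two sequences can be found with dynamic programming. The following is a minimal sketch, not taken from the original slides; the function name `lcs` and the example strings are illustrative assumptions.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (illustrative sketch)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("AGGTAB", "GXTXAYB"))
```

The table takes O(mn) time and space for sequences of length m and n. An LCS can be viewed as a simplified alignment in which only matches are rewarded.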
Sequence Alignment
Pair-wise Alignment
Multiple Sequence Alignment
Searching Databases
http://www.ncbi.nlm.nih.gov/BLAST/
Sequence Alignment
Global vs. Local
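The difference between global and local alignment can be illustrated by scoring the same pair of sequences both ways. This is a hedged sketch using the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty; the scoring values and the function name `alignment_scores` are assumptions chosen for illustration, not values from the slides.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    Global: every position of both sequences must be aligned (Needleman-Wunsch).
    Local: only the best-scoring pair of substrings counts (Smith-Waterman).
    The scoring parameters are illustrative assumptions.
    """
    m, n = len(a), len(b)
    g = [[0] * (n + 1) for _ in range(m + 1)]    # global table
    loc = [[0] * (n + 1) for _ in range(m + 1)]  # local table

    # Global alignment pays for leading gaps; local alignment does not.
    for i in range(1, m + 1):
        g[i][0] = i * gap
    for j in range(1, n + 1):
        g[0][j] = j * gap

    best_local = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            g[i][j] = max(g[i - 1][j - 1] + s,
                          g[i - 1][j] + gap,
                          g[i][j - 1] + gap)
            # The extra 0 lets a local alignment restart anywhere.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return g[m][n], best_local


if __name__ == "__main__":
    # Two toy sequences that share only a short region.
    print(alignment_scores("TTTTACGTTTTT", "GGACGGG"))
```

When two sequences share only a short region, the global score is dragged down by the unrelated flanks while the local score reflects just the shared region.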
Gene Expression
Genome Sequencing Gene Finding Sequence Alignment Gene Expression
Gene Expression
Protein Folding
Genome Sequencing Gene Finding Sequence Alignment Gene Expression Protein Structure
Protein Structure
Visualization of protein structure
Protein structure alignment
Protein structure prediction
Protein Structure Prediction
Comparative modeling: applicable if the sequence is similar to another one whose structure is known.
Fold recognition: in the absence of a significantly similar sequence with known structure, these methods try to determine how well a known structure fits the sequence to be modeled.
Ab initio prediction: can predict structures that have not yet been discovered; Monte Carlo search for the lowest energy.
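The idea of a Monte Carlo search for the lowest energy can be sketched with the Metropolis acceptance rule. Everything below is a toy: the "conformation" is a list of torsion angles in degrees and `toy_energy` is a made-up function with a known minimum, standing in for a real force field. All names and parameter values are assumptions for illustration only.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: zero when every angle equals 60 degrees."""
    return sum(1.0 - math.cos(math.radians(a - 60.0)) for a in angles)


def monte_carlo_minimize(energy, state, steps=5000, step_size=15.0,
                         temperature=0.1, seed=0):
    """Metropolis Monte Carlo search; returns (best_state, best_energy)."""
    rng = random.Random(seed)
    current = list(state)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    for _ in range(steps):
        # Propose a small random change to one angle.
        cand = list(current)
        i = rng.randrange(len(cand))
        cand[i] += rng.uniform(-step_size, step_size)
        e_cand = energy(cand)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-dE / T) so the search can escape local minima.
        if e_cand <= e_cur or rng.random() < math.exp(-(e_cand - e_cur) / temperature):
            current, e_cur = cand, e_cand
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best


if __name__ == "__main__":
    start = [180.0] * 5
    best, e_best = monte_carlo_minimize(toy_energy, start)
    print(toy_energy(start), e_best)
```

Real ab initio methods use far richer energy functions and move sets; the sketch only shows the accept/reject loop that drives the search.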
Protein Function Prediction
Genome Sequencing Gene Finding Sequence Alignment Gene Expression Protein Structure
Protein Function
Protein Function Prediction
“similar sequence-similar structure-similar function” paradigm
Identification of homologous sequences (BLAST, PSI-BLAST) (> 30% identity)
Identification of conserved functional sites (<= 30% identity)
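The 30% threshold refers to sequence identity between aligned sequences. As a rough sketch, percent identity can be computed from a pairwise alignment as follows. Several definitions of identity are in use; this one divides by the number of columns in which neither sequence has a gap, which is an assumption. The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns that contain no gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def suggested_strategy(identity: float, threshold: float = 30.0) -> str:
    """Map identity to the two strategies listed above (30% cutoff from the slide)."""
    if identity > threshold:
        return "homologous sequences"
    return "conserved functional sites"


if __name__ == "__main__":
    pid = percent_identity("AC-GT", "ACTGA")
    print(pid, suggested_strategy(pid))
```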
Conserved Functional Sites -- Motifs
[AG]-G-x(0,1)-[GAP]-x-N-x-[STA]-x(6)-[GS]-x(9)-G
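A pattern written in this PROSITE-style syntax can be translated into a regular expression and scanned against a protein sequence. The converter below is a minimal sketch that handles only the common elements (single residues, `x`, `[...]` sets, `{...}` exclusions, `(n)` and `(n,m)` repeats, and `<`/`>` anchors). The function name `prosite_to_regex` and the test sequence are hypothetical, not from the slides.

```python
import re

# One pattern element: optional '<', a core, optional repeat, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (sketch)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, lo, hi, end = m.groups()
        if core == "x":
            core = "."                        # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..} = any residue except these
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi is not None else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[AG]-G-x(0,1)-[GAP]-x-N-x-[STA]-x(6)-[GS]-x(9)-G")
    print(regex)
    # Made-up sequence constructed to contain one occurrence of the motif.
    seq = "MM" + "AGGKNKS" + "K" * 6 + "G" + "K" * 9 + "G" + "MM"
    hit = re.search(regex, seq)
    print(hit.start(), hit.end())
```

Each element maps to one regex token, so the translated expression stays close to the original pattern and is easy to check by eye.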
Motifs
Conserved Functional Sites -- Motifs
Single motif
PROSITE: a database of biologically significant sites
Conserved Functional Sites -- Motifs
Multiple motifs
PRINTS: a database of protein fingerprints. A fingerprint is a group of conserved motifs characterizing a protein function.
PRINTS
>ATHA_PIG
PRINTS
Conserved Functional Sites -- Motifs
Hidden Markov Model
Pfam:
Protein Interaction Network
Genome Gene Finding Sequence Alignment Gene Expression Protein Structure
Protein Function Protein Interaction Network
Protein Interaction Network
Protein Interaction Network
Bioinformatic Problems
Genome Gene Finding Sequence Alignment Gene Expression Protein Structure
Protein Function Protein Interaction Network
Bioinformatic Problems
There are more …
Phylogeny analysis: Tree of life
Databases and tools development
Bioinformatic Databases
GenBank (DNA sequences)
Protein Data Bank (Protein structures)
PIR (Protein sequences)
Nucleic Acids Research (2005): 719 databases
Bioinformatic Programs
Sequence analysis: BLAST, ClustalX, EMBOSS, GCG
Molecular imaging/modeling: PyMOL, MOLMOL, RasMol …