Introduction to Bioinformatics
Birth of bioinformatics

Why Bioinformatics
A lot of information is generated by all the genome sequencing projects.

DNA sequence
>gi|49175990|ref|NC_000913.2| Escherichia coli K12, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT
CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG
CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC
ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG
CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA
ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC
TACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGCTGGCCATTATCTCGGTGG
TAGGTGATGGTATGCGCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATAT
CAACATTGTCGCCATTGCTCAGGGATCTTCTGAACGCTCAATCTCTGTCGTGGTAAATAACGATGATGCG
ACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTG
GCGTCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACA
………

Computer needed!
1. The storage and retrieval
2. The analysis

First some history

The pre-70’s: Pioneering Computational Studies
–  In 1866 a monk called Gregor Mendel published his theories of heredity.
–  In 1953 Watson and Crick proposed the structure of DNA, not forgetting the initial work by Rosalind Franklin.
–  These fundamental problems, and others from the early days of molecular biology, presented marvelous algorithmic problems.
–  In parallel, fundamental computer science emerged in the 1950s and 1960s.
–  The definition of grammars by Chomsky.
–  In 1955 F. Sanger announced the sequence of the first protein to be analyzed, bovine insulin.
–  In 1965 Dayhoff’s Atlas of Protein Sequences was created (the first sequence database).
–  Even before the field was called bioinformatics, some people already used computer algorithms to solve molecular biology problems. (The first sequence alignment algorithm was created in 1970 by Needleman and Wunsch.)
Ref: Ouzounis C.A. and Valencia A. Early bioinformatics: the birth of a discipline--a personal view. 
 Bioinformatics. 2003 Nov 22;19(17):2176-90.

The DNA structure
•  In the 50’s many people were trying to find the structure of DNA. It had already been suggested by 1943 that DNA was the genetic material, but the structure was needed to understand how it worked.
•  Based on the findings of many researchers, and especially on the work of Rosalind Franklin, who suggested that DNA was in a helix form, Watson and Crick proposed the DNA structure in 1953.

Photograph 51
•  Rosalind Franklin's X-ray photograph of B-DNA (the double helix)

In Computer Science
•  In the 50’s and 60’s fundamental computer science emerged.
–  The theory of computation (Chaitin, 1966)
–  The definition of grammars by Chomsky
–  Etc…

Chomsky Grammars
•  Also in the 50’s, Noam Chomsky (Chomsky, 1957) proposed a theory for modeling strings of symbols: the “Chomsky hierarchy of transformational grammars”. It was developed to understand natural language, but these grammars became very important in theoretical computer science because computer languages can be specified with a formal grammar (Durbin et al, 1998). His research influenced not only linguistics but also computer science (a grammar can describe a language, whether natural or artificial).

Formal Grammar
•  The Chomsky hierarchy consists of the following levels:
–  Regular grammars
–  Context-free grammars
–  Context-sensitive grammars
–  Unrestricted grammars

•  Regular grammars are widely used in bioinformatics, through regular expressions, to find motifs in sequences etc.
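As an illustration of how a regular expression can find a motif in a sequence, here is a minimal Python sketch. The pattern used (the PROSITE-style N-glycosylation motif N-{P}-[ST]-{P}), the function name `find_motif` and the example protein string are illustrative assumptions and do not come from the slides.

```python
import re

# PROSITE-style motif N-{P}-[ST]-{P} written as a regular expression:
# N, then any residue except P, then S or T, then any residue except P.
# The lookahead (?=...) lets overlapping occurrences be reported too.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(pattern, sequence):
    """Return (start position, matched text) for every occurrence of the motif."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    protein = "MKNGTAPNPSANLSQ"  # hypothetical sequence, for illustration only
    for start, hit in find_motif(N_GLYCOSYLATION, protein):
        print(start, hit)
```

Positions are 0-based. The same approach works for DNA motifs by changing the pattern.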

The Needleman-Wunsch Algorithm
•  The Needleman-Wunsch algorithm (1970) performs a global alignment of 2 biological sequences with dynamic programming.
•  It is used to find the optimal set of diagonals in the comparison of the two sequences.
•  It is slow, but it guarantees a mathematically optimal alignment.
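The dynamic programming idea can be sketched in a few lines of Python. This is a simplified illustration, not the exact 1970 formulation: the scoring values (match +1, mismatch -1, gap -1), the linear gap penalty and the function name `needleman_wunsch` are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # align two letters
                score[i - 1][j] + gap,                                 # gap in b
                score[i][j - 1] + gap,                                 # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row1, row2 = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(row1)
    print(row2)
```

The table has one cell per pair of prefixes, so the running time grows with the product of the two sequence lengths. That is why the method is slow on long sequences, even though the result is optimal for the chosen scores.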

The 70’s The Theoretical Foundations
–  An agenda for computational problems in molecular biology had already been formulated.
–  RNA structure prediction (Tinoco et al. 1971)
–  The string comparison problem in computer science, or sequence alignment in biology, developed hand-in-hand with applications to biological macromolecules (Sankoff and Cedergren 1973).
–  Sequence alignment is a big area of research.
–  Protein folding problem.
–  Compilation of computer archives for storage, curation and distribution of protein sequence and structure (Dayhoff 1978 and Bernstein et al 1977)
What is missing? The central reference data and software resources.

Ref: Ouzounis C.A. and Valencia A. Early bioinformatics: the birth of a discipline--a personal view. 
 Bioinformatics. 2003 Nov 22;19(17):2176-90.

RNA 2D structure prediction
•  Working hypothesis: the native secondary structure of an RNA molecule is the one with the minimum free energy.
•  Restrictions:
–  No knots
–  No close base pairs
–  Base pairs: A-U, C-G and G-U

Tinoco
•  Tinoco-Uhlenbeck postulate:
–  Assumption: the free energy of each base pair is independent of all the other pairs and the loop structures.
–  Consequence: the total free energy of an RNA is determined by summing the energy contributions of all base pairs, loops, hairpins, etc.
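Figure: folded secondary structure of the RNA rCCUUGAGGAACACCAAAGGGG, with each structural element labelled with its free-energy contribution (-1.3, -3.3, -2.1, -0.6, 1.0, -3.3, -1.1 and 5.6 kcal/mol).

∆G37 = -1.3 - 3.3 - 2.1 - 0.6 + 1 - 3.3 - 1.1 + 5.6 = -5.1 kcal/mol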
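The additive rule of the Tinoco-Uhlenbeck postulate can be checked with a short Python sketch that sums the contributions listed on the slide. The values are taken from the figure. Which structural element each value belongs to is not recoverable from the extracted text, so the list is unlabelled, and the function name `total_free_energy` is an assumption.

```python
# Free-energy contributions (kcal/mol) in the order they appear in the slide's sum.
CONTRIBUTIONS = [-1.3, -3.3, -2.1, -0.6, 1.0, -3.3, -1.1, 5.6]


def total_free_energy(contributions):
    """Total free energy under the additive assumption: a plain sum of the terms."""
    return sum(contributions)


if __name__ == "__main__":
    print(f"dG37 = {total_free_energy(CONTRIBUTIONS):.1f} kcal/mol")
```

Negative terms stabilize the structure and positive terms destabilize it. The predicted structure is the candidate with the lowest total.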

Sequence alignment
•  Origin
–  The composition of a biological sequence is a genetically inherited feature.
•  Goal
–  Deduce functional, structural and phylogenetic relations between sequences.
–  Build computer science tools for database search, phylogeny reconstruction, assembly etc…
•  Searching for sequence similarity is the most common task in sequence analysis.

Sequence alignment
•  Definition
–  A linear comparison of amino (or nucleic) acid sequences in which insertions are made in order to bring equivalent positions in adjacent sequences into the correct register. Alignments are the basis of sequence analysis methods, and are used to pin-point the occurrence of conserved motifs. (www.biochem.ucl.ac.uk/bsm/dbbrowser/jj/glossary.html)
–  The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms. (www.geocities.com/pribond/bioinfo/glossary/sequencing.htm)

DNA = text
cable, able, table, rope
Which words are more similar?
•  cable, able, table: higher similarity in the text.
•  cable, rope: higher similarity in the meaning.

DNA = text
•  Text similarity is more suitable for protein or DNA sequences than for English words because, unlike in the English language, the order of the letters in a protein or DNA sequence implies a function.
ACGCTGGCTGCTG
ACG--GGGTGCTG

ACCTGTGTTTTT
A------TTTTT

Of the two alignments, the first pair is more similar: it has fewer gaps and more matching positions.
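The comparison of the two alignments above can be made quantitative by counting matches, mismatches and gaps column by column. The sketch below is a minimal illustration. The function name `alignment_stats` and the choice of statistics are assumptions, and only the four aligned strings come from the slide.

```python
def alignment_stats(row1, row2):
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches, "gaps": gaps,
            "identity": matches / len(row1)}


if __name__ == "__main__":
    first = alignment_stats("ACGCTGGCTGCTG", "ACG--GGGTGCTG")
    second = alignment_stats("ACCTGTGTTTTT", "A------TTTTT")
    print(first)
    print(second)
```

Here identity is defined as matches divided by alignment length. Other definitions exist (for example dividing by the shorter sequence length), so the exact percentage depends on that choice.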

The 80’s More Algorithms and Resources
–  In the 80’s the field became an independent discipline.
–  We now know that computer analysis of nucleotide sequences is essential to better understand the biology.
–  Big progress in DNA and RNA structure analysis (Trifonov and Sussman 1980).
–  Development in sequence comparison with the Smith-Waterman dynamic programming algorithm (Smith and Waterman 1981).
–  First Santa Fe conference to assess the feasibility of a Human Genome Initiative.
–  Progress in database quality control and collection (GenBank, EMBL and SWISSPROT).
–  Protein structure analysis experienced significant growth (stereo diagrams, domain definition and hydrophobicity plots).
–  The National Center for Biotechnology Information (NCBI) is created at NIH/NLM. http://www.ncbi.nlm.nih.gov/

Ref: Ouzounis C.A. and Valencia A. Early bioinformatics: the birth of a discipline--a personal view. 
 Bioinformatics. 2003 Nov 22;19(17):2176-90.

NCBI
•  “Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.” (NCBI website)

•  Entrez is a retrieval system designed for searching several linked databases. (NCBI website)

ENTREZ

GenBank
Figure: growth of GenBank from December 1982 (600 sequences) to September 2005, plotted as base pairs of DNA (billions) and sequences (millions).
August 15 2008 release: 95033791652 bases, from 92748599 reported sequences

90’s - today
–  The field grew fast:
•  In the 80’s GenBank had around 600 sequences.
•  In the 90’s GenBank had 79 thousand sequences.
•  In 2006 it had more than 61 million sequences.
•  In 2008 it had more than 92 million sequences.

–  Tools like BLAST (Basic Local Alignment Search Tool) became available (Altschul et al 1990).
–  Wide use of the GCG (Genetics Computer Group) software.
–  Gene prediction programs.
–  In 1995 the first (nonviral) whole genome was sequenced (for the bacterium Haemophilus influenzae).
–  In 1996 the genome of Saccharomyces cerevisiae (baker's yeast, 12.1 Mbp) was sequenced.
–  In 1997 the genome of E. coli (4.6 Mbp) was published.
–  In 2001 the first draft of the human genome was published in Science. The genome was completed in 2003: around 20-25 thousand genes and 3 billion bp.

Ref: Ouzounis C.A. and Valencia A. Early bioinformatics: the birth of a discipline--a personal view. 
 Bioinformatics. 2003 Nov 22;19(17):2176-90.

BLAST
•  The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families. (NCBI website)

BLAST Result

The Start!!

Human Genome project

The Human genome project
•  The human genome project
–  NIH (National Institutes of Health) and DOE (Department of Energy) started the Human Genome Project in the US in 1990.
–  Outside the US, genome projects took root in many countries: UK, Italy, USSR (Russia), Canada, France etc…
–  The Human Genome Organization was founded to coordinate all this work among scientists throughout the world.
–  The first draft of the human genome was announced in June 2000 and published in 2001.
–  A better quality draft was published in 2003.

Ref: Human genome project: http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

Why?
•  Genome projects improve the understanding of biological systems.
•  Some current and potential applications of genome research include:
–  Molecular medicine
   •  Earlier detection of genetic predispositions to disease
–  Bioarchaeology, anthropology, evolution, and human migration
   •  Study evolution
–  DNA forensics (identification)
   •  Exonerate persons wrongly accused of crimes
–  Agriculture, livestock breeding, and bioprocessing
   •  Healthier, more productive, disease-resistant farm animals

Ref: tree of life web project: http://tolweb.org/tree/phylogeny.html
Ref: Human genome project: http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

Genome projects

Why Bioinformatics
A lot of information is generated by all the genome sequencing projects.

Remember: August 15 2008 release of 1 database (GenBank) = 92748599 reported sequences

Computer needed!
1. The storage and retrieval
2. The analysis

1: Storage and Retrieval
•  Storage of data and retrieval = database
–  The first: Dayhoff
–  GenBank
–  Etc…

2: Analyze
•  Sequence alignment
•  Phylogenetic analysis
•  Genome analysis
•  Structure analysis
•  The others…

Why?
•  Cutting-edge tools for extracting and representing important features in biological sequences are needed. Those features can reveal:
–  Evolutionary history
–  Conserved motifs
–  Common 2D or 3D structure
–  Similar function
•  Those features can be used to characterize protein families.
•  Those features can be used in database searches to identify potential members of a family.

Example
•  Hemoglobin is present in diverse organisms, from humans to insects.
•  Its function is the same: it binds and transports oxygen.
•  There has been a big divergence between species during evolution.
•  An alignment between human and chimpanzee shows an exact match.
•  An alignment between human and an insect shows very little similarity.
•  BUT they have the same function!
•  A pairwise alignment of human and chimpanzee doesn’t give clues about the important conserved features.
•  A multiple alignment will do it!

Example
•  CD52 antigen
Human vs Monkey

Multiple alignment of the protein

Phylogenetics
•  Phylogenetics often makes use of numerical data (numerical taxonomy), which can be scores for various “character states”, such as the size of a visible structure, or DNA sequences.
•  Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.
•  In an alignment of DNA sequences, each position is a separate character with four possible character states: the four nucleotides.

Phylogeny based on skeleton

From Tree of life web project : http://www.tolweb.org/tree/

•  Shared characteristics that define a grouping
•  Phylogeny based on observed characteristics of the skeleton

A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..

Each nucleotide difference is a character
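To show how such differences can be quantified, the sketch below expands the dot notation of the character matrix and counts the differing positions between each pair of sequences. It assumes that each extracted line of the slide is one alignment column with rows A, B, C and D, and that "." means "same nucleotide as sequence A". The function names are illustrative.

```python
from itertools import combinations

# Character matrix transcribed from the slide; "." means "same as sequence A".
DOTTED = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(reference, dotted):
    """Replace every '.' by the reference nucleotide at the same position."""
    return "".join(r if d == "." else d for r, d in zip(reference, dotted))


def load_sequences(dotted_rows, reference_name="A"):
    """Turn the dotted rows into full sequences (spaces removed)."""
    reference = dotted_rows[reference_name].replace(" ", "")
    return {name: expand(reference, row.replace(" ", ""))
            for name, row in dotted_rows.items()}


def count_differences(x, y):
    """Number of positions (characters) at which two aligned sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def distance_table(sequences):
    """Pairwise difference counts for every pair of sequence names."""
    return {(p, q): count_differences(sequences[p], sequences[q])
            for p, q in combinations(sorted(sequences), 2)}


if __name__ == "__main__":
    for pair, d in distance_table(load_sequences(DOTTED)).items():
        print(pair, d)
```

A table of pairwise difference counts like this one is the usual starting point for distance-based tree building, where the pair with the fewest differences is grouped first.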


Phylogeny and bioinformatics
•  Sequence analyses such as pairwise alignment and multiple alignment all use evolutionary theory as the basis of their assumptions.
•  After working with sequences for a while, one develops an intuitive understanding that, for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified.
•  Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.

What can be inferred from phylogenetic trees
–  Which species are the closest living relatives of modern humans?
–  Did the Florida dentist infect his patients with HIV?
–  African Eve?
   •  Analysis of mitochondrial DNA from 182 individuals
–  Map pathogen strain diversity for vaccines.
   •  Influenza
      –  Study rapidly changing genes
      –  Next year’s strain can be predicted
      –  Flu vaccination can be developed

From Tree of life web project : http://www.tolweb.org/tree/

Did the Florida Dentist infect his patients with HIV?
Figure: phylogenetic tree of HIV sequences from the dentist, his patients (A to G), and local HIV-infected people (local controls).
From Ou et al. (1992) and Page & Holmes (1998)

Yes:
The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

No:
Two patients are marked "No" in the figure.

Comparative Genomics

Max Planck Institute for Evolutionary Anthropology http://www.macdevcenter.com/lpt/a/4973

What is comparative genomics
•  Analysis and comparison of genomes among different species.
•  The geographical maps analogy:
–  Apartments, roads, shops, restaurants etc…
–  2 cities have similarities and differences.
–  Some differences can be important for the survival of the city; others may only be a trace of something that is not important anymore.
•  Comparing 2 genomes is similar.

Which genomes to compare

Do we compare human with chimpanzee or with mouse?

•  We can learn more if we compare the human to a more distant relative like the mouse.
•  We are not so different from the mouse. We have 4 limbs and 2 eyes. Our internal processes are similar, and we both enjoy a nice piece of cheese.
•  The genes responsible for these shared traits should stand out against a background of dissimilarities.

Why the structure?

Because structure = function

2D and 3D Structures of RNA: Transfer RNA Structures
Figure: secondary structure and tertiary structure of a transfer RNA, with the TψC loop, D loop, variable loop, anticodon stem and anticodon loop labelled.

2: Analyze
•  Sequence alignment
•  Phylogenetic analysis
•  Genome analysis
•  Structure analysis
•  The others: microarrays, genome assembly, protein folding, network analysis, ontology, personalized medicine and many others…

Conclusion
•  “Bioinformatics is an independent discipline with solutions of biological problems but with its own problems and solutions.”
•  “This discipline will continue to evolve rapidly into the 21st century.”
•  “Merging with nanotechnology, computing with biological matter is expected to transform our own lives and life on earth.”
