Introduction to Computational Genomics
Genome 373 Genomic Informatics William Stafford Noble
Outline
• Course logistics
• Introduction to DNA sequencing
Web page
• Web site:
http://noble.gs.washington.edu/~noble/genome373/
• Page has links to
  – Lecture notes
  – Handouts
  – Homework assignments
Team teaching
• Jim Thomas and I will (roughly) alternate weeks.
• Jim’s part of the class focuses on bioinformatics applications.
• My part of the class focuses on what goes on “under the hood.”
• Aaron Klammer (TA) will teach additional topics not covered in class.
• Material covered in section is required, and may be on the exams.
One-minute response
• Write for one minute at the end of each class.
• Evaluate the lecture, point out sections that are unclear, and make suggestions for next year.
• Do not save content questions for the one-minute response. Ask them in class.
• I will begin each class by responding to the previous set of one-minute responses.
• Sign your name. Anonymous evaluations will be done toward the end of the quarter.
Grading
• 50% weekly homeworks (handed out Wed. and due the following Wed.)
• 20% midterm exam in class, Fri, May 4
• 30% final exam, Wed, June 6
• Homeworks are a mix of written problems and programming.
• Exams are open book.
• Final exam is cumulative, with an emphasis on the latter half of the quarter.
• Grading is curved, with median = 3.6.
Programming
• The first several weeks of section will focus on learning to program.
• Aaron will teach Python programming.
• The first several homeworks will not include any required programming.
• During section tomorrow, we will schedule some additional programming practice sessions.
• You may turn in homework assignments in C, C++, Java, Perl, Python, or Matlab. For other languages, please ask.
Textbook
• The required textbook is Bioinformatics: Sequence and Genome Analysis by Mount.
• Readings will be assigned from this text.
• The web page lists two recommended texts for people who are learning Python.
Background survey
Please write on the index card your
• Name
• Email address
• Major
• Primary background (biology or computation)
• Amount of programming experience
• Whether you have access to a laptop
Introduction to DNA sequencing
Computer processing power doubles every 18-24 months.
Moore’s law
Growth of GenBank
Decreasing sequencing cost
Genome Sequence Milestones
• First complete viral genome reported in 1977: ΦX174 bacteriophage (5.4 Kb).
• Complete phage λ genome reported in 1983 (48.5 Kb).
• First complete non-viral genomes in 1995: the bacteria Haemophilus influenzae (1.8 Mb) and Mycoplasma genitalium (0.6 Mb). E. coli completed in 1997.
• First complete eukaryotic genome reported in 1997: Saccharomyces cerevisiae (12 Mb).
• First complete metazoan genome reported in 1998: Caenorhabditis elegans (98 Mb).
• Homo sapiens genome report (draft sequence) in 2001 (~3,000 Mb).
• Pan troglodytes genome report (draft sequence) in 2005, ~99% identical to human.
Also done or nearly done
• ~1,400 viral genomes
• ~200 eubacteria
• ~20 archaea
• Schizosaccharomyces pombe (~18 single-celled fungi)
• Aspergillus nidulans (~10 mycelial fungi)
• Plasmodium falciparum (and ~10 other protists)
• Caenorhabditis briggsae (3 nematodes)
• Drosophila melanogaster (~15 insects)
• Danio rerio (~4 fish)
• Gallus gallus (chicken)
• Mus musculus (mouse) (~9 mammals)
• Arabidopsis thaliana (~4 plants)
In the next few years
• Hundreds more bacteria and archaea
• Dozens more fungi
• Many more insects and nematodes
• Many more mammals
• Several “primitive” chordates
• Increasing focus on genetic diversity, rather than single reference sequences.
• Ecosystem sequencing (sequence of mixed organisms)
What are we learning?
• Completing the dream of Linnaean-Darwinian biology
  – There are THREE kingdoms (not five or two).
  – Two of the three kingdoms (eubacteria and archaea) were lumped together just 20 years ago.
  – Eukaryotic cells are amalgams of symbiotic bacteria.
• Demoted the human gene number from ~200,000 to about 20,000.
• Establishing the evolutionary relations among our closest relatives.
• Discovering the genetic “parts list” for a variety of organisms.
Carl Linnaeus, father of systematic classification
DNA Sequencing by gel electrophoresis
1. Start at primer (restriction site)
2. Grow DNA chain
3. Include dideoxynucleoside (modified a, c, g, t)
4. Stop reaction at all possible points
5. Separate products by length, using gel electrophoresis
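The steps above can be sketched as a toy simulation in Python. This is only an idealized illustration, not a model of the real chemistry: it assumes one reaction per base, that termination occurs at every occurrence of that base, and that the string passed in is the newly synthesized strand. The function names are hypothetical.

```python
def chain_termination_fragments(synthesized, base):
    """Lengths of all fragments that end in a dideoxy version of `base`.

    Idealized assumption: termination happens at every occurrence of `base`.
    """
    return [i + 1 for i, b in enumerate(synthesized) if b == base]


def read_gel(synthesized):
    """Recover the sequence by ordering the bands from all four lanes by length."""
    bands = []  # (fragment length, terminating base)
    for base in "ACGT":
        for length in chain_termination_fragments(synthesized, base):
            bands.append((length, base))
    bands.sort()  # electrophoresis separates fragments by length
    return "".join(base for _, base in bands)


strand = "ACGAATCAG"
print(chain_termination_fragments(strand, "A"))  # [1, 4, 5, 8]
print(read_gel(strand))                          # ACGAATCAG
```

Because every position ends exactly one fragment, reading the bands from shortest to longest spells out the sequence.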
Slide from Serafim Batzoglou
Electrophoresis diagrams
Slide from Serafim Batzoglou
Challenging to read answer
Slide from Serafim Batzoglou
Reading an electropherogram
1. Filtering
2. Smoothing
3. Correction for length compressions
4. A method for calling the letters
   – PHRED: PHil’s Read EDitor
Phil Green
Slide from Serafim Batzoglou
Output of PHRED: a read
A read: 500-700 nucleotides
Bases:   A  C  G  A  A  T  C  A  G  … A
Quality: 16 18 21 23 25 15 28 30 32 … 21
Quality scores: -10 × log10 Prob(Error)
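To make the quality-score formula concrete, here is a minimal Python sketch that converts between error probabilities and quality scores, using the nine scores visible in the read above. The function names are illustrative and are not part of PHRED itself.

```python
import math


def phred_quality(p_error):
    """Quality score Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Invert the quality score: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


bases = "ACGAATCAG"
quals = [16, 18, 21, 23, 25, 15, 28, 30, 32]  # scores from the read above

# Per-base probability that the called letter is wrong.
for base, q in zip(bases, quals):
    print(base, q, f"{error_probability(q):.4f}")

# Summing the per-base probabilities gives the expected number of
# miscalled bases in this stretch of the read.
expected_errors = sum(error_probability(q) for q in quals)
print(f"{expected_errors:.3f}")
```

A score of 20 corresponds to a 1% chance of error, and a score of 30 to a 0.1% chance.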
Reads can be obtained from the leftmost and rightmost ends of the insert.
Double-barreled sequencing (1990): both the leftmost and rightmost ends are sequenced, and the reads are paired.
Slide from Serafim Batzoglou
Method to sequence longer regions
Genomic segment is cut many times at random (shotgun)
Get one or two reads (~500 bp each) from each segment
Slide from Serafim Batzoglou
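The shotgun step can be sketched in Python as follows. This is a simplified simulation under stated assumptions: every insert has the same length, reads are error-free, and the right-hand read is reported on the same strand as the left-hand read rather than reverse-complemented. The function name and parameter values are hypothetical.

```python
import random


def shotgun_read_pairs(genome, n_inserts, insert_length, read_length, seed=0):
    """Cut `n_inserts` random inserts and read both ends of each one."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_inserts):
        start = rng.randint(0, len(genome) - insert_length)
        insert = genome[start:start + insert_length]
        # One read from the left end, one from the right end of the insert.
        pairs.append((insert[:read_length], insert[-read_length:]))
    return pairs


# A random 10 kb "genome" to cut up.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(10_000))

pairs = shotgun_read_pairs(genome, n_inserts=5, insert_length=2_000, read_length=500)
for left, right in pairs:
    # Distance between the start positions of the two paired reads.
    print(len(left), len(right), genome.find(right) - genome.find(left))
```

The known distance between paired reads is what makes double-barreled sequencing useful during assembly: it constrains where the two reads can sit relative to each other.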
Strategies for whole-genome sequencing
1. Hierarchical
i. Break genome into many long pieces
ii. Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat, Rice
2. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog
Slide from Serafim Batzoglou
Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region
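As a rough illustration of overlapping and extending reads, the Python sketch below computes how many reads a given coverage requires and then merges reads greedily by their largest exact overlap. Greedy merging is just one simple strategy, chosen here for brevity, and it assumes error-free reads from a region without repeats. All function names and the toy sequence are hypothetical.

```python
import math


def reads_needed(genome_length, read_length, coverage):
    """Number of reads so that total read length is `coverage` times the genome."""
    return math.ceil(coverage * genome_length / read_length)


def overlap(a, b, min_length):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of contigs with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: report several separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        # Discard any contig that is now fully contained in the merged one.
        contigs = [c for c in rest if c not in merged] + [merged]
    return contigs


print(reads_needed(3000, 500, 7))  # 42

reads = ["AGGCTT", "GATTAC", "GCTTCA", "TTACAG", "ACAGGC"]
print(greedy_assemble(reads))  # ['GATTACAGGCTTCA']
```

The toy region was chosen so that every 3-letter substring is unique, which is why requiring an overlap of at least 3 never produces a false join here.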
Slide from Serafim Batzoglou
Problems that arise
• Non-random clone coverage (e.g. sequence that replicates poorly in E. coli).
• Insufficient sequence overlap (too little information content to permit high probability matching).
• Sequence heterogeneity (either from sequence read errors or because of polymorphism in the source DNA).
• Repeat sequences.
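One way to cope with read errors and polymorphism is to accept overlaps that contain a few mismatches. The Python sketch below is an assumption-laden illustration: it counts substitutions only (no insertions or deletions), and the function names, the example reads, and the 10% threshold are chosen for the example rather than taken from any particular assembler.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_overlap(a, b, min_length, max_error_rate):
    """Longest suffix of `a` matching a prefix of `b` within the error budget.

    Returns (overlap_length, mismatches), or (0, 0) if nothing qualifies.
    """
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        mismatches = hamming(a[-k:], b[:k])
        if mismatches <= max_error_rate * k:
            return k, mismatches
    return 0, 0


a = "GGATCCAGTCAGGTTACA"
b = "TCAGCTTACAGGCTAT"  # overlaps the end of `a`, with one substituted base

print(approximate_overlap(a, b, min_length=3, max_error_rate=0.0))  # (0, 0)
print(approximate_overlap(a, b, min_length=3, max_error_rate=0.1))  # (10, 1)
```

Tolerating mismatches recovers the true overlap that exact matching misses, but it also makes it easier to join two different copies of a repeat, which is why short overlaps carry too little information to be trusted.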
In-class exercise
• You have been given a collection of sequence reads of a fixed length.
• The error rate for your sequencing technology is less than 10%.
• Assemble the correct sequence of letters, first on your own, then with other members of your team.
Let us not wallow in the valley of despair. I say to you today, my friends, that in spite of the difficulties and frustrations of the moment, I still have a dream. It is a dream deeply rooted in the American dream. I have a dream that one day this nation will rise up and live out the true meaning of its creed: "We hold these truths to be self-evident: that all men are created equal." I have a dream that one day on the red hills of Georgia the sons of former slaves and the sons of former slaveowners will be able to sit down together at a table of brotherhood. I have a dream that one day even the state of Mississippi, a desert state, sweltering with the heat of injustice and oppression, will be transformed into an oasis of freedom and justice. I have a dream that my four children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. —Martin Luther King, August 28, 1963