Genomics
The Human Genome Project
Mapping and Sequencing the Genomes of
Model Organisms
Data Collection and Distribution
Ethical, Legal, and Social Considerations
Research Training
Technology Development
Technology Transfer
A Few Genome Resources
NCBI Genome Resources
UCSC Human Genome Browser
Ensembl Human Genome Server
Genome Sequencing Progress
NCBI Genome Sequence Repository
All organisms
Eukaryotic genomes
Prokaryotic genomes
Archaea genomes
Viruses
Genome Sequencing
From NCBI, 5/2001
Human Genome Sequencing 2/11/2001
From NCBI
Human Genome Progress 2/11/2001
              Total sequence (kb)   Non-redundant sequence (kb)   Percentage of genome
Finished      1,140,365             1,040,372                     32.50%
Unfinished    3,547,899             1,951,344                     61.00%
Total         4,688,264             2,991,716                     93.50%
From NCBI
Microbial Genomes
Published complete microbial genomes
Microbial genomes and chromosomes in
progress
Genome Informatics
Annotation and Analysis
Data Handling
Metabolic Reconstruction
Comparative Genomics
Functional Genomics
Genome Project Organization
Cloning
Mapping
Sequencing
Annotation
Analysis
Cloning and Mapping
Cloning
Large
YACs
1 Mb
BACs
100 - 200 Kb
Intermediate
Cosmids
Lambda clones
Small
Plasmids; M13
Mapping
Establishment of Guideposts
Aids in Assembly
Error Checking
Useful in mapping of genetic disorders
Genetic Maps
Cytogenetic markers
Linkage maps
Polymorphic loci screened by PCR to
determine inheritance patterns
Produce linkage map with nearby loci
Physical Maps
Radiation Hybrid/YACs/Cosmids
Restriction Sites
Sequence Tagged Sites
100 Kb resolution needed
30,000 STSs
Expressed Sequence Tags
Detection
PCR
Hybridization
FISH
Fluorescence in situ Hybridization
Human Genome STS Mapping Strategy
STS Content Mapping
Screen YACs by PCR
Radiation Hybrid Mapping
Screen RH Cell lines by PCR
Genetic Mapping
PCR Screening of polymorphic loci
Combine above to produce an integrated
map
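The STS content mapping step can be illustrated with a minimal sketch: two clones that both test PCR-positive for the same STS are inferred to overlap. The clone and STS names below are hypothetical, and a real integrated map would also weigh radiation hybrid and genetic data.

```python
from itertools import combinations

def overlapping_clones(sts_hits):
    """Infer clone overlaps from shared STS markers.

    sts_hits maps a clone name to the set of STSs it tested positive for.
    Returns {(clone_a, clone_b): [shared STSs]} for every overlapping pair.
    """
    overlaps = {}
    for a, b in combinations(sorted(sts_hits), 2):
        shared = sts_hits[a] & sts_hits[b]
        if shared:
            overlaps[(a, b)] = sorted(shared)
    return overlaps

# Hypothetical PCR screening results for three YACs.
example_hits = {
    "yac1": {"sts1", "sts2"},
    "yac2": {"sts2", "sts3"},
    "yac3": {"sts4"},
}
print(overlapping_clones(example_hits))
```

Chaining the overlapping pairs gives the clone order along the chromosome; clones with no shared STS stay unplaced.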
Mapping Resolution
YAC mapping
1 Mb
Radiation hybrid mapping
10 Mb
Genetic map
30 Mb
GeneMap’98
Integrated Human Genetic Map
Over 30,000 unique gene-based markers
100 Kb resolution
http://www.ncbi.nlm.nih.gov/genemap98/
Map Integration
Human Chromosome 1 Genetic Map
Human Chromosome 1 Combination Map
Sequencing
Sequencing Methods
Random Shotgun
Ordered Shotgun
Directed
Primer Walking
Direct genomic sequencing
Random Shotgun Sequencing
Randomly shear or cut DNA into small pieces
2-4 Kb
Clone into M13, pUC or some other sequencing
vector
Sequence the clones from both ends
Rely on the computer to assemble the
sequences into one (or as few as possible)
contigs
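The steps above can be sketched as a small simulation: pick random inserts in the 2-4 Kb range and read a fixed length from both ends. The read length of 500 bases, the function names, and the uniform choice of insert positions are assumptions made for illustration only.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def shotgun_read_pairs(genome, num_clones, read_len=500,
                       insert_range=(2000, 4000), seed=0):
    """Simulate sequencing random clones from both ends.

    The genome must be at least as long as the largest insert.
    Returns a list of (forward_read, reverse_read) tuples; the reverse
    read comes from the opposite strand, as it would off the vector.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_clones):
        insert_len = rng.randint(*insert_range)
        start = rng.randint(0, len(genome) - insert_len)
        insert = genome[start:start + insert_len]
        forward = insert[:read_len]
        reverse = reverse_complement(insert)[:read_len]
        pairs.append((forward, reverse))
    return pairs
```

Because both reads come from one clone, their approximate distance and relative orientation are known, which helps the assembler order contigs.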
Shotgun Sequencing Statistics
Lander and Waterman equation
Poisson distribution
P_0 = e^{-m}
probability that a base is not sequenced
where m = sequence coverage
H. influenzae Sequencing
For 1X random sequence coverage = 1.8 Mb
P_0 = e^{-1} = 0.37 (63% of the bases are sequenced)
To get > 99% of the bases sequenced
5X coverage = 8.74 Mb of sequence
P_0 = e^{-5} = 0.0067
This coverage would leave approx. 128 gaps of
about 100 bp in size
From Science 269:496-512. 1995
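The coverage arithmetic on these two slides can be checked with a few lines. This is a sketch of the Poisson estimate only; the function names are illustrative, and the gap estimate uses the simplified form N * e^{-m}, which ignores the minimum overlap needed to join two reads.

```python
import math

def fraction_unsequenced(coverage):
    """P_0 = e^(-m): probability that a given base is not sequenced."""
    return math.exp(-coverage)

def coverage_for_fraction(target_fraction):
    """Coverage m needed so that target_fraction of the bases are sequenced."""
    return -math.log(1.0 - target_fraction)

def expected_gaps(num_reads, coverage):
    """Approximate number of gaps left after num_reads random reads."""
    return num_reads * math.exp(-coverage)

for m in (1, 5):
    print(f"{m}X coverage: P_0 = {fraction_unsequenced(m):.4f}")
print(f"coverage for 99% of bases: {coverage_for_fraction(0.99):.2f}X")
```

Under this model, just over 4.6X coverage is enough to pass 99% of bases sequenced, which is why 5X is a natural target.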
Ordered Sequencing
Generate a set of large sequence clones in
lambda phage
May be subcloned from YACs or BACs as necessary
End sequence the lambda clones and order the
clones to produce a map of the genome
Choose a minimal tiling path of the genome from
the ordered lambda clones
Ordered Sequencing...
Shear and subclone the lambda inserts
that comprise the minimal tiling set into
sequencing vectors
Shotgun sequence and assemble each of
these lambda inserts individually
Assemble all sequences into one,
contiguous genome
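Choosing a minimal tiling path can be sketched as a greedy interval cover: starting from the left end of the map, always take the clone that begins inside the covered region and extends furthest to the right. The tuple layout (name, start, end) and the map coordinates are assumptions for this illustration.

```python
def minimal_tiling_path(clones, genome_start, genome_end):
    """Pick the fewest clones that cover [genome_start, genome_end].

    clones is a list of (name, start, end) tuples in map coordinates.
    Raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = genome_start
    i = 0
    while covered_to < genome_end:
        best = None
        # Consider every clone that starts within the region covered so far.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path

# Hypothetical lambda clones on a 130 kb map.
example_clones = [("A", 0, 40), ("B", 10, 50), ("C", 35, 90),
                  ("D", 45, 100), ("E", 85, 130)]
print([name for name, _, _ in minimal_tiling_path(example_clones, 0, 130)])
```

Only the clones on the returned path need to be subcloned and shotgun sequenced, which is the saving that ordered sequencing aims for.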
Directed Sequencing
Process used for finishing following the
shotgun sequencing phase
Gap closure
Use specific sequencing primers to extend
appropriate clones into gap regions
Use specific sequencing primers to
sequence directly from genomic DNA
Sequence Assembly
Assembly of Shotgun Fragments
For H. influenzae (TIGR) 1.8 Mb
24,304 Sequence fragments were generated
for the random assembly phase
11,631,485 bases
Generated 140 contigs
Assembled using the TIGR Assembler
30 hours of cpu time
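To show what assembly of shotgun fragments involves, here is a toy greedy assembler that repeatedly merges the two sequences with the longest suffix-prefix overlap. It is not the TIGR Assembler algorithm: it assumes error-free reads from a single strand and would be far too slow for real data.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by best overlap until no overlap >= min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]))
```

Reads that share no sufficient overlap stay as separate contigs, which is how an assembly ends up with many contigs rather than one.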
phred/phrap/consed
Widely used programs for sequence:
base calling (phred)
assembly (phrap)
editing (consed)
Developed at the University of Washington
Phil Green (phrap)
Brent Ewing (phred)
David Gordon (consed)
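The base-calling step assigns each base a quality value. The slide does not define it, so the relation below is the standard phred definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong; the function names are illustrative.

```python
import math

def error_probability(quality):
    """Convert a phred quality value Q into an error probability P."""
    return 10 ** (-quality / 10)

def quality_from_error(probability):
    """Convert an error probability P into a phred quality value Q."""
    return -10 * math.log10(probability)

for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / error_probability(q))} bases")
```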
Genome Annotation and Analysis
Pattern Matching
Sequence Annotation
ORF identification
Frameshift resolution
Genome map construction
Functional assignments
Metabolic pathway assignment
Metabolic pathway Reconstruction
Comparative analysis
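ORF identification, the first annotation step listed above, can be sketched as a scan of all six reading frames for an ATG followed by an in-frame stop codon. The minimum length and the function names are assumptions. The alternative stop set reflects the fact that mycoplasmas and ureaplasmas read TGA as tryptophan rather than stop; real gene finders also use codon statistics and alternative start codons.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
MYCOPLASMA_STOPS = frozenset({"TAA", "TAG"})  # TGA is read as Trp
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def _orfs_one_strand(seq, min_codons, stops):
    """Yield (start, end) of ATG...stop ORFs in the three frames of seq."""
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in stops:
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3
                start = None

def find_orfs(seq, min_codons=30, stops=STANDARD_STOPS):
    """Return (strand, start, end) tuples, 0-based half-open, forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [("+", s, e) for s, e in _orfs_one_strand(seq, min_codons, stops)]
    rc = seq.translate(COMPLEMENT)[::-1]
    orfs += [("-", n - e, n - s)
             for s, e in _orfs_one_strand(rc, min_codons, stops)]
    return sorted(orfs, key=lambda o: (o[1], o[2]))
```

Calling `find_orfs(sequence, stops=MYCOPLASMA_STOPS)` on a mycoplasma-like genome avoids truncating genes at internal TGA codons.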
Annotation Tools
Semi-automated
Manual
MAGPIE
Multipurpose Automated Genome Project
Investigation Environment
Terry Gaasterland et al.
http://genomes.rockefeller.edu/magpie/magpie.html
Automated
Semi-automated analysis tool for microbial
genome projects
MAGPIE Example
Non-Automated Analysis and Prediction
The Ureaplasma urealyticum genome
database
Run analysis tool
Parse results
Dump results into the database
View results
Manually annotate
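The run, parse, dump, and view cycle above might look like the following sketch, which uses an in-memory SQLite database. The tab-delimited input format, the table layout, and the gene identifiers are all hypothetical; the actual Ureaplasma database schema is not described in the slides.

```python
import sqlite3

def parse_hits(text):
    """Parse tab-delimited lines of gene_id, hit description, score."""
    hits = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        gene_id, description, score = line.split("\t")
        hits.append((gene_id, description, float(score)))
    return hits

def load_hits(conn, hits):
    """Dump parsed results into the database."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits "
                 "(gene_id TEXT, description TEXT, score REAL)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", hits)
    conn.commit()

def best_hit_per_gene(conn):
    """View step: the top-scoring hit for each gene."""
    rows = conn.execute("SELECT gene_id, description, MAX(score) "
                        "FROM hits GROUP BY gene_id ORDER BY gene_id")
    return rows.fetchall()

# Hypothetical output from an analysis tool.
tool_output = ("UU001\tDNA polymerase III\t250.0\n"
               "UU001\thypothetical protein\t40.0\n"
               "UU002\turease alpha subunit\t900.5\n")
conn = sqlite3.connect(":memory:")
load_hits(conn, parse_hits(tool_output))
print(best_hit_per_gene(conn))
```

Keeping every hit in the table, not just the best one, leaves the evidence available for the manual annotation step.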
Genomic Sequence Database
Data Storage
Sequence
Gene Map
Annotation
User Interface
Web browser
Customizable
The Ureaplasma urealyticum Genome Project
Uu - 751,719 bp
http://genome.microbio.uab.edu/uu/uugen.htm
Web-based genome analysis tool
Annotation Problems
Problems with existing sequence databases
Incomplete datasets
Skewed datasets
Incorrectly annotated records
Annotations based on experimental vs. predicted
data
Nomenclature differences
Transitive errors in gene function predictions
Functional predictions for “hypothetical” genes
Metabolic Pathway Reconstruction
Role assignment
Extract metabolic pathways from genomes
Navigation and analysis
Pathway editing
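Extracting a pathway from a genome amounts to asking which of the pathway's enzymes have an annotated gene. The sketch below does this by EC number for a few glycolytic steps. The EC numbers are the standard ones for these enzymes, but the annotated set is made up for illustration and says nothing about any real genome.

```python
# Standard EC numbers for a few glycolytic enzymes.
GLYCOLYSIS_STEPS = {
    "phosphoglucose isomerase": "5.3.1.9",
    "6-phosphofructokinase": "2.7.1.11",
    "fructose bisphosphate aldolase": "4.1.2.13",
    "glyceraldehyde-3-phosphate dehydrogenase": "1.2.1.12",
    "phosphoglycerate kinase": "2.7.2.3",
}

def pathway_report(pathway_steps, annotated_ec_numbers):
    """Split pathway enzymes into those found and those missing in the annotation."""
    present = sorted(name for name, ec in pathway_steps.items()
                     if ec in annotated_ec_numbers)
    missing = sorted(name for name, ec in pathway_steps.items()
                     if ec not in annotated_ec_numbers)
    return present, missing

# Hypothetical set of EC numbers assigned during annotation.
hypothetical_annotation = {"5.3.1.9", "2.7.2.3"}
present, missing = pathway_report(GLYCOLYSIS_STEPS, hypothetical_annotation)
print("present:", present)
print("missing:", missing)
```

A missing step may mean the enzyme is truly absent, or only that its gene sits among the hypothetical proteins, so the report is a starting point for manual review.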
Metabolic Assignments
Amino acid Biosynthesis
Biosynthesis of cofactors, prosthetic groups, and carriers
Cell envelope
Cellular processes
Central intermediary metabolism
Energy metabolism
Fatty acid and phospholipid metabolism
Purines, pyrimidines, nucleosides, and nucleotides
Regulatory functions
Replication
Transcription
Translation
Transport and binding proteins
Other categories, Unassigned
Hypothetical
Ureaplasma urealyticum Gene Map
[Figure: gene map of the 751,719 bp genome, drawn in rows of 50,000 bp (1 to 751,719). Color key by role: Cofactor Biosynthesis, Cell envelope, Cellular processes, Central Intermediary Metabolism, Energy Metabolism, Fatty Acid Metabolism, Hypothetical, Nucleotide Metabolism, Replication, Transcription, Translation, Transport, Other, RNA, tRNA]
                                    Uu Genes              Mg Genes
Role                                #      % of Total     #      % of Total
Amino acid Biosynthesis             1      0.2%           0      0.0%
Biosynthesis of cofactors           10     1.7%           7      1.5%
Cell envelope                       19     3.1%           26     5.4%
Cellular processes                  13     2.1%           15     3.1%
Central intermediary metabolism     15     2.5%           7      1.5%
Energy metabolism                   23     3.8%           30     6.3%
Fatty acid - phospholipids          6      1.0%           7      1.5%
Hypothetical                        293    48.3%          169    35.3%
Other categories                    1      0.2%           3      0.6%
Purines, pyrimidines                18     3.0%           20     4.2%
Regulatory functions                4      0.7%           4      0.8%
Replication                         45     7.4%           31     6.5%
Transcription                       17     2.8%           19     4.0%
Translation                         100    16.5%          99     20.7%
Transport and binding proteins      37     6.1%           35     7.3%
Unassigned                          4      0.7%           7      1.5%
Total                               606    100.0%         479    100.0%
EcoCyc
Peter D. Karp, PhD
SRI International
Menlo Park, CA
http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
Pathway Reconstruction
[Figure: hierarchy of biological levels and the data that describe them, from top to bottom]
Cell
Metabolic Network
Pathways: Annotated Genome
Reactions (Compounds): List of Gene Products
Gene Products: List of Genes/ORFs
Genes: DNA Sequence
Genomic Maps
Adapted from P. Karp, Pangea Systems
Glycolysis in Uu?
glucose-1-phosphate
? phosphoglucomutase
glucose-6-phosphate
phosphoglucose isomerase
fructose-6-phosphate
6-phosphofructokinase
fructose-1,6-bisphosphate
fructose bisphosphate aldolase
glyceraldehyde-3-phosphate
glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12)
3-phospho-D-glyceroyl-phosphate
phosphoglycerate kinase
3-phosphoglycerate
Alternative route: glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.9), glyceraldehyde-3-phosphate directly to 3-phosphoglycerate
pyruvate
Uu Energy Metabolism
Glycolysis
Missing several components
Pentose-phosphate pathway
Only 2/8 enzyme complexes present
Proton motive force - ATP synthase
complex
Urease Gene Complex
Biologically relevant
Comparative Genomics
What makes one organism different from
all other organisms?
Molecular Biology
Physiology
Pathogenesis
Epidemiology
Genetics
Ortholog Comparisons
Uu to Mg genes: 324
53% of Uu; 67% of Mg
71 hypothetical
Mh to Mg genes: 314
41% of Mh; 57% of Mg
55 hypothetical (2 – “unique hypothetical”)
Mh to Uu genes: 330
47% of Uu; 43% of Mh
82 hypothetical (19 – “unique hypothetical”)
M. genitalium - M. pneumoniae Gene Order
[Figure: gene order plot. x-axis: M. pneumoniae Gene Position (0 to 800,000); y-axis: 0 to 500,000]
M. genitalium - U. urealyticum Gene Order
[Figure: gene order plot. x-axis: U. urealyticum Gene Position (0 to 700,000); y-axis: 0 to 500,000]
Paralog Analysis
Identification of conserved, paralogous
groups
All against All comparison
Genes within one organism
Identifies groups of “related” genes
Primary sequence
Structure
Function
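An all-against-all comparison within one genome can be sketched as follows: score every pair of proteins, link pairs above a threshold, and report the linked groups. The shared 3-mer (Jaccard) score is a stand-in chosen to keep the example self-contained; a real analysis would use an alignment-based search, and the threshold here is arbitrary.

```python
from itertools import combinations

def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=3):
    """Shared k-mer fraction, used here as a toy similarity score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def paralog_clusters(proteins, threshold=0.5, similarity=jaccard):
    """Single-linkage clusters of proteins with pairwise similarity >= threshold.

    proteins maps a gene name to its protein sequence.
    Returns sorted clusters that contain at least two genes.
    """
    parent = {name: name for name in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(proteins, 2):
        if similarity(proteins[a], proteins[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in proteins:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values()
                  if len(members) > 1)
```

Single linkage means a gene joins a group if it resembles any one member, so lowering the threshold can merge groups that are only distantly related.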
Uu Paralogous Clusters >3
4 tRNA synthetase
4 Translation factors
4 Hypothetical membrane lipoprotein
5 ATP synthase alpha, beta chains
6 MBA
7 Hypothetical membrane lipoprotein
8 Hypothetical
10 Iron transporters
13 Transporters
Functional Genomics
Gene Expression
Gene Regulation
Genome-wide Mutagenesis
Expression Arrays
Cell growth in different environments
Isolate cDNAs
Measure expression using array technology
Create database of expression information
Display information in an easy-to-use format
Show ratio of expression under different conditions
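The last step, showing the ratio of expression under different conditions, is commonly reported on a log2 scale so that induction and repression are symmetric around zero. The sketch below assumes each condition is a mapping from gene name to a measured intensity; the pseudocount that guards against division by zero and the gene names are assumptions.

```python
import math

def log2_ratios(condition_a, condition_b, pseudocount=1.0):
    """log2(a / b) for every gene measured in both conditions."""
    shared = condition_a.keys() & condition_b.keys()
    return {
        gene: math.log2((condition_a[gene] + pseudocount) /
                        (condition_b[gene] + pseudocount))
        for gene in shared
    }

# Hypothetical array intensities under two growth conditions.
condition_a = {"geneA": 7.0, "geneB": 1.0, "geneC": 3.0}
condition_b = {"geneA": 1.0, "geneB": 7.0, "geneC": 3.0}
for gene, ratio in sorted(log2_ratios(condition_a, condition_b).items()):
    print(f"{gene}: {ratio:+.1f}")
```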
Putting it all together
From F. Blattner, U. Wisc.
Chromosome Views
Ensembl view
UC Santa Cruz view
NCBI View
A Final Caveat
“The difficulty of identifying genes in
anonymous vertebrate sequences”
Claverie JM, Poirot O, Lopez F
Comput Chem 1997;21(4):203-14
The identification of genes in newly determined vertebrate genomic
sequences can range from a trivial to an impossible task. In a
statistical preamble, we show how "insignificant" are the individual
features on which gene identification can be rigorously based:
promoter signals, splice sites, open reading frames, etc. The practical
identification of genes is thus ultimately a tributary of their
resemblance to those already present in sequence databases, or
incorporated into training sets. The inherent conservatism of the
currently popular methods (database similarity search, GRAIL) will
greatly limit our capacity for making unexpected biological
discoveries from increasingly abundant genomic data. Beyond a very
limited subset of trivial cases, the automated interpretation (i.e.
without experimental validation) of genomic data, is still a myth. On
the other hand, characterizing the 60,000 to 100,000 genes thought to
be hidden in the human genome by the mean of individual
experiments is not feasible. Thus, it appears that our only hope of
turning genome data into genome information must rely on drastic
progresses in the way we identify and analyze genes in silico.
Only One Final Word of Wisdom...
“...although the computer is a wonderful
helpmate for the sequence searcher and
comparer, biochemists and molecular
biologists must guard against the blind
acceptance of any algorithmic output;
given the choice, think like a biologist and
not a statistician.”
- Russell F. Doolittle, 1990