Oct 10 2006 Features and evolution of bacterial genomes
Reading
For today: 47-55 in G&M
For Thursday: 118-119, 122-126 in G&M
This week
1. Features of bacterial genomes
2. Processes in bacterial genome evolution
3. Functional categorization of genes (Thurs)
Features of Bacterial Genes and Genomes known before genomics era
• Size - 0.5-10 Mb (~500 to 10,000 genes); average gene length=~1kb
• Bacterial genomes are tightly packed with genes: little repetitive, transposable, & non-coding DNA, considered to have no or few pseudogenes
• Bacterial genes lack introns, are arranged as operons, located on both strands
• Chromosomes are (typically) circular, with a single origin and terminus region
• Gene order conserved among closely related bacteria
• Close relatives can differ in gene content, e.g., genes conferring pathogenicity
• Base composition varies (25-75% G+C), similar in closely related species
• Base composition is relatively homogeneous over the entire chromosome
• Within a species, each codon position has a characteristic G+C content (and there is a species-specific pattern of codon usage)
All of these “facts” were based on a few model species, such as E. coli and Bacillus
bacterial genome sequencing as of Oct 2006
• 429 sequenced and available in Bacteria
• 36 in Archaea
• Many others in progress, e.g., at GOLD http://www.genomesonline.org/
• Finished sequences are more accurate than sequences of eukaryotic genomes
• Includes soil, marine, pathogenic, commensal, symbiotic, extremophilic, etc. bacteria
• Possible to compare distant and related species to infer evolutionary changes in genomes
Depicting Full Genome Sequences
[Figure: circular genome map with tracks for genes on the leading strand, genes on the lagging strand, %GC, GC skew ((G-C)/(G+C)), tRNA, rRNA, and repeats (<30 nt)]
(Adapted from Bao et al. 2002. Genome Research)
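The %GC and GC skew tracks in a genome map like this one are usually computed over sliding windows. The following is a minimal sketch of that calculation, not the method of the cited figure; the function names, the window size, and the step size are all illustrative assumptions, and the sequence is treated as linear rather than circular.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    a, t = seq.count("A"), seq.count("T")
    total = a + c + g + t
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """GC skew, (G - C) / (G + C); 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq, window, step):
    """Yield (start, subsequence) for each full window of a linear sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def profile(seq, window=1000, step=500):
    """Return a list of (start, %GC, GC skew) tuples, one per window."""
    return [(start, 100 * gc_content(w), gc_skew(w))
            for start, w in sliding_windows(seq, window, step)]
```

Plotting the third value of each tuple against position gives the kind of GC skew track shown in such maps; a sign change in skew is often used as a rough guide to the origin and terminus regions.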
Features of Bacterial Genomes
• Gene Inventory
– Genome as a bag of genes
• Non-coding sequences
– Spacers and functional features
• Base Composition
– Variation among and within genomes
• Gene Order and orientation
Gene Inventory
• Complete gene set
– Information only available from full genome sequences
• Comparisons started with 2nd genome sequenced (Mycoplasma genitalium, 1995).
– 25-fold differences among sequenced bacteria in #ORFs (470 to ~10000)
• Now 50-fold because Carsonella symbiont is only 180 ORFs! (see Science on Friday Oct 13)
– Distinct gene sets in genomes of similar size
• Observations so far indicate up to 25% difference in gene content for strains of the same species
• Much larger differences for unrelated bacteria
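A gene-content comparison of this kind reduces to set operations once genes have been assigned to ortholog families. The sketch below assumes that assignment has already been done; the family identifiers and the function name are hypothetical, and "fraction different" is just one of several possible ways to express a percentage difference in gene content.

```python
def compare_gene_content(genes_a, genes_b):
    """Summarize shared and genome-specific gene families for two genomes.

    Each argument is an iterable of ortholog-family identifiers
    (hypothetical labels; family assignment is a separate step).
    """
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Fraction of all families seen in either genome that are not shared.
        "fraction_different": (len(union) - len(shared)) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    strain_1 = {"fam1", "fam2", "fam3", "fam4"}
    strain_2 = {"fam1", "fam2", "fam3", "fam5"}
    print(compare_gene_content(strain_1, strain_2))
```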
Complete gene inventories and functions
Small genomes lack many genes for biosynthesis, transport, catabolism, and signal transduction, as well as many genes of unknown function. There is little loss of genes for translation.
Genomes compared: Xylella fastidiosa, Yersinia pestis, Vibrio cholerae, Pseudomonas aeruginosa, Pasteurella multocida, Neisseria meningitidis, Haemophilus influenzae, Escherichia coli K12, Buchnera aphidicola
Minimal gene set
• Idea of “Minimal Genome” consisting of essential genes
• Each genome contains minimal gene set + additional niche-specific genes
• Minimal Gene Set would be those required for central processes of the cell; expected to be universally distributed and to determine the minimal genome size
• Based on size determinations from Pulsed Field Gel Electrophoresis (CHEF gels), minimal genome sizes had been observed to be similar among diverse groups of bacteria: ~500 kb
– Mycoplasmas
– Rickettsiae
– Symbiotic species of gamma-Proteobacteria
– All tiny genomes are obligately host-dependent
– Can be pathogenic or mutualistic in hosts
– Can be close relatives of large genome organisms (Buchnera close to E. coli)
Mushegian AR, Koonin EV. 1996. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93:10268
“To derive such a set, we compared the 468 predicted M. genitalium protein sequences with the 1703 protein sequences encoded by the other completely sequenced small bacterial genome, that of Haemophilus influenzae. M. genitalium and H. influenzae belong to two ancient bacterial lineages, i.e., Gram-positive and Gram-negative bacteria, respectively. Therefore, the genes that are conserved in these two bacteria are almost certainly essential for cellular function. It is this category of genes that is most likely to approximate the minimal gene set. We found that 240 M. genitalium genes have orthologs among the genes of H. influenzae.”
Minimal gene set, 2005
• Idea of “Minimal Genome” doesn’t work out as first envisioned.
• The number of universal genes has decreased with each genome sequenced…
• Only ~60 genes universal among sequenced genomes by 2003
• This is possible due to “non-orthologous” displacement, i.e., different gene sets to do the same job.
Koonin Nature Rev. Microbiol 2003
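The shrinking count of universal genes can be pictured as a running intersection: each newly sequenced genome can only keep the universal set the same size or make it smaller. The sketch below uses made-up family identifiers and a hypothetical function name; it illustrates the set logic only and is not the procedure used in the cited analyses.

```python
def running_core_sizes(genomes):
    """Return the size of the shared gene set after each genome is added.

    genomes: a sequence of iterables of ortholog-family identifiers
    (hypothetical labels), in the order the genomes were sequenced.
    """
    core = None
    sizes = []
    for genes in genomes:
        # The first genome defines the starting set; later ones intersect it.
        core = set(genes) if core is None else core & set(genes)
        sizes.append(len(core))
    return sizes


if __name__ == "__main__":
    toy = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "x"}]
    print(running_core_sizes(toy))
```

Note that a function carried out by unrelated genes in different lineages (non-orthologous displacement) drops out of the intersection even though every genome still performs it.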
A case of non-orthologous displacement
lysyl tRNA synthetase
Class II (COG1190): present in most bacteria, eukaryotes, and crenarchaeota
Class I (COG1384): present in euryarchaeota, spirochetes, and most alphaproteobacteria
Perfectly complementary distribution among taxa
Unrelated sequences: no indication of homology
This is one of many cases--but some are not so clear.
COG = “Clusters of Orthologous Groups”, a system for classifying genes by homology (topic for Thursday)
Koonin 2003
The ubiquitous genes
Ribosomal proteins                                    30
Aminoacyl-tRNA synthetases                            15
Translation factors                                    6
Enzymes of RNA and protein modification                3
Signal recognition components in secretion             3
Molecular chaperone/protease                           1
RNA polymerase subunits                                2
DNA polymerase subunit, exonuclease, topoisomerase     3
Total                                                 63
Koonin 2003
Defining the Minimal Genome
Combining genome sequence analysis and experimental studies
Hutchison CA, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Venter JC. 1999. Global transposon mutagenesis and a minimal Mycoplasma genome. Science 286:2165
[Figure: gene sets of Mycoplasma genitalium (480 genes) and Mycoplasma pneumoniae (677 genes); legend categories: presumed dispensable, dispensable + essential]
How much do gene sets differ between genomes? Shared gene content drops off quickly with divergence of orthologous genes.
MA Huynen, P Bork. 1998. Measuring genome evolution. PNAS 95: 5849
Extensive divergence in gene sets of E. coli strains
R. Welch et al. PNAS 2002.
Features of Bacterial Genomes
• Gene Inventory
– Genome as a bag of genes
• Non-coding sequences
– Spacers and functional features
• Base Composition
– Variation among and within genomes
• Gene Order and orientation
Genome size and ORF content in fully sequenced Bacteria
[Figure: number of intact ORFs plotted against genome size (Mb); points are marked as obligate pathogen or symbiont vs. free-living or facultative pathogen]
For the most part, genomes are tightly packed with genes, which is very different from eukaryotes.
Mycobacterium leprae: many recent and older pseudogenes
When genes are lost, they leave behind a legacy of decaying ORFs, which may be identifiable only if the loss was recent.
Features of Bacterial Genomes
• Gene Inventory
– Genome as a bag of genes
• Non-coding sequences
– Spacers and functional features
• Base Composition
– Variation among and within genomes
• Gene Order and orientation
The smallest genome size and G+C content of any cellular organism
[Figure: GC content (%) plotted against genome size (Mb) for primary endosymbionts of insects and other bacteria; the lowest point is labeled 16.5%]
Features of Bacterial Genomes
• Gene Inventory
– Genome as a bag of genes
• Non-coding sequences
– Spacers and functional features
• Base Composition
– Variation among and within genomes
• Gene Order and orientation
– Usually similar between close relatives
– Symmetry of inversions seen between related genomes (“X-Plots”)
– Almost complete lack of correspondence between distant relatives (different phyla) except for a few conserved operons
• Ribosomal proteins
Eisen et al. 2000
Processes affecting bacterial genomes
• Mutational bias
• Lateral gene transfer
• Deletions and gene loss
• Rearrangements
Differences in Base Composition among Bacteria are caused by Mutational Biases
Sueoka 1962: effect of mutational bias on genome base composition
Affected by position on leading vs. lagging strand, and by transcription
Much variation among genomes due to differences in mutational processes (repair gene sets, etc.)
[Figure: mutation between G+C and A+T base pairs]
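The effect of mutational bias on base composition is often summarized with a two-state model, in which G+C pairs mutate to A+T at rate u and A+T pairs mutate to G+C at rate v. Under that model, and ignoring selection, the G+C fraction settles at v / (u + v). The sketch below is a minimal illustration of that relationship; the function names and the rates used in the example are assumptions chosen for demonstration, not measured values.

```python
def equilibrium_gc(u, v):
    """Equilibrium G+C fraction for a two-state mutation model.

    u: per-site rate of GC -> AT change
    v: per-site rate of AT -> GC change
    """
    return v / (u + v)


def gc_after(p0, u, v, generations):
    """Iterate the G+C fraction forward from a starting value p0."""
    p = p0
    for _ in range(generations):
        # AT sites (1 - p) are gained as GC at rate v; GC sites (p) are lost at rate u.
        p = p + v * (1 - p) - u * p
    return p


if __name__ == "__main__":
    # Illustrative rates only: a threefold bias toward A+T.
    print(equilibrium_gc(u=0.03, v=0.01))
    print(gc_after(p0=0.5, u=0.03, v=0.01, generations=2000))
```

The equilibrium depends only on the ratio of the two rates, which is why lineages with different repair gene sets can drift toward very different G+C contents.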
Compositional Heterogeneity among Bacterial Genomes
(Muto & Osawa, 1987)
[Figure: G+C content (%) at codon positions plotted against genomic G+C content (%); both axes run from 20 to 80]
Why is base composition most conservative at 2nd positions and least conservative at 3rd positions?
Consistent with mutational biases as the main basis of differences among species in base composition.
Stronger purifying selection at 2nd and 1st positions.
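G+C content at each codon position is computed by counting G and C separately at the first, second, and third base of every codon. The sketch below assumes the input is a single in-frame coding sequence starting at the first base of a codon; the function name is illustrative, and trailing bases that do not complete a codon are ignored.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) as fractions for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any incomplete trailing codon
    n_codons = usable // 3
    if n_codons == 0:
        return (0.0, 0.0, 0.0)
    counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1  # i % 3 is the position within the codon
    return tuple(c / n_codons for c in counts)


if __name__ == "__main__":
    # Codons ATG, GCC, AAG: G or C at 1 of 3 first positions,
    # 1 of 3 second positions, and 3 of 3 third positions.
    print(gc_by_codon_position("ATGGCCAAG"))
```

Because most third-position changes are synonymous, GC3 is the position expected to track mutational bias most closely.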
Early indicator of gene transfer: species-specific genes have atypical base compositions
Sequenced genes present in Salmonella, but not in E. coli (circa 1990)
Gene       Potential function             Map     %G+C    CAI
cbi        Cobinamide synthesis           41      59.3    .233
fljA       Flagellar synthesis            56      40.9    .210
fljB       Flagellar synthesis            56      52.3    .216
inv/spa    Host recognition/invasion      59      45.5    .261
nanH       Sialidase                      20.5    40.9    .263
ORF        Unknown                        98      38.2    .296
pagC       Envelope protein               25      43.4    .274
phoN       Phosphatase                    96      46.5    .248
rfc        LPS synthesis                  31      33.5    .175
sinR       Transcriptional control        7       39.8    .218
tctABCD    Tricarboxylate transport       57      55.0    .278
pgtE       Phosphoglycerate transport     –       45.3    .277
The Salmonella genome is 52% G+C
Processes affecting bacterial genomes
• Mutational bias
• Lateral gene transfer (LGT)
• Deletions and gene loss
• Rearrangements
LGT can cause
• sporadic distribution of a gene among related taxa.
• mosaic organisms, with some sequences having ‘atypical’ features.
• distinct gene content for closely related bacteria.
• unexpectedly high similarity between sequences from different taxa.
• different phylogenies for different genes (“phylogenetic incongruence”).
Why is LGT important to bacterial evolution?
1. Genomes can no longer be viewed as constant.
2. May foil attempts to reconstruct the evolutionary relationships among organisms.
3. Conflicts with typical genetic and evolutionary processes.
4. Important in adaptation since acquired genes are often ecologically relevant.
(from Doolittle. 1999. Science)
Standard view of organismal relationships
Doolittle’s revised view of organismal relationships
With rampant LGT, all genomes are chimeric and classification might be difficult.
Methods for detecting LGT
• Abrupt shifts in sequence features (such as %GC, codon use) along chromosome
• Conflict in gene trees
• Gene content mapped onto phylogeny: genes present in some strains, absent in basal lineages
• Comparisons of gene order, associations with phage sequences…
Lateral Transfer as a Source of “Atypical” & “Species-specific” Genes
[Figure: gene presence (+) or absence (–) in E. coli and Salmonella; LGT from a species with distinct G+C]
One of the original ways that LGT was detected was through atypical sequence composition of recently acquired fragments.
Bacterial genomes have characteristic GC content but most contain distinct regions of deviant GC content
[Figure: base composition along the chromosome, shown as deviation from the genome mean (GC of sliding windows)]
Streptococcus mutans, 37% G+C (from Ajdic et al. 2002. PNAS)
Neisseria meningitidis, 52% G+C (from Tettelin et al. 2000. Science)
Massive gene exchange in microbial genomes
Inferred from atypical base composition
The amount of LGT is possibly underestimated by this compositional approach: genes from species of similar G+C content are not detected.
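The compositional approach can be sketched as a scan for windows whose G+C content departs from the genome-wide mean by more than some cutoff. The version below is a deliberately simple illustration, not the procedure of the cited studies; the function names, window size, step, and threshold are all assumptions. As noted above, a scan like this cannot detect DNA acquired from a donor with similar G+C content.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(seq, window=5000, step=1000, threshold=0.10):
    """Return (start, end, deviation) for windows with atypical G+C content.

    deviation is the window's G+C fraction minus the whole-sequence fraction;
    a window is reported when abs(deviation) >= threshold.
    """
    genome_gc = gc_fraction(seq)
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        deviation = gc_fraction(seq[start:start + window]) - genome_gc
        if abs(deviation) >= threshold:
            hits.append((start, start + window, deviation))
    return hits


if __name__ == "__main__":
    # Toy sequence: a G-rich background with one A-rich island.
    toy = "G" * 40 + "A" * 10 + "G" * 40
    print(atypical_windows(toy, window=10, step=10, threshold=0.5))
```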
Methods for detecting LGT
• Abrupt shifts in sequence features (such as %GC, codon use) along chromosome
• Conflicts among gene trees
• Gene content mapped onto phylogeny: genes present in some strains, absent in basal lineages
• Comparisons of gene order, associations with phage sequences…
Conflict of gene trees
[Figure: gene tree for orotate phosphoribosyltransferase; tips are bacterial, archaeal, and eukaryotic species, marked by domain (Bacteria, Archaea, Eukaryota)]
Phylogenetic Detection of LGT in Bacteria
Evidence based on best hits via BLAST: Aquifex aeolicus and Thermotoga maritima, both hyperthermophilic bacteria, each have a high percentage (20% & 25%, respectively) of ORFs most similar to archaeal genes.
Sometimes BLAST is not a good indicator of phylogeny
Do approaches based on phylogenetic incongruence and on sequence composition reveal the same cases of gene transfer?
[Figure panel, "Phylogenetic incongruence": gene tree for orotate phosphoribosyltransferase, with tips marked by domain (Bacteria, Archaea, Eukaryota)]
[Figure panel, "Atypical sequence composition": histogram of gene number against %GC3 for E. coli, with a 17.6% label]
Two aspects of the same phenomenon?
Comparing results from using gene content vs. sequence features for detecting acquired genes: overall agreement
[Figure: amount of acquired DNA (kb), with transfer recognized by compositional features, by taxa distribution, or by both criteria]
(adapted from Lawrence & Ochman 2002)
Detection of LGT
• Abrupt shifts in sequence features (such as %GC, codon use) along chromosome
• Conflicts among gene trees
• Gene content mapped onto phylogeny: genes present in some strains, absent in basal lineages
• Comparisons of gene order, associations with phage sequences…
Inferring LGT by comparing genome contents within a phylogenetic context
[Figure: phylogeny of three taxa, A, B, and C]
Acquisitions: +A –B –C
Losses: –A +B +C
Does gene content indicate the same gene transfers as phylogenetic incongruence?
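The presence/absence rule on this slide can be written out directly: a gene found only in A is scored as an acquisition in A, and a gene missing from A but present in both B and C is scored as a loss in A. The sketch below applies just that rule, treating A as the focal taxon; the function names and gene identifiers are hypothetical, and every other presence pattern is left unclassified rather than interpreted.

```python
def classify_gene(in_a, in_b, in_c):
    """Classify one gene by its presence (True) or absence (False) in A, B, C."""
    if in_a and not in_b and not in_c:
        return "acquisition in A"   # +A -B -C
    if not in_a and in_b and in_c:
        return "loss in A"          # -A +B +C
    return "unclassified"


def tally_events(gene_sets):
    """Count classifications over all genes seen in any of the three taxa.

    gene_sets: dict with keys "A", "B", "C" mapping to sets of gene-family
    identifiers (hypothetical labels).
    """
    counts = {"acquisition in A": 0, "loss in A": 0, "unclassified": 0}
    for gene in gene_sets["A"] | gene_sets["B"] | gene_sets["C"]:
        label = classify_gene(gene in gene_sets["A"],
                              gene in gene_sets["B"],
                              gene in gene_sets["C"])
        counts[label] += 1
    return counts


if __name__ == "__main__":
    toy = {"A": {"g1", "g2"}, "B": {"g2", "g3"}, "C": {"g2", "g3", "g4"}}
    print(tally_events(toy))
```

The same pattern can have other explanations depending on where A sits in the tree, so counts from a rule like this are best read as an upper-level summary rather than proof of individual events.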
Gene gain/loss vs. phylogenetic disruptions for genes present in all taxa
[Figure: panels of four-taxon comparisons, with labels including Buchnera, Escherichia, Salmonella, Yersinia, Vibrio, Enterobacteriaceae, Streptococcus, and Rhizobiaceae; each panel shows gene counts on the branches and a tally of gene-tree topologies]
Gene-tree tallies, one line per panel:
This topo: 1589; Other topo: 9; Unresolved: 116
This topo: 1603; Other topo: 3; Unresolved: 97
This topo: 237; Other topo: 0; Unresolved: 169
This topo: 923; Other topo: 0; Unresolved: 222
This topo: 93; Other topo: 28; Unresolved: 564
This topo: 616; Other topo: 34; Unresolved: 360
(adapted from Daubin et al. 2003 Science)
Gene gain/loss vs. phylogenetic disruptions for genes present in all taxa
Genes gained and lost based on gene sets of 3 taxa.
Trees derived from genes shared by all 4 taxa.
“This topo” = topology of 16S rRNA tree.
(adapted from Daubin et al. 2003 Science)
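One way to read the topology tallies is as the fraction of resolved gene trees that disagree with the 16S rRNA topology. The sketch below computes that fraction from the "This topo / Other topo / Unresolved" counts shown on the slide; the function name is an assumption, the counts are listed without panel labels, and excluding unresolved trees from the denominator is a choice made here rather than something stated in the source.

```python
# (this_topo, other_topo, unresolved) counts as shown on the slide.
TALLIES = [
    (1589, 9, 116),
    (1603, 3, 97),
    (237, 0, 169),
    (923, 0, 222),
    (93, 28, 564),
    (616, 34, 360),
]


def conflict_fraction(this_topo, other_topo, unresolved):
    """Fraction of resolved gene trees that support a non-rRNA topology.

    Unresolved trees are accepted as an argument for convenience but are
    excluded from the denominator.
    """
    resolved = this_topo + other_topo
    return other_topo / resolved if resolved else 0.0


if __name__ == "__main__":
    for tally in TALLIES:
        print(tally, round(conflict_fraction(*tally), 4))
```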
The cohesion of bacterial genomes
[Figure: triangular plot with corners for rRNA topology, other topologies, and non-resolving genes (each scaled 0 to 1), with an LGT label; interspecies and intraspecies comparisons are plotted for groups including Salmonella, Escherichia, Staphylococcus, Enterobacteria, Rhizobia, Buchnera, Staphylococcus aureus, Chlamydophila pneumoniae, Streptococcus, and Escherichia coli]
(adapted from Daubin et al. 2003 Science)
Conclusion: LGT is not a usual basis for phylogenetic incongruencies
No clear evidence for rampant LGT of orthologous genes
(Frequencies of LGT are low; <5% for interspecies comparisons) But it does occur for a small proportion of genes.
This implies two classes of genes in bacterial genomes
1. Orthologous genes
- Generally resistant to LGT, used for phylogenetics
2. Recently acquired genes
- Most of unknown function (many annotated as phage or secretion proteins)
- Most are orphans (no known homologs, phylogenetically uninformative)
- Generally, more A+T-rich than host genome
- Often prophage-associated in some genomes
- Not used for phylogenetics much due to erratic distribution
[Figure: phylogeny with tips including E. coli, S. typhimurium, Y. pestis KIM, Y. pestis CO92, B. aphidicola, W. brevipalpis, H. influenzae, P. multocida, V. cholerae, P. aeruginosa, X. fastidiosa, X. campestris, and X. axonopodis]
New genes arriving; some persist in subclade
Distribution and Age of Acquired Genes in E. coli
(adapted from Lawrence & Ochman 1998)
Processes affecting bacterial genomes
• Mutational bias
• Lateral gene transfer
• Deletions and gene loss
• Rearrangements
• Duplication
Role of deletion in Bacteria
In the face of so much gene acquisition, why are bacterial genomes so small? Close packing of bacterial genomes implies that inactivated genes are removed relatively quickly. Deletional bias in mutation removes non-functional DNA. Selection is required to retain DNA in a genome.
Dynamics of Bacterial Genome Size
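[Figure: bacterial genome size as a balance between processes that add DNA and processes that remove it]
Adding DNA: gene acquisition; duplication
Removing DNA: deletions of fragments spanning multiple genes; gene inactivation by point mutation; pseudogene erosion by small deletions
Chronic pathogens: low rates of DNA acquisition; a decrease in selection coefficient (s) and effective population size (Ne) increases gene loss through genetic drift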
A. Mira et al. 2001 Trends in Genetics
Processes affecting bacterial genomes
• Mutational bias
• Lateral gene transfer
• Deletions and gene loss
• Rearrangements
• Duplication