COMPARATIVE GENOMICS: GENOME-WIDE ANALYSIS IN METAZOAN EUKARYOTES
Mao-Feng Ger 02/08/2006
• Completely sequenced genomes can be used for large-scale comparative analysis
• Effective methods for handling very large data sets are the objective
• Main areas in comparative genomics:
– Whole-genome alignment
– Gene prediction
– Regulatory-region prediction
• Introduction
• Whole-genome alignments
• Gene prediction
• Finding regulatory regions
• Conclusion
Sequenced genomes
• In the NCBI Genome Project database, as of 02/05, the number of genome projects is:
– Archaea: 51 (complete, draft assembly, in progress)
– Bacteria: 821 (complete, draft assembly, in progress)
– Eukaryota: 391 (complete, draft assembly, in progress, organelles)
• Among the eukaryotes, 174 are metazoans.
Comparative genomics
• Presumption: two genomes derive from a common ancestor, so every base pair reflects the ancestral genome plus the action of evolution
• Evolution: mutation + selection
– Can be represented by a rate matrix
• Selection:
– Negative selection
– Neutral (no selection)
– Positive selection
BLOSUM and PAM rate matrices
• PAM (Percent Accepted Mutation)
– Derived from a set of proteins that are at least 85% identical
– Numeric suffix: the number of times the PAM1 matrix is multiplied by itself
• BLOSUM (BLOcks SUbstitution Matrix)
– More empirical and derived from a larger dataset
– Constructed by extracting ungapped segments (blocks) from a set of aligned protein families
– Numeric suffix x: sequences sharing at least x% identity within a block are clustered before substitutions are counted
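The "self multiplication" behind the PAM suffix can be sketched with a toy example. This is a minimal illustration, assuming a made-up two-letter alphabet and a made-up 1% change per unit of distance; real PAM matrices cover 20 amino acids and are published as log-odds scores derived from the multiplied probability matrix. The names `mat_mul`, `mat_pow` and `TOY_PAM1` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical two-letter alphabet. Each row gives the probability that the
# row letter is replaced by the column letter after one unit of distance.
# The 1% change per unit is an invented figure, not a real PAM1 value.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

# "PAM250" in this toy setting: the one-step matrix multiplied by itself 250 times.
toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, and the diagonal shrinks toward the background frequency as the suffix grows, which is why high-numbered PAM matrices suit distant comparisons.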
Difficulties in aligning genomes
• We know so little about evolutionary processes that it is better to focus on functional sequences
• Differences in genome size and in how complete each assembly is make whole-genome alignment difficult
• More and more programs now handle large-scale comparisons, and biologists need to understand these approaches
Precomputed alignments
• Several groups have made large cross-species comparisons available
– UC Santa Cruz/Penn State (translated BLAT or BLASTZ)
– Berkeley Genome Pipeline (BLAT/AVID)
– Ensembl (Phusion/Blastn)
Whole-genome alignment
Which genomes to align
• Sufficient similarity between genomes enables easy identification of homologous regions
– Example: DNA alignment between human and mouse led to the discovery of new genes and gene regulatory regions
• Alignment between human and pufferfish, though harder, is still feasible
Comparing genomes at protein level
• Distantly related genomes can be difficult to align at the nucleotide level
• Aligning at the protein level may lose information that helps find new genes and regulatory sequences
• It is better to start with closely related genomes
Alignment strategy
• Dynamic programming makes alignment tractable as long as a few rules are followed
• Needleman-Wunsch: aligns sequences globally
• Smith-Waterman: aligns sequences locally
– No negative scores: every cell is at least 0
– Traceback stops at a cell scoring 0
• Limitations, however:
– Cannot handle rearrangements such as inversion, duplication, or translocation
– For long sequences (>10,000 bp), very expensive in time and memory
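The two Smith-Waterman rules above (no cell below 0, traceback stops at 0) can be sketched in a few lines. This is a minimal illustration with a linear gap penalty; the scoring values `match=2`, `mismatch=-1`, `gap=-2` are arbitrary assumptions, not parameters from any of the tools discussed here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the max(0, ...) is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # match/mismatch
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell scoring 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG"))
```

The matrix has one cell per pair of positions, so time and memory grow with the product of the two lengths, which is the cost problem noted for long sequences.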
Seeding strategy
• Correct alignments tend to contain stretches of ungapped matches
• So, first find a set of ungapped matches (seeds)
• Then extend gapped alignments from where the seeds occur
• Some loss in sensitivity, rewarded by lower time and memory usage
• Consecutive model and two weighted-spaced model
• Simply put:
– Seeding
– Seeds are used as nucleation points for extension
– Dynamic programming produces the gapped alignments
• In this review, we focus on 4 whole-genome alignment methods:
– BLASTZ
– BLAT/AVID
– BLAT/LAGAN
– WABA
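The first two steps of the seeding strategy (find seeds, then use them as nucleation points) can be sketched as follows. This is a simplified illustration of the consecutive model only: seeds are exact k-mer matches found through a hash index, and the extension stops at the first mismatch, whereas real aligners typically tolerate mismatches during extension. The function names and the choice of `k` are hypothetical, and the final dynamic-programming step is left out.

```python
from collections import defaultdict


def find_seeds(query, target, k=11):
    """Return (query_pos, target_pos) pairs where k consecutive bases match exactly."""
    # Index every k-mer of the target by its start position.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    # Look up every k-mer of the query in that index.
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, target, i, j, k):
    """Grow a seed left and right while bases match.

    Returns (query_start, target_start, length) of the ungapped block.
    """
    start_q, start_t = i, j
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = i + k, j + k
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q


query = "CCACGTTGCAGG"
target = "TTACGTTGCATT"
seeds = find_seeds(query, target, k=4)
blocks = {extend_ungapped(query, target, i, j, 4) for i, j in seeds}
print(seeds)
print(blocks)
```

Several overlapping seeds collapse into a single ungapped block once extended, which is why only the k-mer lookup, not the full dynamic-programming matrix, has to touch the whole sequence.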
BLASTZ
• Local aligners, like BLASTZ and BLAT, are highly sensitive but less specific
• BLASTZ applies several methods to increase sensitivity and specificity
– Seeding: instead of the 11-consecutive model, the new BLASTZ uses a two weighted-spaced model (12 of 19 positions, tolerating a transition in one of the 12)
– Extends the seeds without gaps
– Extends gapped alignments, down-weighting low-complexity matches first
• In the mouse-human case, it uses a specific scoring matrix derived from known mouse-human homologous regions
• A post-processing step is needed to sort out the most significant orthologues among multiple matches
• Overall, BLASTZ covered 98% of the coding regions in the mouse and human genomes, indicating that it is highly sensitive for identifying well-conserved regions
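The "12 of 19, tolerating one transition" seed can be sketched as a test on two 19-base windows. This is an illustration only: the pattern string below is the 12-of-19 layout commonly quoted for BLASTZ, but it is not given in these slides and should be treated as an assumption, as should the function name `spaced_seed_hit`.

```python
# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Assumed 12-of-19 layout: '1' = position must match, '0' = position is ignored.
SEED_12_OF_19 = "1110100110010101111"


def spaced_seed_hit(x, y, pattern=SEED_12_OF_19, max_transitions=1):
    """Return True if windows x and y match under the spaced seed.

    Every '1' position must be identical, except that up to max_transitions
    of them may differ by a transition. '0' positions are not inspected.
    """
    if not (len(x) == len(y) == len(pattern)):
        raise ValueError("windows and pattern must have the same length")
    transitions = 0
    for cx, cy, p in zip(x, y, pattern):
        if p == "0" or cx == cy:
            continue
        if (cx, cy) in TRANSITIONS:
            transitions += 1
            if transitions > max_transitions:
                return False
        else:
            return False  # a transversion at a required position
    return True


window = "ACGTACGTACGTACGTACG"
print(spaced_seed_hit(window, window))                # identical windows
print(spaced_seed_hit(window, "G" + window[1:]))      # one A->G transition
print(spaced_seed_hit(window, "C" + window[1:]))      # one A->C transversion
```

Ignoring seven of the nineteen positions lets a seed survive scattered mismatches that would break an 11-base consecutive seed, which is where the gain in sensitivity comes from.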
BLAT
• A local aligner:
– Untranslated mode: designed to align cDNA to genomic sequence; less effective below 90% identity
– Translated mode: more effective for genome comparison. With repeats and low-complexity sequence masked, the output is faster and cleaner
• Produces a set of ungapped alignments: fast, at the expense of overall sensitivity
• Used in the human-pufferfish genome comparison
Global alignment
• 3 steps:
– 1: find the maximal repeated regions
– 2: select clean matches first, then repeat matches
– Apply steps 1 and 2 recursively
– 3: regions < 4 kb: use the NW algorithm; regions > 4 kb: no significant alignment
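The recursive anchoring idea in these steps can be sketched as below. This is a stand-in, not the actual method: it uses Python's `difflib.SequenceMatcher.find_longest_match` to pick the single longest exact match as an anchor and then recurses on the flanks, whereas the real tool finds and filters many matches at once. The function name `anchor_chain` and the `min_anchor` cut-off are hypothetical.

```python
from difflib import SequenceMatcher


def anchor_chain(a, b, min_anchor=4, a_off=0, b_off=0):
    """Recursively anchor two sequences on their longest exact match.

    Returns anchors as (a_start, b_start, length) tuples in left-to-right
    order, with coordinates relative to the original sequences.
    """
    if not a or not b:
        return []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_anchor:
        # No usable anchor left. A full implementation would pass short
        # remaining segments to Needleman-Wunsch at this point.
        return []
    # Recurse on the sequence to the left and to the right of the anchor.
    left = anchor_chain(a[:m.a], b[:m.b], min_anchor, a_off, b_off)
    right = anchor_chain(a[m.a + m.size:], b[m.b + m.size:], min_anchor,
                         a_off + m.a + m.size, b_off + m.b + m.size)
    return left + [(a_off + m.a, b_off + m.b, m.size)] + right


seq_a = "ACGTACGTTTGGCCGGCC"
seq_b = "ACGTACGTAGGCCGGCC"
print(anchor_chain(seq_a, seq_b))
```

Because each anchor splits the problem into a left part and a right part, the anchors are always collinear, which is also why this kind of global strategy cannot represent inversions or translocations.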
AVID
• Assumption: the sequences are strictly homologous, with no gene duplication, inversion, or translocation
• When applied to a whole genome, it needs a preprocessing step to identify syntenic regions
LAGAN
• The advantage of LAGAN over AVID is that it can align larger sequences
– Lower memory requirement
– Different matching algorithm in step 1 (exact matches are not required)
• In conjunction with BLAT, it has been applied to rat-human and rat-mouse comparisons
MLAGAN
• An extension of LAGAN
• Can perform multiple alignment
• Aligns closely related genomes first, then incorporates the others in order of phylogenetic distance
WABA
• Takes genetic code degeneracy into account
• Seeding step: nucleotide-based, using a two weighted-spaced rule (6 of 8) that allows the third codon position to mismatch
• No extension step; instead, compatible seeds are grouped to define homologous regions
Biological correctness
• There is no single best way to do alignments
– We do not know enough about evolution to say which method is superior
– Different algorithms are tuned to different genome comparisons (e.g. BLASTZ for human-mouse and WABA for C. elegans-C. briggsae)
– Purposes are different
• Align as much as possible, regardless of selection (e.g. AVID, LAGAN, BLASTZ)
• Identify conserved regions that are under selection (e.g. BLAT, WABA)
• Most programs aim to maximize the number of homologous base pairs, while biologists are interested in regions conserved for a function
• For example, in the mouse-human alignment, 40% is alignable, but only 6% is under selection
• To make things worse, the substitution rate varies across the genome
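The gap between "alignable" and "conserved" can be made concrete by scoring identity in windows along an existing pairwise alignment. This is a rough sketch under stated assumptions: the window size, step and 80% threshold are arbitrary, the function names are hypothetical, and a fixed identity cut-off is only a crude proxy for selection precisely because the substitution rate varies across the genome.

```python
def window_identity(aln_a, aln_b, window=10, step=5):
    """Return [(start, identity)] for successive windows of a pairwise alignment.

    aln_a and aln_b are the two aligned rows (same length, '-' for gaps).
    Columns containing a gap count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned rows must have the same length")
    out = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        same = sum(1 for x, y in cols if x == y and x != "-")
        out.append((start, same / window))
    return out


def conserved_windows(aln_a, aln_b, threshold=0.8, window=10, step=5):
    """Return the start positions of windows whose identity reaches the threshold."""
    return [start for start, ident in window_identity(aln_a, aln_b, window, step)
            if ident >= threshold]


row_a = "ACGTACGTAC" + "AAAAAAAAAA"
row_b = "ACGTACGTAC" + "CCCCCCCCCC"
print(window_identity(row_a, row_b, window=10, step=10))
print(conserved_windows(row_a, row_b, threshold=0.8, window=10, step=10))
```

Both halves of this toy alignment are "aligned", but only the first would be flagged as conserved, mirroring the difference between the alignable fraction and the fraction under selection.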
Defining gene structure
• Still a challenge because of the poor signal-to-noise ratio
• Comparison between closely related genomes can provide additional information (“dual genome” gene predictors)
• Different programs make different assumptions, so users need to know their strengths and limitations
Dual genome gene predictor
• Models a specific type of negative selection
• Assumption: in alignments, most differences are neutral, and regions with few mutations are conserved
• Combines other information, such as splicing signals and the wobble effect, to get a better model
• Can be subdivided into 3 classes:
– Pair-HMM: takes a mathematical approach to determine the gene structure and the alignment jointly
– Informant approaches: fix the alignment and use it to improve the gene prediction
– Exon-finding approaches: try to demarcate exons without splicing them together
Pair-HMM
• An HMM can be used to predict gene structure in a single genome
• A pair-HMM finds the most likely path to have generated both sequences, providing the alignment as well as the gene prediction
• Works on two sets of orthologous genes
• Two pair-HMM approaches:
– SLAM
– DoubleScan
• Both need parameters optimized for a specific species and for better efficiency
• SLAM uses the AVID method for a rough alignment, while DoubleScan uses BLAST
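The "most likely path" idea can be sketched with the Viterbi algorithm on a single sequence. This is a toy illustration, not SLAM or DoubleScan: it has only two hypothetical states, "C" (coding-like, GC-rich) and "N" (noncoding-like, AT-rich), with invented probabilities. Real gene finders use many more states (exons, introns, splice sites), and a pair-HMM emits aligned pairs of bases from two sequences instead of single bases.

```python
import math

# Invented toy parameters; none of these values come from a real gene finder.
STATES = ("C", "N")
START_P = {"C": 0.5, "N": 0.5}
TRANS_P = {"C": {"C": 0.9, "N": 0.1},
           "N": {"C": 0.1, "N": 0.9}}
EMIT_P = {"C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq, computed in log space."""
    # Scores for the first symbol.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
               for s in states}]
    back = [{}]
    # Dynamic programming over the remaining symbols.
    for ch in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            col[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][ch]))
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    return path[::-1]


dna = "ATATATATGCGCGCGCATATATAT"
print("".join(viterbi(dna, STATES, START_P, TRANS_P, EMIT_P)))
```

The labelling of each base falls out of the single best path through the model, which is the same mechanism that lets a pair-HMM return an alignment and a gene structure together.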
Informant approaches
• Use only one sequence to predict gene structure; the other sequence only supplies additional information through its alignment
• Can work not only with assembled genome sequence but also with other inputs, such as unassembled reads
• 3 methods are available: TwinScan, SGP-2, GenomeScan
• Need precise parameters, so each has its own alignment method (often BLAST)
Exon prediction
• A carefully parameterized TBLASTX method designed to provide specific exon predictions from Tetraodon
• Sacrifices a certain amount of sensitivity for high specificity
Which method to use
• A particularly successful approach combined informant methods with some simple criteria
• It produces strong predictions in mammalian genomes
Finding regulatory regions
• Called phylogenetic footprinting (by analogy with DNase footprinting)
• Functionally important regions accumulate fewer mutations
• These cis-regulatory motifs can be determined by:
– Finding common motifs in orthologous sequences
– Aligning orthologous sequences first, then identifying common regions
• Previously known motifs might help
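The first route, finding common motifs in orthologous sequences, can be sketched in its simplest form as a search for k-mers present in every sequence. This is a deliberately naive illustration: the function name `shared_kmers` and the example sequences are made up, and real motif finders tolerate mismatches and score motifs statistically rather than requiring exact shared words.

```python
def shared_kmers(sequences, k=6):
    """Return the set of k-mers that occur in every one of the sequences."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in sequences]
    if not kmer_sets:
        return set()
    return set.intersection(*kmer_sets)


# Invented upstream fragments standing in for orthologous noncoding regions.
upstream = [
    "CCGTATAAAGCC",
    "TTTATAAAGGA",
    "GATATAAACTT",
]
print(shared_kmers(upstream, k=6))
```

No alignment is needed for this route, so it still works when the motif sits at different offsets in each sequence; the cost is that short words also match by chance, which is why more sequences improve the result.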
Which region to use
• 5’ and 3’ flanking regions as well as intronic sequences
• Difficulties in finding regulatory regions:
– The 5’ end is often the least well-defined, so experimental evidence of promoters is needed
– Enhancers can be several kilobases away
• In addition to experimental evidence, guesswork and systematic comparison are needed to find potential cis-regulatory regions
• Two orthologous genes might have very different cis-regulatory elements, as paralogous genes do
• How evolution affects cis-regulatory motifs is still poorly understood
• Intra-mammal comparisons show a large amount of non-functional conservation, while across vertebrates conservation is hard to detect
Evolutionary issues
• Neutral drift can destroy or create cis-regulatory sites at a certain rate
• However, the expression pattern can remain largely or entirely unchanged
• This raises the possibility of compensatory mutations
• Recently, some researchers have tried to distinguish regulatory regions from neutrally evolving DNA using genome sequence alignments
• Motif-finding programs do not consider the phylogenetic relationship between homologous sequences
• This can be overcome by identifying DNA motifs that evolve at a slower rate than the surrounding sequence
• All motif-finding techniques work better with increasing amounts of sequence
Motif overrepresentation
Alignment for finding regulatory region
• Align regions of homology in the noncoding regions near the orthologous genes
• A growing number of studies show that cis-regulatory elements are found in non-conserved regions
Conclusion
• With more genomes being sequenced, we can investigate the effects of evolution on specific regions across entire genomes
• With precomputed data, users can focus on the biology
• 3 advances need to be made:
– More genomes are needed to improve power
– Power can also be improved by knowing how negative selection works under different functional constraints
– Knowing more about positive selection