COMPARATIVE GENOMICS: GENOME-WIDE ANALYSIS IN METAZOAN EUKARYOTES

Mao-Feng Ger
02/08/2006

• Completely sequenced genomes can be used for large-scale comparative analysis
• Effective methods for handling enormous data sets are the objective
• Main areas in comparative genomics:
  – Whole-genome alignment
  – Gene prediction
  – Regulatory-region prediction

Outline
• Introduction
• Whole-genome alignments
• Gene prediction
• Finding regulatory regions
• Conclusion

INTRODUCTION

Sequenced genomes
• In the NCBI Genome Project database, as of 02/05, the number of genome projects was:
  – Archaea: 51 (complete, draft assembly, in progress)
  – Bacteria: 821 (complete, draft assembly, in progress)
  – Eukaryota: 391 (complete, draft assembly, in progress, organelles)
• Of the eukaryotes, 174 are metazoans.

Comparative genomics
• Premise: two genomes descend from a common ancestor, so every base pair reflects the ancestral genome plus the action of evolution
• Evolution = mutation + selection
  – Can be represented by a rate matrix
• Selection:
  – Negative selection
  – Neutral (no selective pressure)
  – Positive selection

BLOSUM and PAM rate matrices
• PAM (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutation)
  – Derived from a set of proteins that are at least 85% identical
  – The numeric suffix is the number of times the PAM1 matrix is multiplied by itself (PAM250 is PAM1 raised to the 250th power)
• BLOSUM (BLOcks SUbstitution Matrix)
  – More empirical, and derived from a larger dataset
  – Constructed by extracting ungapped segments (blocks) from a set of aligned protein families
  – The numeric suffix x is a clustering threshold: sequences within a block that are at least x% identical are clustered and counted as one

Difficulties in aligning genomes
• We know so little about evolutionary processes that it is better to focus on functional sequences
• Differences in genome size and in how complete each assembly is make whole-genome alignment difficult
• More and more programs now handle large-scale comparisons, and biologists need to understand these approaches.
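The PAM slide says the numeric suffix counts self-multiplications of the matrix. The sketch below illustrates that idea only. It assumes a hypothetical 3-letter alphabet and an invented one-step matrix `TOY_PAM1` (real PAM matrices are 20 x 20 and are usually published as log-odds scores derived from such probabilities). The helper names `mat_mul` and `mat_pow` are also illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise square matrix m to the n-th power by repeated squaring."""
    size = len(m)
    # Start from the identity matrix.
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical one-step substitution probabilities (assumption, not real PAM1).
# Each row sums to 1 and about 1% of positions change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

# "PAM250" for the toy alphabet: the one-step matrix applied 250 times.
toy_pam250 = mat_pow(TOY_PAM1, 250)
```

After 250 steps each row still sums to 1, but the diagonal has dropped well below 0.99, which is why high-numbered PAM matrices suit distantly related sequences.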
WHOLE-GENOME ALIGNMENTS

Precomputed alignments
• Several groups have made large cross-species comparisons available
  – UC Santa Cruz/Penn State (translated BLAT or BLASTZ)
  – Berkeley Genome Pipeline (BLAT/AVID)
  – Ensembl (Phusion/Blastn)

Which genome to align
• Sufficient similarity between genomes enables easy identification of homologous regions
  – Example: DNA alignment between human and mouse led to the discovery of new genes and gene regulatory regions
• Alignment between human and pufferfish, though harder, is still feasible

Comparing genomes at protein level
• Distantly related genomes may be hard to align at the nucleotide level
• Aligning at the protein level may lose information that helps find new genes and regulatory sequences
• It is better to start from closely related genomes

Alignment strategy
• Dynamic programming makes alignment tractable as long as a few rules are followed
• Needleman-Wunsch: aligns sequences globally
• Smith-Waterman: aligns sequences locally
  – No negative scores: every cell is floored at 0
  – Traceback stops when it reaches a 0
• Limitations:
  – Cannot handle rearrangements such as inversion, duplication, translocation
  – For long sequences (>10,000 bp), very expensive in time and memory

Seeding strategy
• Correct alignments contain stretches of ungapped matches
• So first find a set of ungapped matches (seeds)
• Then extend gapped alignments from where the seeds occur
• Some loss in sensitivity, but a large saving in time and memory
• Seed models: consecutive and weighted-spaced
• Simply put:
  – Seeding
  – Seeds used as nucleation points for extension
  – Dynamic programming to produce gapped alignments
• This review focuses on 4 whole-genome alignment methods:
  – BLASTZ
  – BLAT/AVID
  – BLAT/LAGAN
  – WABA

BLASTZ
• Local aligners, like BLASTZ and BLAT, are highly sensitive but less specific
• BLASTZ applies several methods to increase sensitivity and specificity
  – Seeding: instead of 11 consecutive matches, newer BLASTZ uses a weighted-spaced seed (12 matching positions out of 19, tolerating a transition at one of the 12)
  – Extends the seeds without gaps
  – Extends gapped alignments, down-weighting low-complexity matches first
• For the mouse-human alignment, it uses a scoring matrix derived from known mouse-human homologous regions
• A post-processing step is needed to pick the most significant orthologues among multiple matches
• Overall, BLASTZ covered 98% of the coding regions in the mouse and human genomes, indicating that it is highly sensitive for identifying well-conserved regions

BLAT
• A local aligner:
  – Untranslated mode: designed to align cDNA to genomic sequence, and less effective below 90% identity
  – Translated mode: more effective for genome comparison. With repeats and low-complexity sequence masked, it runs faster and gives cleaner output
• Produces a set of ungapped alignments: fast, at the expense of overall sensitivity
• Used in the human-pufferfish genome comparison

Global alignment
• 3 steps:
  – 1: find the maximal repeated regions
  – 2: use clean matches first, then repeat matches
  – Apply steps 1 and 2 recursively
  – 3: for regions < 4 kb, use the Needleman-Wunsch algorithm; for regions > 4 kb, report no significant alignment

AVID
• Assumption: the sequences are strictly homologous, with no gene duplication, inversion, or translocation
• When applied to a whole genome, it needs a preprocessing step to identify syntenic regions

LAGAN
• The advantage of LAGAN over AVID is that it can align larger sequences
  – Lower memory requirement
  – Different matching algorithm in step 1 (exact matches are not required)
• In conjunction with BLAT, it has been applied to rat-human and rat-mouse comparisons

MLAGAN
• An extension of LAGAN
• Can do multiple alignment
• Aligns closely related genomes first, then incorporates the others in order of phylogenetic distance

WABA
• Takes genetic code degeneracy into account
• Seeding step: nucleotide-based, using a weighted-spaced seed of 6 out of 8, which allows the third codon position to mismatch
• No extension step; instead, suitable seeds are grouped to define homologous regions

Biological correctness
• There is no single best way to do alignments
  – We understand evolution too poorly to say which method is superior
  – Different algorithms are tuned to different genome comparisons (e.g., BLASTZ for human-mouse and WABA for C. elegans–C. briggsae)
  – Purposes differ:
    • Align as much as possible, regardless of selection (e.g., AVID, LAGAN, BLASTZ)
    • Identify conserved regions that are under selection (e.g., BLAT, WABA)
• Most programs aim to maximize the number of homologous base pairs, while biologists are interested in regions conserved for a function
• For example, in the mouse-human alignment, 40% is alignable, but only 6% is under selection
• To make things worse, the substitution rate varies across the genome.
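The seeding strategy described in this section (find seeds, then extend) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of BLASTZ, BLAT, or WABA. The function names, the toy pattern `"1101"`, the scoring values, and the X-drop cutoff are all hypothetical, and the transition tolerance that BLASTZ adds is not modelled. A pattern is a string in which `1` marks positions that must match and `0` marks positions that may differ.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Letters of seq at the '1' positions of pattern, anchored at start."""
    return "".join(seq[start + i] for i, c in enumerate(pattern) if c == "1")


def find_seed_hits(a, b, pattern):
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    for j in range(len(b) - span + 1):
        index[seed_key(b, j, pattern)].append(j)
    hits = []
    for i in range(len(a) - span + 1):
        for j in index.get(seed_key(a, i, pattern), []):
            hits.append((i, j))
    return hits


def _extend_one_way(a, b, i, j, step, match, mismatch, xdrop):
    """Walk from (i, j) in direction step; return (best gain, length at best)."""
    best = run = length = best_len = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        run += match if a[i] == b[j] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > xdrop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_ungapped(a, b, i, j, span, match=1, mismatch=-1, xdrop=2):
    """Extend a seed hit both ways without gaps; return (score, a_start, a_end)."""
    seed = sum(match if a[i + k] == b[j + k] else mismatch for k in range(span))
    right, right_len = _extend_one_way(a, b, i + span, j + span, 1,
                                       match, mismatch, xdrop)
    left, left_len = _extend_one_way(a, b, i - 1, j - 1, -1,
                                     match, mismatch, xdrop)
    return seed + right + left, i - left_len, i + span + right_len


hits_spaced = find_seed_hits("ACGTACGT", "ACGAACGT", "1101")
hits_consecutive = find_seed_hits("ACGTACGT", "ACGAACGT", "1111")
```

With these toy sequences the spaced pattern reports the hit `(1, 1)`, which spans the single mismatch, while the consecutive pattern `"1111"` misses it. A gapped dynamic-programming step would then start from the extended hits.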
GENE PREDICTION

Defining gene structure
• Still a challenge because of the poor signal-to-noise ratio
• Comparison between closely related genomes can provide additional information ("dual-genome" gene predictors)
• Different programs make different assumptions, so users need to know their strengths and limitations

Dual-genome gene predictors
• Model a specific type of negative selection
• Assumption: in alignments, most differences are neutral, and regions with few mutations are conserved
• Combine other information, such as splicing signals and the wobble effect, to get a better model
• Can be subdivided into 3 classes:
  – Pair-HMM: takes a mathematical approach to determine the joint gene structure and alignment
  – Informant approaches: fix the alignment and use it to improve gene prediction
  – Exon-finding approaches: try to delimit the exons without splicing them together

Pair-HMM
• An HMM can be used to predict gene structure in a single genome
• A pair-HMM finds the most likely path to have generated both sequences, and provides the alignment as well as the gene prediction
• Works on two sequences containing orthologous genes
• Two pair-HMM approaches:
  – SLAM
  – DoubleScan
• Both need parameters optimized for a specific species and for better efficiency
• SLAM uses the AVID method for a rough alignment, while DoubleScan uses BLAST

Informant approaches
• Predict gene structure in only one sequence; the other sequence just supplies additional information through its alignment
• Can work not only on assembled genome sequence but also on other inputs, such as unassembled reads
• 3 methods are available: TwinScan, SGP-2, GenomeScan
• Need precise parameters, so each has its own alignment method (often BLAST)

Exon prediction
• A carefully parameterized TBLASTX method designed to give specific exon predictions from Tetraodon comparisons
• Sacrifices some sensitivity for high specificity

Which method to use
• A particularly successful approach has been informant methods combined with some simple criteria
• Produces strong predictions in mammalian genomes

FINDING REGULATORY REGIONS

• Called phylogenetic footprinting (by analogy with DNase footprinting)
• Functionally important regions accumulate fewer mutations
• These cis-regulatory motifs can be determined by:
  – Finding common motifs in orthologous sequences
  – Aligning orthologous sequences first, then identifying common regions
• Previously known motifs may help

Which region to use
• 5' and 3' flanking regions as well as intronic sequences
• Difficulties in finding regulatory regions:
  – The 5' end is often the least well-defined, so experimental evidence of promoters is needed
  – Enhancers can be several kilobases away
• In addition to experimental evidence, guesswork and systematic comparison are needed to find potential cis-regulatory regions
• Two orthologous genes might have very different cis-regulatory elements, as can paralogous genes
• How evolution affects cis-regulatory motifs is still poorly understood
• Intra-mammal comparisons show a large amount of non-functional conservation, while in intra-vertebrate comparisons conservation is hard to detect

Evolutionary issues
• Neutral drift can destroy or create cis-regulatory sites at a certain rate
• However, the expression pattern may change little or not at all
• This raises the possibility of compensatory mutations
• Recently, some researchers have tried to distinguish regulatory regions from neutrally evolving DNA using genome sequence alignments
• Motif-finding programs do not consider the phylogenetic relationships between homologous sequences
• This can be overcome by identifying DNA motifs that evolve more slowly than the surrounding sequence
• All motif-finding techniques work better with more sequences

Motif overrepresentation

Alignment for finding regulatory regions
• Aligning regions of homology in the noncoding regions near orthologous genes
• A growing number of studies show that cis-regulatory elements can lie in non-conserved regions

CONCLUSION

• With more genomes sequenced, we can investigate the effects of evolution on specific regions across entire genomes
• With precomputed data, users can focus on the biology
• 3 advances are needed:
  – More genomes, to improve statistical power
  – A better understanding of how negative selection works under different functional constraints, which would also improve power
  – More knowledge about positive selection
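As a closing illustration of the align-first route to phylogenetic footprinting from the regulatory-regions slides, the sketch below scans a pairwise alignment with a sliding window and reports windows with high identity. It is a deliberately naive sketch: the function names, window size, and threshold are assumptions. It does not compare against a neutral substitution rate, which the slides note varies across the genome.

```python
def window_identity(aln_a, aln_b, window):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for a column where both rows carry the same base, 0 otherwise.
    same = [1 if x == y and x != "-" else 0 for x, y in zip(aln_a, aln_b)]
    return [sum(same[s:s + window]) / window
            for s in range(len(same) - window + 1)]


def conserved_windows(aln_a, aln_b, window, threshold):
    """Start columns of windows whose identity is at least threshold."""
    fractions = window_identity(aln_a, aln_b, window)
    return [s for s, frac in enumerate(fractions) if frac >= threshold]


# Toy aligned rows (hypothetical data): conserved at the left, diverged at the right.
row_a = "ACGTACGTTT"
row_b = "ACGTACAAGT"
candidates = conserved_windows(row_a, row_b, window=4, threshold=0.75)
```

Here `candidates` holds the start columns of the windows in the conserved left part of the toy alignment. On real data the threshold would need calibrating against how much non-functional conservation the chosen species pair shows.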
