Zebra Finch Seg Dup Analysis

1. Genome
2. Parameters for Pipeline
3. Analysis

Zebra Finch Genome
• The genome (Jul. 2008 assembly of the zebra finch genome, taeGut1, WUSTL v3.2.4) was downloaded from UCSC. This assembly was produced by the Genome Sequencing Center at the Washington University in St. Louis (WUSTL) School of Medicine.
• The zebra finch DNA used for the shotgun sequencing and for the BAC and cosmid libraries was derived from a single male domesticated zebra finch. The initial assembly was generated using PCAP with approximately 6X coverage. About 1.0 Gb of the 1.2-Gb genome has been ordered and oriented along 33 chromosomes and one linkage group. The chromosome names are based on their homologous chromosomes in the chicken (Gallus gallus).
• Total genome size (gapped): 1,233,186,341 bp

Seg Dup detection pipelines
• WGAC (whole-genome assembly comparison) detects Seg Dups in genomic assemblies by looking for homologous pairs (>1 kb in length, >90% identity).
• WSSD (whole-genome shotgun sequence detection) detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads: depth of coverage > average + 3 SD. Done by Ginger Cheng.

Parameters and notes for WGAC pipeline
• Repeats
– The sequences downloaded from UCSC have been soft-masked.
   • UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata'
– The repeat coordinates were reverse-generated from the soft-masked sequences.
• Blast parsing seeds in WGAC pipeline:
– The seed size is 250 bp.

Result from WGAC Pipeline
• Total WGAC pairs detected (>1 kb and >90% identity): 198180
• Inter chromosome pairs: 81415
• Intra chromosome pairs: 116742
• Chromosome inter and intra (excluding chr_random and chrUn): 26510
• ChrUn inter and intra: 172670
• Total WGAC NR (bp): 384,501,909
• Total genome size (with gap): 1,233,186,341

Notes:
• The NR space of WGAC is about 31% of the zebra finch genome, which is too high. This is due either to incomplete repeat masking or to redundant sequences in chr_random and chrUn.
• 87% of the total WGAC pairs (inter and intra) have at least one sequence of the pair on chrUn. This indicates that a large portion of the false-positive WGAC pairs comes from chrUn.

General analysis of WGAC length and identity distribution
1. Length distribution peaks at 1-2 kb, intra > inter, with 87% of WGAC related to chrUn.
2. Identity distribution peaks at 97-98%. Few pairs are higher than 99%.

[Figure: WGAC length distribution. x-axis: WGAC length (1 kb to 50 kb); y-axis: total (bp); series: inter, intra.]
[Figure: WGAC identity distribution. x-axis: identity (90% to 100%); y-axis: total (bp); series: inter, intra.]

General analysis, NR distribution on chromosome
• High SD in chrUn.

[Figure: Nonredundant WGAC length distribution on chromosome. Axes: chromosome (chr1 through chrZ_random, chrUn) vs. total (bp); series: both, inter, intra.]
[Figure: Percentage of nonredundant WGAC on chromosome. Axes: chromosome vs. chromosome percent (%); series: both, inter, intra.]

[Figure: Global image showing the inter and intra pairs of 10 kb and above with >90% identity. Red indicates inter chromosomal pairs and blue indicates intra chromosomal pairs. Panels: without chrUn; with chrUn.]

WGAC page
• http://eichlerlab.gs.washington.edu/help/linchen/zfinch/zfinch_wgac.html

WSSD analysis done by Ginger
http://eichlerlab.gs.washington.edu/help/ginger/zebrafinch/
• Downloaded the WGS reads: about 11,683,735 reads from the trace archive at NCBI.
• Downloaded zfinch finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold for a 5-kb window is 110; for a 1-kb window it is 22.
• Used the UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%. (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')

WSSD results
• A total of 16,076 regions with 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off). 13,782 of them are on chrUn.
• A summary table of WGAC intersected with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls

[Figure: General view showing WGAC (>5 kb) and WSSD on all chromosomes. Grey above the lines is WSSD; brown below the lines is WGAC.]

Union of WSSD and WGAC: gene intersect with Seg Dups
• A nonredundant union of WGAC and WSSD was generated with a cut-off size of 10 kb (AllDup10kb.tab). There are 3,839 NR regions with 50,902,487 bp, which is about 10 Mb more than WSSD alone.
• However, be aware that there may be false-positive sites, especially on chrUn, since we know there are many false-positive WGACs on chromosomes and chrUn.

Summary table 1
                       nr total        chrN            chrUn          No. interval   file
wssd (bp)              44,218,871      11,237,985      35,080,886     729            wssdGE10K_nogap.tab
wgac (bp)              384,501,909     232,493,308     152,008,601    7387           oo.weild10kb.join.all.cull
AllDup (bp)            394,988,746     235,022,961     159,965,785    5934           allDUP
Wssd and Wgac shared   8,195,577       3,182,128       5,013,449
Genome (bp)            1,233,186,341   1,057,961,026   175,225,315

Large SDs >=10 kb
• SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a length of 50,902,487 bp in allDup.tab.

The study of the chromosome-only WGAC
• The segmental duplications on sequences assigned to chromosomes should be more reliable, with fewer artifacts.
• These sequences should reflect the best of the assembly.
• Total Dup length: 105,145,288 bp
• Intra Dup length: 100,234,309 bp
• Inter Dup length: 8,499,428 bp
• More than 90% of the dup is intra chromosome dup.
• These intra chromosome dups are predominantly short-range; see the global view on the next slide.

[Figure: Global view at 90%-5k and 94%-5k respectively, showing that a significant amount of the WGAC pairs are intra chromosome short-range duplications.]
[Figure: Blowup view showing WGAC on chromosome 1 at 5k and 94%. This is WGAC detected on sequences assigned to chromosomes only.]

Intra chromosome: detail of a sample region on chr1
[Figure: Sample region on chr1. Labels: homology pairs; grey, depth of coverage by reads; WSSD; assembly gaps; average identity for the reads mapped to the region (red >99%, orange >98%, yellow >97%, green >96%).]

Text description for slide 20
• Each black line represents the chromosome region indicated by the ticks.
• Blue bars and pairs are the intra chromosome homologous pairs (segmental duplications) found.
• Red bars and pairs on the chromosome line represent the inter chromosome homologous pairs (inter chromosome segmental duplications).
• The grey bars under the chromosome line represent the depth of coverage of the region by WGS reads in 1-kb windows. The longer the bar, the higher the depth of coverage by sequence reads.
• The color bar under the chromosome line represents the average identity of all the reads mapped to the region: red (>99%), orange (>98%), yellow (>97%), green (>96%).
• The black bar above the chromosome line represents the WSSD detected.
• The purple vertical lines on the chromosome line represent the assembly gaps.
• Each tick represents 10,000 bp; each line is 100 kb.

Result
• Most of the intra chromosomal pairs are very close to each other. In most cases, one sequence of the pair has gaps on both ends, which suggests that the contig is not physically connected to its adjacent sequences. It was placed at its current position by the mate pairs.
• Some of them are also next to each other, separated by a gap.
• In the sampled regions, we have not seen a single contig that contains both sequences of an intra chromosome segmental duplication pair.
• Considering the observations above, we think there is a high possibility that these pairs are assembly artifacts introduced by the assembler.
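
Sketch: checking the gap observations programmatically
The two observations above (one sequence of a pair with gaps on both ends; two sequences next to each other separated by a gap) were made by inspecting sampled regions. A minimal sketch of how they could be tested over all intervals on one chromosome is given below. The function names, the 0-based half-open coordinate convention, the tolerance tol, and the toy coordinates are all assumptions made for illustration; they are not part of the WGAC pipeline.

```python
# Illustrative only: names, tolerance, and coordinates are hypothetical.
# All intervals are (start, end) tuples on the same chromosome,
# 0-based and half-open.

def has_gap_on_both_ends(start, end, gaps, tol=1000):
    """True if some gap ends within tol bp before `start` and some gap
    starts within tol bp after `end`."""
    left = any(start - tol <= g_end <= start for _, g_end in gaps)
    right = any(end <= g_start <= end + tol for g_start, _ in gaps)
    return left and right


def adjacent_across_gap(a, b, gaps, tol=1000):
    """True if intervals a and b are neighbours with an assembly gap
    between them: the gap starts within tol bp after the first interval
    and ends within tol bp before the second one."""
    (_, e1), (s2, _) = sorted([a, b])  # order the two intervals by start
    return any(
        e1 <= g_start <= e1 + tol and s2 - tol <= g_end <= s2
        for g_start, g_end in gaps
    )


# Toy data (hypothetical coordinates, not taken from taeGut1).
gaps = [(9000, 10000), (15000, 15500), (40000, 41000)]
dups = [(10000, 15000), (20000, 30000)]

for start, end in dups:
    print(start, end, has_gap_on_both_ends(start, end, gaps))

print(adjacent_across_gap((10000, 15000), (15500, 18000), gaps))
```

A linear scan over the gaps is enough for a sampled region; for a whole chromosome, sorted gap coordinates with a binary search would be a more reasonable choice.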
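
Sketch: WSSD depth threshold arithmetic
The WSSD slides give the rule (depth of coverage > average + 3 SD) and the numbers for a 5-kb window (average 59 reads, threshold 110). If the threshold was derived with that rule, the implied SD is 17 reads, and the 1-kb threshold of 22 equals the 5-kb threshold divided by 5. The sketch below reproduces this arithmetic and shows how such a threshold could be computed from control-window read counts. The function names, the use of the population SD, and the toy counts are assumptions for illustration; this is not the WSSD pipeline itself.

```python
import statistics

# Numbers stated on the WSSD slide (5-kb windows on finished BACs).
AVG_5KB = 59
THRESH_5KB = 110

# Implied SD if the threshold is average + 3 SD (an inference).
implied_sd = (THRESH_5KB - AVG_5KB) / 3
# Per-kb scaling of the 5-kb threshold; matches the stated 1-kb value of 22.
thresh_1kb = THRESH_5KB / 5


def depth_threshold(control_counts, n_sd=3.0):
    """Average + n_sd * SD of read counts in control windows.
    Population SD is used here; which estimator the pipeline used
    is not stated in the slides."""
    mean = statistics.mean(control_counts)
    sd = statistics.pstdev(control_counts)
    return mean + n_sd * sd


def windows_above(counts, threshold):
    """Indices of windows whose read count exceeds the threshold."""
    return [i for i, c in enumerate(counts) if c > threshold]


# Hypothetical control counts chosen to have mean 59 and SD 17.
control = [42, 76, 42, 76]
threshold = depth_threshold(control)

# Hypothetical per-window read counts for a test region.
region = [50, 111, 110, 200]
print(implied_sd, thresh_1kb, threshold, windows_above(region, threshold))
```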