Zebra Finch Seg Dup Analysis
2. Parameters for Pipeline
Zebra Finch Genome
• The genome (Jul. 2008 assembly of the zebra finch genome, taeGut1,
WUSTL v3.2.4) was downloaded from UCSC. This assembly was produced by
the Genome Sequencing Center at the Washington University in St. Louis
(WUSTL) School of Medicine.
• The zebra finch DNA used for the shotgun sequencing and the BAC and
cosmid libraries was derived from a single male domesticated zebra finch.
The initial assembly was generated using PCAP with approximately 6X
coverage. About 1.0 Gb of the 1.2-Gb genome has been ordered and
oriented along 33 chromosomes and one linkage group. The chromosome
names are based on their homologous chromosomes in the chicken (Gallus
• Total genome size (gapped) 1,233,186,341 bp
Seg Dup detection pipelines
• WGAC detects Seg Dup in genomic assemblies by
looking for homologous pairs (>1 kb in length, >90%
• WSSD detects Seg Dup in given sequences based
on depth of coverage of WGS (whole-genome shotgun)
reads. Depth of coverage > average + 3 SD. Done by
Parameters and notes for WGAC pipeline
– The sequences downloaded from UCSC were already soft-masked.
• UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata'
– The repeat coordinates were reverse generated based on the soft-
• Blast parsing seeds in WGAC pipeline:
– the seed size is 250 bp.
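Reverse-generating repeat coordinates from a soft-masked sequence amounts to finding runs of lowercase bases. The following is a minimal sketch of that idea, not the pipeline's actual code; the function names and the 0-based, half-open coordinate convention are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple


def lowercase_runs(seq: str) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) for each run of lowercase bases."""
    for match in re.finditer(r"[a-z]+", seq):
        yield match.start(), match.end()


def masked_intervals(fasta_lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (sequence_name, start, end) for every soft-masked run in FASTA text."""
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: report runs for the record just finished.
            if name is not None:
                for start, end in lowercase_runs("".join(chunks)):
                    yield name, start, end
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    # Report runs for the last record in the input.
    if name is not None:
        for start, end in lowercase_runs("".join(chunks)):
            yield name, start, end
```

Runs that continue across FASTA line breaks are reported as a single interval because each record is joined before scanning. This holds a whole record in memory, which is simple but may need streaming for large chromosomes.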
Result from WGAC Pipeline
• Total pairs of WGAC detected
(>1 kb and >90% identity) 198180
• Inter chromosome pairs 81415
• Intra chromosome pairs 116742
• Chromosome inter and intra
(excluding chr_random and chrUn) 26510
• ChrUn inter and intra 172670
• Total WGAC NR (bp) 384,501,909
• Total genome size (with gap) 1,233,186,341
• The NR space of WGAC is about 31% of the zebra finch genome, which is too high. This is
due either to incomplete repeat masking or to redundant sequences in chr_random
and chrUn. In 87% of the total WGAC pairs (inter and intra), at least one sequence
of the pair is on chrUn. The result indicates that a big portion of false-positive WGAC is
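The non-redundant (NR) space counts each duplicated base once, however many pairs cover it. A minimal sketch of that calculation is given here, assuming half-open (chrom, start, end) intervals; the function name and the toy intervals are hypothetical, and only the two totals at the end come from the slide.

```python
from collections import defaultdict


def nr_bp(intervals):
    """Return the number of bases covered at least once by the given intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Hypothetical toy intervals: two overlapping on chr1, one on chr2.
toy = [("chr1", 0, 100), ("chr1", 50, 150), ("chr2", 10, 20)]
print(nr_bp(toy))

# Totals reported on the slide.
wgac_nr, genome = 384_501_909, 1_233_186_341
print(f"WGAC NR fraction of genome: {100 * wgac_nr / genome:.1f}%")
```

With the reported totals, the fraction rounds to the 31% quoted above.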
General analysis of WGAC length and identity
1. Length distribution peaked at 1-2 kb, intra > inter, with 87% of WGAC related to chrUn.
2. Identity distribution peaked at 97-98%. Few are higher than 99%.
[Figure panels: WGAC length distribution (x-axis: WGAC Length (bp)); WGAC identity distribution (x-axis: Identity)]
General analysis: NR distribution by chromosome, high SD in chrUn
Non-redundant WGAC length distribution by chromosome
Percentage of non-redundant WGAC per chromosome
chromosome percent (%)
Global view of the inter- and intra-chromosomal pairs of at least 10 kb and above 90% identity, without and with chrUn.
Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.
WSSD analysis done by Ginger
• Downloaded the WGS reads; about 11,683,735 reads from trace archive at
• Downloaded zfinch-finished BACs. These BACs are used to determine the
threshold for WGS depth of coverage. For a 5-kb window, the average number of
reads is 59. The threshold for a 5-kb window is 110; for a 1-kb window it is 22.
• Used UCSC taeGut1 database rmsk tables as input to mask the genome for
repeats with divergence <=10%.
(UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')
• A total of 16,076 regions with 44,218,871 bp were found in
wssdGE10K_nogap.tab (which has a 10-k cut-off). 13,782 of them
are on chrUn.
• A summary table of WGAC intersect with WSSD is at
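The depth threshold follows the rule stated earlier for WSSD (depth of coverage > average + 3 SD), calibrated on control windows such as the finished BACs. The sketch below illustrates that rule only; the function names and read counts are hypothetical, and the use of the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pstdev


def depth_threshold(control_counts, n_sd=3.0):
    """Threshold = mean + n_sd * SD of read counts in control windows."""
    return mean(control_counts) + n_sd * pstdev(control_counts)


def windows_above(counts, threshold):
    """Return indices of windows whose read count exceeds the threshold."""
    return [i for i, count in enumerate(counts) if count > threshold]


# Hypothetical read counts per window in control (unique) sequence.
control = [50, 60, 70, 55, 65]
threshold = depth_threshold(control)

# Hypothetical read counts per window in the sequence being tested.
candidate_windows = windows_above([58, 140, 62, 131], threshold)
print(threshold, candidate_windows)
```

If the reported 5-kb threshold of 110 was obtained this way from the average of 59, the implied SD would be (110 - 59) / 3 = 17 reads per window. The 1-kb threshold of 22 equals 110 / 5, which is consistent with scaling by window size, although the slides do not state how it was derived.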
General view showing WGAC (>5kb) and WSSD on all chromosomes
Grey marks above the lines are WSSD
Brown marks below the lines are WGAC
Union of WSSD and WGAC
gene intersect with Seg Dups
• A nonredundant union of WGAC and WSSD was generated with a cut-off
size of 10 kb (AllDup10kb.tab). There are 3,839 NR regions with
50,902,487 bp, which is about 6.7 Mb more than WSSD alone (44,218,871 bp).
• However, be aware that there may be false-positive sites, especially on
chrUn, since we know WGAC has many false positives on both
chromosomes and chrUn.
Summary table 1
             total          chrN           chrUn        interval  file
wssd (bp)    44,218,871     11,237,985     35,080,886   729       wssdGE10K_nogap.tab
wgac (bp)    384,501,909    232,493,308    152,008,601  7387      oo.weild10kb.join.all.cull
AllDup (bp)  394,988,746    235,022,961    159,965,785  5934      allDUP
Wgac shared  8,195,577      3,182,128      5,013,449
Genome (bp)  1,233,186,341  1,057,961,026  175,225,315
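Assuming the "Wgac shared" row counts bases covered by both WGAC and WSSD (the slides do not define it), a figure of that kind can be obtained by intersecting the two interval sets. The sketch below does this for one chromosome with half-open (start, end) intervals; the function names and example coordinates are hypothetical.

```python
def merge(spans):
    """Merge overlapping or touching (start, end) spans into sorted blocks."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def shared_bp(a, b):
    """Return the number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever block ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical intervals on a single chromosome.
wgac = [(0, 100), (200, 300)]
wssd = [(50, 250)]
print(shared_bp(wgac, wssd))
```

Merging each set first keeps bases from being counted twice when intervals within one set overlap.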
Large SDs >=10 kb
• SDs >=10 kb in size were pulled out. There are a total of
3,839 intervals with a total length of 50,902,487 bp in the allDup.tab.
The study of the chromosome only
• The segmental duplications on sequences assigned to
chromosomes should be more reliable sequences with
• They should contain sequences reflecting best of the
• Total Dup length 105,145,288 bp
• Intra Dup length 100,234,309 bp
• Inter Dup length 8,499,428 bp
• Most Dup is intra-chromosomal dup (>90%)
• These intra-chromosomal dups are predominantly short-range
intra dups; see the global view on the next slide
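The inter/intra split, and the "short-range" label, can be expressed as a simple classification of each pair. This is a sketch under stated assumptions: the slides give no numeric definition of short range, so the 1 Mb default below is a hypothetical cut-off, as are the function name and the half-open coordinates.

```python
def classify_pair(chrom1, start1, end1, chrom2, start2, end2,
                  short_range_bp=1_000_000):
    """Label a duplication pair as 'inter', 'intra_short' or 'intra_long'."""
    if chrom1 != chrom2:
        return "inter"
    # Distance between the two copies; negative when they overlap.
    distance = max(start1, start2) - min(end1, end2)
    return "intra_short" if distance <= short_range_bp else "intra_long"


# Hypothetical pairs.
print(classify_pair("chr1", 0, 1_000, "chr2", 0, 1_000))
print(classify_pair("chr1", 0, 1_000, "chr1", 5_000, 6_000))
print(classify_pair("chr1", 0, 1_000, "chr1", 5_000_000, 5_001_000))
```

Counting the labels over all pairs gives the inter and intra tallies; lowering `short_range_bp` tightens the short-range definition.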
Global views at 90% identity/5 kb and 94% identity/5 kb, respectively, showing that a significant
number of WGAC pairs are short-range intra-chromosomal duplications.
The blowup view shows WGAC on chromosome 1 at 5 kb
and 94% identity. This is WGAC detected on sequences assigned
to chromosomes only
Detail of a sample region on chr1
[Figure labels: intra chromosome pairs; depth of coverage; WSSD; assembly gaps; average identity for the reads mapped to the region (yellow > 97%, green > 96%)]
Text description for slide 20
• Each black line represents the chromosome region indicated by the ticks.
• Blue bars and pairs are the intra-chromosomal homologous pairs (segment
• Red bars and pairs on the chromosome line represent the inter-chromosomal
homologous pairs (inter-chromosomal segmental duplications).
• The grey bars under the chromosome line represent the depth of coverage
of the region by WGS reads in 1-kb windows. The longer the bar, the
higher the depth of coverage by sequence reads.
• The color bar under the chromosome line represents the average identity of
all the reads mapped to the region: red (>99%), orange (>98%),
yellow (>97%), green (>96%).
• The black bar above the chromosome line represents detected WSSD.
• The purple vertical lines on the chromosome line represent assembly gaps.
• Each tick represents 10,000 bp; each line is 100 kb.
• Most of the intra-chromosomal pairs are very close to
each other. In most cases, one sequence of the pair
has gaps on both ends, which suggests the contig is not
physically connected to its adjacent sequences. It was
placed at its current position by the mate pairs.
• Some of them are also next to each other, separated by
• In the sampled regions, we have not seen a single contig
that contains both sequences of an intra-chromosomal
segmental duplication pair.
• Considering the observations above, we think there is
a high possibility that they are assembly artifacts
introduced by the assembler.
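The observation that one copy of a pair often has gaps on both ends could be screened for across all pairs rather than only in sampled regions. The sketch below is one possible check, not something the analysis above reports having run; the function name, the `slop` tolerance of 100 bp, and the example coordinates are all hypothetical.

```python
def flanked_by_gaps(start, end, gaps, slop=100):
    """Return True if an assembly gap lies within `slop` bp of both ends.

    `gaps` is a list of (gap_start, gap_end) on the same chromosome,
    using the same half-open coordinates as (start, end).
    """
    left = any(abs(gap_end - start) <= slop for _, gap_end in gaps)
    right = any(abs(gap_start - end) <= slop for gap_start, _ in gaps)
    return left and right


# Hypothetical gaps and duplicated segments on one chromosome.
gaps = [(900, 1_000), (2_000, 2_100)]
print(flanked_by_gaps(1_000, 2_000, gaps))
print(flanked_by_gaps(1_000, 3_000, gaps))
```

A high fraction of intra-chromosomal pairs with a gap-flanked copy would support the assembly-artifact interpretation, while a low fraction would argue against it.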