21 by dandanhuanghuang


									Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)

      Assembly and Gene Identification of Contigs of

          Synechococcus sp. strain PCC 7002 Genome

                                             Wang Zhu
                            College of Life Science, Peking University


In this project, we try to sequence and annotate the genome of a cyanobacterium -- Synechococcus
sp. strain PCC 7002. Currently, prokaryotic genome sequencing is generally carried out by the
shotgun approach. We obtained 1kb-3kb sequence reads from Huada gene center. Cosmid end
sequences were also used. We utilized software package Phredphrap to perform the assembly of
these reads and have reduced the number of contigs to 242. The final goal is to construct a whole
genome sequence of several megabases with an error rate lower than 1 per 10000 nucleotides. The
largest contigs was analyzed by program GeneMarkS to correct frame shift errors and predict
genes in them. We also programmed a tool to extract these gene sequences from the report list of
GeneMarkS so that they can be under further studies.


The cyanobacteria are believed to be of very ancient origin, and are the answer of present-day
chloroplasts. Therefore, it is of great interest to analyze the structure and organization of genes in
this organism. Synechococcus sp. strain PCC 7002 is a unicellular cyanobacterium. The genomes
of other two cyanobacteria PCC 6803 and PCC 7120 have already been sequenced and can be
visited through public databases. 7002 genome is predicted to be 2.8Mb. Understanding its whole
genome may provide the basis for the studies of metabolism and photosynthesis. At present, the
most widely used strategy for the sequencing of a microbial genome is that of whole-genome
shotgun sequencing. A large number of clones, from the libraries representative of the whole
genome, are sequenced and assembled into contigs. The contigs are then linked together using a
variety of methods to obtain the whole genome sequence in a single contig. The process can be
divided roughly into several procedures, including library construction, reads assembly and
closure phase, as shown in Figure 1.

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)

                    Fig 1. Outline of genome sequencing, assembly and annotation

Both small-insert (1-2 kb) libraries and large-insert (20-300kb) libraries like BAC (bacterial
artificial chromosome) are needed to ensure appropriate coverage of the genome and obtain a
‘scaffold’ of the genome which is used during the closure phase. The small-insert library data of
PCC 7002 came from Huada gene center. The BAC library is being constructed in our lab.
The assembly phase is composed of three major steps: the conversion of the data from automated
sequencers to nucleotide sequences, the utilization of these sequences in the assembly process and
the continuous assessment of this assembly process. There are various software tools for such
work and we choose the Phredphrap package. It’s based on complex algorithms which perform
pairwise comparisons for all sequences, allowing automatic threshold selection with respect to the
decision of whether two sequences overlap or not. Clusters of overlapping sequences are
constructed and consensus sequences are deduced from these clusters. The assembly-result file in
Phrap format can be viewed with software Consed. In Consed, contigs are shown with the
sequences composing of them. It’s easy to find and edit low-quality sequences and the assembly
results can be assessed vividly.
The goals of annotation include detecting and describing the protein-coding sequence, the
structure of these genes (including untranslated regions and control elements), homology
comparison between the sequences being analyzed and sequences available from public databases
at either nucleic acid or the protein level. As long as contigs are constructed, preliminary
annotation can be carried out. In this research, we use software GeneMarkS, which is based on
heuristic Markov model, to predict genes in contigs as well as correct frame shift errors.

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
Methods and Results

Ⅰ. Assembly

1. Base calling
The computer operation system is Unix/Linux. The reads sequences are stored as chromat files in
directory “chromat_dir”. One example of the file is shown in Figure 2.

                        Fig 2. Part of a “.abi” chromat file viewed with Consed

Phred uses simple Fourier methods to examine the four base traces in the region surrounding each
point in the data set in order to predict a series of evenly spaced predicted peak locations. Next
phred finds the centers of the actual (observed) peaks and a dynamic programming algorithm is
used to match the observed peaks with the predicted peak locations found in the first step. Phred
evaluates the trace surrounding each called base using a quality value (QV). The quality value is
related to the base call error probability (P_e) by the formula:
                                             P_e=10 QV/10

Run phred with the options:
% phred –id chromat_dir –pd phd_dir
which causes phred to read the chromat files in “chromat_dir” and write the converted “.phd” files
to “phd_dir”. In “.phd” files comments on the file conversion process are listed first, then are the
bases information as shown below in Table 1.

                              <base>     <quality>     <position in chromat>
                                c           34                 2622
                                a           42                 2634
                                a           42                 2647
                                 t          33                 2660
                                a           31                 2674
                                 t          33                 2685
                                 t          11                 2698
                                g           11                 2711
                   Tab 1. Part of the file “s_14291.y1.abi.phd” showing phd format

Then run the phd2fasta program to make FASTA files. After running
% phd2fasta –id phd_dir –os seqs_fasta –oq seqs_fasta.screen.qual
two files are created. File “seqs_fasta” records all the sequences in FASTA format (shown in Table
2.) and file “seqs_fasta.screen.qual” records the quality value of each base in all the sequences.

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)

>001-h10.urt.abi CHROMAT_FILE: 001-h10.urt.abi PHD_FILE: 001-h10.urt.abi.phd.1 CHEM:
term DYE: big TIME: Sun Jun 2 17:06:07 2002 TEMPLATE: 001-h10 DIRECTION: fwd
>001-h11.uft.abi ………...
                      Tab 2. Part of the file “seqs_fasta” showing FASTA format

2. Vector screening
All the sequencing results contain the vector sequence at the two ends of the insert sequence. They
should be screened out before assembly. This is done by program cross_match:
% cross_match seqs_fasta vector.seq –minmatch 12 –minscore 20 –screen > screen.out
File vector.seq, also in FASTA format, contains all the vector sequences we want to screen for
(pUC19, pBluSKM, pBluSKP). The “–screen” option causes a file named “seqs_fasta.screen” to
be created, containing vector-masked versions of the original sequences. This “.screen” file is
what later is provided as input to Phrap. The output file “screen.out” lists the matches that were

3. Assembly
The program phrap is based on Smith-Waterman algorithm (SWAT) and so is cross_match. It
scores pairwise alignment and constructs contig sequences as a mosaic of the highest quality parts
of reads.
Run phrap to perform the sequence assembly as follows:
% phrap seqs_fasta.screen –ace > phrap.out
Phrap writes the assembled contigs to the file “seqs_fasta.screen.contigs” (shown in Table 3.) and
creates a “seqs_fasta.screen.contigs.ace” file that can be used for importing the assembly to
Consed for assessment and editing. The assembly output information is contained in file

                           Tab 3. Part of the file “seqs_fasta.screen.contigs”

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
4. Viewing with Consed
  We used program Consed to view the assembly results generated by phrap (shown in Figure 3).
The consensus is on the top line and reads that match are listed below. The darker region of a read
is the low-quality area and an erroneous base is marked by red color. When a chromat file “.abi” is
displayed in consed, erroneous bases in a sequence can be changed by clicking the middle button
of the mouse on the bases and then editing them.

                              Fig 3. Viewing assembly results with Consed

Ⅱ. Results

Actually, all the assembly steps can be run under one combined program phredphrap:
% phredphrap &
Before running, we have edited the file “phredphrap” and changed some parameters to meet our
needs. The parameters are listed below: (their meaning discussed later)
-trim_qual 20        -trim_start 10      -repeat_stringency 0.95      -forcelevel 1
-bypasslevel 1       -maxgap 35          -minmatch 12                 -minscore 30
-maxmatch 30         -vector_bound 50 -max_subclone_size 8000

We put 26911 entries for assembly and got 242 contigs (see Figure 4) in total. Average quality
value of these entries was 25.0. About 62.57% of them were bidirectional clones, 33.10%
unidirectional clones, 2.49% walking sequence, 1.36% cosmid end sequence and the remaining
0.48% were genes from EMBL and Genebank. The entries’ average full length was 801.4bp and
was reduced to 657.2bp after trimming (remove the beginning low quality bases). Average quality
value of consensus sequences is 48.0 per base. The number of confirmed reads is 25494 and the
remaining are singlets that can’t be linked with any other reads. The depth of coverage reached 5.8.
We estimated preliminarily that the size of Synechococcus sp. strain PCC 7002 genome is

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)

                                                                                     327789bp, 10%

                                                                               155659bp, 5%

                                                                       121021bp, 4%

                     Fig 4. Composition of contigs and the size of the largest ones

Ⅲ. Gene Identification

1. Running GeneMarkS
Now the largest contig obtained from phredphrap is 328kb in large. These large contigs can be the
substrate of annotation. Gene identification is the first step of annotation and we use software
GeneMarkS to predict genes in them. GeneMarkS uses an improved version of the gene finding
program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the
Gibbs sampling multiple alignment program. It’s especially useful for newly sequenced
prokaryotic genome with no prior knowledge of any protein or rRNA genes.
We submitted the contig sequences online at website:
The results were sent via email. For each contig, we acquired a postscript graphics file which
demonstrated the predicted genes (see Figure 5) and a text file containing the begin and end
locations of the genes. The plateau in the postscript graphics indicates the range of the gene.
Genes could be found both in direct sequence and complementary sequence, and they could be in
frames 1,2,3 or -1,-2,-3.

                          Fig 5. Predicted genes demonstrated by GeneMarkS

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
2. Frame shift correction
Besides gene identification, we found GeneMarkS results very useful for the identification of
frame shift. As we know, a gene-coding sequence can be translated in three frames. So any base
deletion or insertion may cause the translation change from one frame to another, i.e. frame shift.
A frame shift can be easily identified in the postscript graphics file, usually at the site where an
arrow points from one plateau to adjacent plateau in another frame. One example is shown in
figure 6.

                            Fig 6. Frame shift shown in postscript graphics

We reexamined the target sequence in Consed (see Figure 7), and found one possible base deletion
at site 93432. File “s_6377.y1.abi” suggested a “c” here, but the consensus omitted it.

                      Fig 7. Reexamine the sequence at possible frame shift site

Then we checked the chromat files “s_6377.y1.abi” and “s_6849.g1.abi” and judged that there
should be a “c” at site 93432 (as shown in Figure 8). So a “c” was added to sequence
“s_6849.g1.abi” and thus to the consensus. The frame shift was corrected.

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)

                           Fig 8. Confirmation and Correction of frame shift

Then we ran GeneMarkS again and found the frame shift error was indeed corrected (see Figure

                 Fig 9. Frame shift error was corrected in the new postscript graphics

Running blastx against PCC 6803 genome database also revealed the frame shift (see Table 4,
Before correction). After correction, we ran blastx again, and this time the gene was complete and
it matched very well (see Table 4, After correction).

                     E-value      Frame     Start site   End site   Match start   Match end   AA length
    Before          3.00E-67        1        92056         93432          3         460         493
   correction       3.00E-67        3        93435         93527        462         492         493
After correction    3.00E-73        1        92054         93526          3         492         493
              Tab 4. Comparison between before and after correction of the frame shift

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
With this method, most of the frame shift errors in contigs can be detected and corrected.
Sometimes it’s difficult to decide confidently where the frame shift occurs for the low quality of
the examined sequence. In that case, we shouldn’t rush to correct the error as we think what it
should be. More data should be added before further consideration.

3. Information extraction
When mistakes were reduced to minimum, we can extract these genes out from the contig
sequences and get the protein sequences. So we wrote a program called “extract” to extract all the
gene sequences from the information of their begin and end locations in the contigs, and another
program called “translate” to translate these gene sequences to protein sequences. They were both
written in Perl. For the length limitation of this paper, program lines are not listed here.

4. Analysis of contig 241
We chose the second largest contig - contig 241 (156kb) to be analyzed. After running
GeneMarkS and correcting frame shift errors (about one error per 10kb), we know that there are
about 142 genes in this contig (every 1096bp has a gene in either strand). The average length of
the genes is 918bp. So it’s clear that coding sequences comprise most regions of this cyanobacteria
genome and genes are arranged in line consecutively with very short gap between them. Then this
contig was performed by running blastx against PCC 6803 protein database. A threshold e-value
of 1E-20 was used for this analysis. Part of the blastx results are shown below in Table 5. We
compared these proteins with the genes predicted by GeneMarkS, finding that 102 genes share
homology with 6803 proteins and the other 40 genes are unique for 7002. These unique genes may
be of special interest to 7002’s morphology and metabolism and thus need further experiments to
reveal their structure and function.

                     Tab 5. Part of results of the contig 241 blastx against PCC 6803
Query Query                                                             genetic Length   Sbjct   Sbjct
               frame e-value      ORF No             product
init    end                                                              symbol   (aa)   init     end

 202    1128     1      1E-126    sll1598     Mn transporter MntC         mntC    330     15      326

1189    1911     1       2E-24    sll0385       ABC transporter                   284     42      276

1207    1851     1       8E-38    slr2044       ABC transporter                   289     21      231

1210    1941     1      1E-110    sll1599     Mn transporter MntA         mntA    260      9      252

1213    1914     1       4E-23    sll0489       ABC transporter                   342      2      224

1213    1911     1       1E-30    slr1318     iron(III) dicitrate         fecE    268     10      244
                                               transport system
                                            permease protein FecE

1213    1839     1       5E-21    sll1453      nitrate transport          nrtD    332     19      225
                                                  protein NrtD

1261    1845     1       4E-21    sll0778       ABC transporter                   790     244     433

1980    2780     3       6E-22    slr2045    hypothetical protein                 281     10      277

1998    2810     3      1E-107    sll1600     Mn transporter MntB         mntB    306     12      282

3037    3639     1       5E-33    slr0006                                         217     18      217

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
Query Query                                                             genetic Length   Sbjct   Sbjct
               frame e-value      ORF No             product
init    end                                                              symbol   (aa)   init     end

5280    4426    -1      4E-77     slr1559           shikimate             aroE    290      4      288

6074    5361    -2      2E-55     sll1123                                         256     15      255

7233    6142    -1     1E-142     sll0245    hypothetical protein                 363      1      363

9115    8153    -3      3E-32     slr1225     protein kinase PknA         pknA    495      3      344

9118    8243    -3      4E-32     slr1697     eukariotic protein          pknA    574      1      296

12603 13283      3      1E-24     slr1113       ABC transporter                   332     11      244


1. Often we got new sequence data, which was not in chromat files, but text files containing the
sequences. In such case, we first edited the files in Word and saved them in FASTA format.
Program SeqVerter, which can conveniently merge several sequence files into one file or do the
reverse process, may be especially useful for the creation of proper file format. Next we used
program mktrace to create an ideal chromat file “.scf”, for example:
% mktrace 3e7-3-cos 3e7-3-cos.scf
We simply wrote a shell and add this sentence “mktrace $i $i.scf” to batch process a directory of
files and saved the new chromat files in directory “chromat_dir”.
Having these chromat files in hand, we could easily run phredphrap again and these new sequence
information were added into assembly process.
In fact, any FASTA file can be treated with mktrace to create a chromat file, which can then be
viewed and edited in Consed.

2. Many parameters in command line options of phrap can greatly affect the results of assembly.
For instance, -minmatch sets the minimum length of matching word to nucleate SWAT
comparison. Increasing -minmatch can dramatically decrease the time required for the pairwise
sequence comparisons; it also tends to have the effect of increasing assembly stringency. For
example, when we changed this parameter from 12 to 14, the number of contigs increased from
242 to 253. However, it may cause some significant matches to be missed. Parameter -
maxmatch sets the maximum length of matching word. Parameter -minscore sets the minimum
alignment score. Phrap scores pairwise sequence alignment as follows: matching residues receive
a reward of +1, mismatches get a penalty of -2, gap opening residues a penalty of -4, and gap
extension residues a penalty of -3. So a sequence alignment score must be above the -minscore
before these sequences can be put together. Parameter -vector_bound sets the number of
potential vector bases at beginning of each read. Matches that lie entirely within this region are
assumed to represent vector matches and are ignored. Parameter -max_subclone_size checks the
maximum size of forward-reverse read pair. Parameter -trim_start sets the number of bases to be
removed at the beginning of each read as these bases are often of low quality. Parameter -
repeat_stringency controls stringency of match required for joins and -forcelevel and -

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
bypasslevel regulate stringency during final contig merge pass(0 is most stringent and 10 is the
least). We have tested different values of these parameters and have run phredphrap after each
change. The current parameters are set to reach a balance between minimizing number of contigs
and avoiding mismatches.

3. Now our work is still going on and the final goal is to obtain the whole genome sequence,
annotate the genome and set up a web database for public use. We are now at the gap-closure
phase in which we should link the contigs together with specific PCR products or cloned inserts
that span each gap so that a single contig is obtained. We have several methods to link contigs
together. We have done blastx and found that if ends of two contigs encode different parts of the
same protein, these contigs are probably neighbors. Our lab is also constructing BAC libraries
(40kb). If the terminal sequences of a single BAC clone belong to different contigs, it also
indicates neighborhood of these two contigs. All potential neighbor predictions have to be verified
by standard or long-range PCR. For the contigs without identified neighbors, gaps in genomic
sequences may be bridged by chromosomal-walking methods.
Annotation is another huge task to be solved. It includes genes identification (as we have done in
contigs) and characterization, deduction of metabolic pathways and prediction of protein structure.
Additionally, from the complete view of the whole genome, we may discover gene displacements
and horizontal transfers, which may help to trace evolutionary networks.


I’d like to express my sincere thanks to Professor Zhao Jindong. It’s he who led me to the frontier
of biological research field. I thank Professor Luo Jingchu for his guidance on bioinformatics. I
also thank Lee Tao who is in charge of the whole project and had helped me a lot in my research
work. Finally, I should thank Dr. Lee Tsung-Dao and Chun-Tsung foundation for offering me this
research opportunity.


Frangeul, L, et al. (1999). Cloning and assembly strategies in microbial genome projects.
Fleischmann, W, et al. (1999). A novel method for automatic functional annotation of proteins.
Besemer, J, et al. (2001). GeneMarkS: a self-training method for prediction of gene starts in
microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids
Bouck, J, et al. (1998). Analysis of the quality and utility of random shotgun sequencing at low
redundancies. Genome Research, 8,1074-1084.
Frederick, R. et al. (1997). The complete genome sequence of Escherichia coli K-12. Science, 277,
Ewing, B. et al. (1998). Base-calling of automated sequencer traces using phred. II. Error
probabilities. Genome Research,8,186-194.
McMurray, A. et al. (1998). Short-insert libraries as a method of problem solving in genome

Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
sequencing. Genome Research, 8, 562-566.
Nelson, K. et al. (1999). Evidence for lateral gene transfer between Archaea and Bacteria from
genome sequence of Thermotoga maritima. Nature,399,323-329.
Gordon, D. et al. (1998). Consed: a graphical tool for sequence finishing. Genome
Barnes, W. (1994). PCR amplification of up to 35kb DNA with high fidelity and high yield from
lambda bacteriophage templates. PNAS USA,91,2216-2220.
Staden, R. (1979). A strategy of DNA sequencing employing computer programs. Nucleic Acids
Huang, X. et al. (1996). An improved sequence assembly program. Genomics,33,21-31.
Makoto H. et al. (1995). Computer survey for likely genes in the one megabase contiguous
genomic sequence data of Synechocystis sp. strain PCC 6803. DNA Research,2,239-246.
Green, P. et al. (1997). Against a whole-genome shotgun. Genome Research, 7,410-417.
Cole, S. et al. (1998). Deciphering the biology of Mycobacterium tuberculosis from the complete
genome sequence. Nature,393,537-544.
Takakazu, K. et al. (2001). Complete genomic sequence of the filamentous nitrogen-fixing
cyanobacterium Anabaena sp. strain PCC 7120. DNA Research,8,205-213.

99级本科生。2001年5月开始受“ 政基金”资助参加科研工作。曾获ESEC奖学

感悟与寄语:一年半的“ 政”经历使我收获良多。我不仅掌握了基本分子生物
关切之情, 政”经历将永远是我的人生的一笔宝贵财富。

指导教师简介:       赵进东,生命科学学院副院长,长江特聘教授。美国 University of
Texas-Austin 博士学位。主要研究领域为蓝藻的光合和固氮作用。


To top