
Breakthrough innovations in ultra-high throughput sequencing methods


DNA SEQUENCING
As published in BTi - October 2006

The first high-throughput, low-cost alternative to standard sequencing systems based on Sanger chemistry has been introduced to the market and is taking the sequencing world by storm. Developed by the US-based company 454 Life Sciences, the new technology, the Genome Sequencer 20 System, is distributed exclusively worldwide by Roche Diagnostics. With the new sequencing system it is possible to address a broad variety of applications in whole genome sequencing, transcriptome and gene regulation studies, and amplicon analysis. This article describes the principles and some of the applications of the new system. Many such applications simply cannot, for either technical or economic reasons, be carried out using standard Sanger technology. The new system has already led to significant developments in genomic research, such as the identification of novel transcripts and of previously unknown classes of small non-coding RNAs (sncRNAs).

The Genome Sequencer 20 System is an ultra-high-throughput automated DNA sequencing system capable of carrying out and monitoring sequencing reactions in a massively parallel fashion. Since the new system provides a complete solution for ultra-high-throughput DNA sequencing, individual researchers can now prepare samples and sequencing reactions, generate sequence reads, and assemble genome sequence data within days. The whole genome sequencing workflow from sample input to data output consists of DNA library preparation, emulsion-based clonal PCR amplification (emPCR), PicoTiterPlate device preparation, the sequencing run, and data analysis. The output of a single run is typically 20 x 10^6 nucleotides or more (for the 70 x 75 mm PicoTiterPlate device) at an average read length of 100 high-quality bases, and multiple runs can be pooled for off-line assembly/mapping. It is the combination of massive throughput and low cost per clonal read that enables applications that were previously not possible.

WHOLE GENOME SEQUENCING
The new system has already revolutionised whole genome sequencing of microorganisms. For instance, sequencing of a 3 megabase bacterial genome to a high-quality draft is now possible within days, rather than months. Because high-quality reads are generated at an average read length of 100 bases, both de novo assembly and resequencing (mapping) of genomes are possible. The mapping application generates the consensus DNA sequence by mapping, or alignment, of the reads to a reference sequence, as well as a list of high-confidence mutations. The current version of the system software has the capacity to analyse genomes up to 50 Mbp in size at 15-25x depth of coverage. Examples of several bacterial genome assemblies are shown in Table 1. The mapping application typically results in greater than 99.99% accuracy over 95% of the non-repeat parts of the genome (Q40+ bases) when the average genome coverage is at least 15-fold. The assembler application yields N50 contig sizes greater than 10 kb with higher than 99.99% accuracy over 95% of the non-repeat parts of the genome (Q40+ bases) when the average genome coverage is at least 25-fold. (Contigs are contiguous sequences of DNA created by assembling overlapping sequences.)
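
The throughput, coverage and accuracy figures quoted above are linked by simple arithmetic. The Python sketch below is an illustration only and is not part of the Genome Sequencer 20 software: all function names are hypothetical, "Q40" is assumed to follow the usual Phred convention (error probability 10^(-Q/10)), N50 is computed by its common definition, and the contig lengths in the example are invented.

```python
import math
from typing import Sequence


def depth_of_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth: number of sequenced bases divided by genome length."""
    return total_bases / genome_size


def runs_needed(target_depth: float, genome_size: float, bases_per_run: float) -> int:
    """Smallest whole number of runs whose pooled output reaches the target depth."""
    return math.ceil(target_depth * genome_size / bases_per_run)


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy for Phred score q (assumed convention: error = 10**(-q/10))."""
    return 1.0 - 10.0 ** (-q / 10.0)


def n50(contig_lengths: Sequence[int]) -> int:
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    bases_per_run = 20e6  # "typically 20 x 10^6 nucleotides or more" per run
    genome_size = 3e6     # the 3 megabase bacterial genome mentioned in the text

    print(depth_of_coverage(bases_per_run, genome_size))  # depth from a single run
    print(runs_needed(15, genome_size, bases_per_run))    # runs for 15-fold (mapping)
    print(runs_needed(25, genome_size, bases_per_run))    # runs for 25-fold (assembly)
    print(phred_to_accuracy(40))                          # accuracy implied by Q40
    print(n50([40_000, 25_000, 12_000, 8_000, 5_000]))    # invented contig lengths
```

Because the per-run output is quoted as a typical minimum, the run counts obtained this way are upper estimates; runs that yield more than 20 x 10^6 nucleotides reach a given depth sooner.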
Since the Genome Sequencer 20 System uses neither cloning nor electrophoretic separation, sequence coverage biases normally associated with these techniques are eliminated. Lack of sequence coverage bias has been confirmed by sequencing several bacterial genomes. The remaining gaps in assembled genome sequences are due largely to the presence of sequence repeats longer than ~75 bp. Because no cloning step is required, the Genome Sequencer 20 System is particularly useful in sequencing AT-rich organisms resistant to subcloning in E. coli. One example is the sequencing of the filamentous fungus Neurospora crassa. By using the new sequencing technique, 2.5% additional sequence information has been identified compared with the Sanger sequencing approach. Not surprisingly, the GC content of this additional information was quite low (27%).

WHOLE GENOME SEQUENCING WITH PAIRED-END LIBRARIES
Recently, the developers of the new system, 454 Life Sciences of Branford, Connecticut, USA, have also developed a new protocol that makes whole genome sequencing using the Genome Sequencer 20 System even more efficient. Paired-end libraries are generated and sequenced in order to determine the orientation and relative positions of contigs produced by de novo shotgun sequencing and assembly [Figure 1]. Sequence data obtained from the paired-end libraries are combined with standard Genome Sequencer 20 whole genome shotgun sequencing reads in a new version of the assembler. The benefits of combining the reads from Genome Sequencer shotgun sequences with the paired-end reads have been tested on several bacterial genomes and on a Saccharomyces cerevisiae genome that had previously been sequenced at 454 Life Sciences.

For instance, the 4.6 Mbp genome of E. coli strain K12 was sequenced in three standard runs to a depth of 22-fold. The assembly performed with the Newbler assembly software resulted in 140 unoriented contigs. One additional sequencing run of a paired-end library yielded approximately 112,000 reads. The paired-end data improved the genome assembly to 20 multi-contig scaffolds covering 98.6% of the genome. The 12.2 Mbp genome of S. cerevisiae S288C (16 haploid chromosomes and one 86 kbp mitochondrial genome) was shotgun sequenced in nine sequencing runs, yielding approximately 23-fold oversampling. The assembly performed with the Genome Sequencer De Novo Assembler resulted in 821 unoriented contigs. Two additional sequencing runs of a paired-end library yielded approximately 395,000 reads. The paired-end data reduced the assembly to 153 scaffolds, covering 93.2% of the genome.

Figure 1. Paired-end library preparation scheme: genomic DNA is fragmented to yield average fragment sizes around 2.5 kb. The fragmented genomic DNA is methylated with Eco RI methylase to protect the Eco RI restriction sites. The ends of the fragments are blunt-ended and polished, and a biotinylated oligonucleotide adaptor is blunt-end ligated onto both ends of the digested DNA fragments. Subsequent digestion with Eco RI restriction enzyme cleaves a portion of the adaptor DNA, leaving sticky ends. The fragments are circularised and ligated, resulting in 2.5-kb circular fragments. The adaptor DNA contains two Mme I restriction sites and after treatment with Mme I the circularised DNA is cleaved 20 nucleotides away from the restriction site. Digestion generates small DNA fragments that have the adaptor DNA in the middle and 20 nucleotides of genomic DNA that were once approximately 2.5 kb apart on each end. These small, biotinylated DNA fragments are purified from the rest of the genomic DNA by streptavidin beads.

AMPLICON ANALYSIS
Sequence reads from the new system are on average 100 bases long, but are tens of thousands-fold deep. These characteristics open up a unique opportunity to use the system in applications where the detection of rare variants of a known sequence in complex mixtures of sequences is crucial. Direct sequencing of mixed, non-clonal PCR products (amplicons) using standard Sanger dideoxy terminator chemistry is not sensitive enough to identify and quantify many of the sequence variants present in biological specimens. Bacterial cloning of amplicons into a vector prior to traditional sequencing of individual clones will increase the sensitivity, but at the cost of a large increase in time and expense, thus making this approach uneconomical in practice.

The 454 technology provides amplification of hundreds of thousands of molecules via the emulsion PCR step and highly accurate sequencing, as each fragment is sequenced to a depth of a hundred- or a thousand-fold.
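
The idea of finding rare variants by reading one amplicon many times over can be sketched in a few lines of Python. This is a minimal illustration, not the Genome Sequencer 20 amplicon software: the function name, the 1% reporting threshold and the toy sequences are all assumptions, and the reads are assumed to be already aligned to the reference, base for base, with no insertions or deletions.

```python
from collections import Counter
from typing import List, Tuple


def variant_frequencies(
    reference: str, reads: List[str], min_freq: float = 0.01
) -> List[Tuple[int, str, str, float]]:
    """Report (position, ref_base, alt_base, frequency) for every non-reference
    base seen in at least min_freq of the reads covering that position.
    Positions are 0-based."""
    calls = []
    for pos, ref_base in enumerate(reference):
        # Count the base each read shows at this position (skip reads that end earlier).
        counts = Counter(read[pos] for read in reads if pos < len(read))
        depth = sum(counts.values())
        if depth == 0:
            continue
        for base, n in sorted(counts.items()):
            freq = n / depth
            if base != ref_base and freq >= min_freq:
                calls.append((pos, ref_base, base, freq))
    return calls


if __name__ == "__main__":
    reference = "ACGTACGTAC"            # invented 10-base amplicon
    variant_read = "ACTTACGTAC"         # same amplicon with G->T at position 2
    reads = [reference] * 970 + [variant_read] * 30   # 1,000-fold depth, 3% variant
    for call in variant_frequencies(reference, reads):
        print(call)
```

With real data the threshold has to sit above the per-base sequencing error rate, which is why both high accuracy and high depth matter for this application.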

Although there are many potential uses for amplicon sequencing, the molecular biology and software developments undertaken so far have initially focused on oncology research applications, more specifically on the detection of rare somatic mutations in complex cancer samples. The ability to sensitively detect somatic mutations in cancer cells promises to be of great help in understanding in much greater detail the development of cancer at the genetic level. Additionally, none of the existing high-throughput technologies offers the possibility of novel variant detection. To demonstrate the power of the new system, previously described single nucleotide polymorphisms from upstream of the HLA-DMA gene to the TAP2 gene in the Class II region of the MHC were chosen as a model system [1]. It was possible to reproduce the already published data using the new system; allele frequencies down to 3% were easily detected [Figure 2]. The results of a recent study confirmed that using the Genome Sequencer 20 System enabled detection of low-abundance oncogene mutations in complex samples with low tumour content for which conventional Sanger sequencing was not informative [2]. Somatic EGFR mutations that were missed when the Sanger sequencing method was used were identified.

Figure 2. Genotyping results of three SNPs in the HLA-DMA gene region (class II MHC). Base changes along the fragment sequence (x-axis) are colour coded and their positions shown as bars. The primary y-axis denotes base change frequency. The secondary y-axis as well as the black line above the mutation spectrogram represents sequencing coverage. Both high-frequency alleles (top panel) and low-frequency alleles (bottom panel) are shown.

TRANSCRIPTOME AND GENE REGULATION STUDIES
The Genome Sequencer 20 enables the study of transcriptomes at a previously impossible depth of coverage and sensitivity. This is due to the system's massively parallel sequencing technology, which generates a high number of sequence reads (minimum of 200,000 single reads per 5-hour run), thus facilitating the identification of previously unknown transcripts [3]. Preliminary results from a short-tag sequencing project also revealed that the Genome Sequencer 20 System was very well suited for transcript quantification (data not shown).

In terms of gene regulation, the new technology has so far been shown to be well suited for the genome-wide identification of small non-coding RNAs (sncRNAs), for the identification of transcription factor binding sites and for the elucidation of DNA-methylation patterns. Compared with sequencing of sncRNAs using the Sanger approach, during which miRNA fragments are concatemerised in order to make sequencing more economical, the new approach is much more straightforward. The often difficult concatemerisation step can simply be skipped. Moreover, costs per clonal read are much lower using the Genome Sequencer 20 System, thus providing a real basis for screening for sncRNAs at a genome-wide level. As an example, Girard et al. used the system to characterise a new class of small RNAs, called piwi-interacting RNAs (piRNAs), in mouse testes [4]. More than 87,000 reads were generated, around 53,000 of which were classified as candidate piRNAs. Other examples of the characterisation of sncRNAs include the genome-wide analysis of an Arabidopsis thaliana dicer mutant [5] and the characterisation of the piRNA complex from rat testes [6].

The identification of binding sites of DNA-binding proteins, such as those of the transcription factor p53, has recently been described [3]. DNA fragments that include binding-site sequences can be isolated after immunoprecipitation with their protecting transcription factors and characterised using high-throughput sequencing. This study revealed that binding sites can be detected with unprecedented efficiency and sensitivity.

LOSS OF METHYLATION AND HYPERMETHYLATION
An extremely important regulation mechanism of many genes is the loss of methylation (and also hypermethylation) of CpG islands within promoter regions. Genome methylation occurs at cytosine residues located 5' to a guanosine in a CpG dinucleotide. Dense areas of CpG dinucleotides within promoter regions are organised into CpG islands. Applying a known bisulphite treatment procedure, 454 Life Sciences has recently established a sequencing-based technology to determine quantitatively the methylation state of each CpG dinucleotide in a given target genomic sequence [Figure 3]. To better understand how the new chemistry performs on cancer research samples, eight samples from colorectal cancer (CRC) tumours were analysed, together with matched normal adjacent tissue (NAT). The results obtained in this experiment agreed with those in the published literature: a significant percentage of CRCs show methylation of the p16 CpG island [7, 8].

Figure 3. Following extraction from tissue or cells, genomic DNA is treated with sodium bisulphite, which serves to capture the methylation status of the sample. Treatment of DNA with sodium bisulphite results in the deamination of unmethylated cytosines to uracils, while methylated cytosines remain unchanged. During PCR amplification, each uracil (a converted C) is replaced by thymine. Comparison of the sequence obtained from the bisulphite-treated amplicon with the published sequence using the Genome Sequencer 20 amplicon software allows identification of any differential methylation.

REFERENCES
2. Thomas RK et al. Nat Med 2006; 12: 852-855.
3. Ng P et al. Nucleic Acids Res 2006; 34: e84.
4. Girard A et al. Nature 2006; 442: 199-202.
5. Henderson IR et al. Nat Genet 2006; 38: 721-725.
6. Lau NC et al. Science 2006; 313: 363-367.
7. Kim BN et al. Int J Oncol 2005; 26: 1217-1226.
8. Herman JG et al. Proc Natl Acad Sci 1996; 93: 9821-9826.

More details can be obtained from:
ROCHE DIAGNOSTICS
Mannheim, Germany. Tel +49 621 759 8555
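
As a worked illustration of the bisulphite readout described in the Figure 3 caption, the following Python sketch simulates the conversion and then estimates the methylation level at each CpG from a set of reads. It is a simplified model and not the Genome Sequencer 20 amplicon software: the function names and toy sequences are invented, only the top strand is modelled, conversion is assumed to be complete, and the reads are assumed to be aligned to the reference with no insertions or deletions.

```python
from typing import Dict, List, Set


def bisulphite_convert(sequence: str, methylated_positions: Set[int]) -> str:
    """Simulate bisulphite treatment plus PCR on one strand:
    unmethylated C is read as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def cpg_positions(reference: str) -> List[int]:
    """0-based positions of every cytosine that is directly followed by a guanine."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]


def methylation_levels(reference: str, reads: List[str]) -> Dict[int, float]:
    """For each CpG cytosine, the fraction of informative reads that still show C
    (methylated) rather than T (unmethylated and converted)."""
    levels = {}
    for pos in cpg_positions(reference):
        c_count = sum(1 for read in reads if read[pos] == "C")
        t_count = sum(1 for read in reads if read[pos] == "T")
        if c_count + t_count > 0:
            levels[pos] = c_count / (c_count + t_count)
    return levels


if __name__ == "__main__":
    reference = "ATCGGCCGTA"  # invented target with CpGs at positions 2 and 6
    # Three molecules methylated only at position 2, one methylated at both CpGs.
    reads = [bisulphite_convert(reference, {2})] * 3
    reads.append(bisulphite_convert(reference, {2, 6}))
    print(methylation_levels(reference, reads))
```

Because every read comes from a single clonally amplified molecule, counting C versus T across many reads gives a quantitative estimate per CpG rather than a single consensus call.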
