Australian National University
Shared by: hcj
-
Stats
- views:
- 0
- posted:
- 12/4/2012
- language:
- Unknown
- pages:
- 56
Document Sample


Coming to grips with next
generation sequencing:
RNA-seq and Gene expression
analysis
Conrad Burden
Mathematical Sciences Institute
Australian National University
Canberra
Next Generation sequencing technologies
Applied Biosystems SOLiD 4
Roche GS-FLX itanium
Illumina HiSeq 2000
Next Generation sequencing technologies
1.Roche (previously 454): Emulsion PCR (polymerase
chain reaction amplification on beads in water droplets
in oil suspension) followed by pyrosequencing in wells
454 GS20 100bp reads, ~20Mbp/run
Roche GS-FLX 250bp reads, ~100 Mbp/run (7.5 hours)
Roche GS-FLX Titanium
400bp reads, ~400 Mbp/run (10 hours)
(“long reads”: suitable for genome assembly)
Next Generation sequencing technologies
2. Illumina (previously Solexa): Bridge amplification in
flow cell and in situ sequencing
Solexa 1G 18bp reads, ~1 Gbp/run
Illumina GA 36bp reads, ~3 Gbp/run
Illumina GAII 75bp paired reads, ~10 Gbp/run (8 days)
Illumina GAIIx 75bp paired reads, ~40 Gbp/run (8 days)
Illumina HiSeq 100bp paired reads, ~200 Gbp/run (8 days)
(“short reads” – functional assays)
Next Generation sequencing technologies
3. Applied biosytems SOLiD (previously Agencourt):
emulsion PCR amplification, beads depositec onto
surface and sequences by ligation
SOLiD 3 50bp paired reads, ~50 Gbp /run (12 days)
SOLiD 4 50bp paired reads, ~100 Gbp/run (12 days)
SOLiD 4hq 75bp paired reads, ~ 300Gbp/run (14 days)
(“short reads” – functional assays)
•Next Generation sequencing technologies
They all have drawbacks:
• Roche (454): Homopolymer errors, i.e. has trouble
sequencing the length of long runs of the same letter
• Illumina (Solexa): Low quality at 3’ end of longer reads
• SOLiD: Double encoding – pairs of letters are read in
‘colour space’, and the sequence needs to be decoded
back into a nucleotide sequence
One application is RNA-seq:
Sequencing RNA to study the ‘Transcriptome’
• gene expression profiling
• expression of non-coding RNA
• determine isoforms: alternative splicing, transcription
start sites, end sites, ...
• allele specific expression
• ...
“Central dogma” of genetics:
DNA RNA protein
-C G- │
-T A- glu
-T A- │
-G C- │
-C G- ala
-A U- │
-C G- │
-G C- arg
-A U- │
“Central dogma” of genetics:
“Expression profiling”
means testing
which genes are
expressed in a cell by
DNA RNA protein
detecting which RNA
is present
-C G- │
-T A- glu
-T A- │
-G C- │
-C G- ala
-A U- │
-C G- │
-G C- arg
-A U- │
Procedure: (RNA-seq for expression profiling on Illumina)
• Start with total RNA from cell
• Poly-(A) pull-down to extract mRNA
• Fragment mRNA (from 300-10,000nt transcripts) to ~200nt
with, e.g., sonification or Mg catalysed hydrolysis
• Reverse transcribe to cDNA with random hexamer primers
• Ligate sequencing adapter
• Size select on gel
• PCR enrichment to produce a “cDNA library”
• Feed it through the sequencer to get short reads 25 to
100nt (see next slide)
• Map onto known exons (and worry about multiple
mappings, splices across exons, read errors, …)
Illumina: how it works
Alignment to a reference genome:
reference:
...ACCTGTGACCTGTCAATCGGGAATCTGAACTGTGAATCTC...
read: CTGTCAATCGGGAAT
Alignment to a reference genome:
reference:
...ACCTGTGACCTGTCAATCGGGAATCTGAACTGTGAATCTC...
read: CTGTCAATCGGGAAT
Alignment to a reference genome:
1. Burrows-Wheeler transform:
• First creates an index for reference genome
(computationally expensive but reusable and with small
memory footprint)
• Default mode of Bowtie, for example, is fast but only
finds the first exact match
• Returning multiple matches or allowing mismatches
slows the performance considerably
Implementations
• Bowtie
• BWA
• Soap2
Alignment to a reference genome:
2. Hash tables:
• First creates a hash table index for reference genome
(computationally expensive but reusable and has a large
memory footprint)
• BFAST uses index to find a candidate alignment and then
Smith-Waterman algorithm
• Slower than Burrows-Wheeler but finds multiple
alignments and can handle insertions and deletions
Implementations
• PerM
• SHRIMP
• BFAST
• ELAND
Alignment to a reference genome:
Example: A. Mortazavi et al. Nature Methods 5, 621-628 (2008)
mRNA from mouse liver and muscle, sequenced by Illumina,
mapped using ERANGE (= ELAND + multi-reads distributed in proportion to
similar nearby reads)
Distribution of multiple reads Distribution of uniquely mappable
reads (liver sample)
Compexity of the transcriptome:
Compexity of the transcriptome:
Dealing with splice junctions
TopHat: Trapnell et al., Bioinformatics 25, 1105-1111 (2009)
• Fast splice junction mapper for RNA-seq reads built
on top of the BowTie aligner
• Doesn’t rely on known splice sites
• Takes into account known donor-acceptor sites
• 2.2 million reads per CPU hour “an entire RNA-seq
experiment in less than a day on a standard desktop
computer”
Suppose all the counts have been mapped, the exons identified
and the results summarised into a nice table of counts mapped
onto each gene:
Sample A Sample B ... etc.
Gene
Rep 1 Rep 2 Rep 1 Rep 2
ENSG00000209432 4 6 35 45
ENSG00000209432 0 0 2 1
ENSG00000209432 110 96 177 203
typically
10,000 to ENSG00000209432 12685 10897 9246 9873
20,000 genes
ENSG00000212678 148 201 112 93
... etc.
This is not the end of the story. We still need to normalise,
estimate the uncertainties, detect differential expression ...
Normalisation
Ideally, one would like to convert counts to a measure of
absolute transcript abundance in the original sample or cell
(transcripts/cell or a molar concentration). Not so easy! ...
Normalisation
First idea: RPKM
(Reads per kilobase of exon model per million mapped reads)
- Mortazavi et al., Nature Methods, 5, 621-628
Assume that the number of reads from a given transcript is
proportional to (1) molar concentration and (2) length of transcript
# reads for this gene 10 6
Abundance in RPKM =
total # reads transcript length in kb
...not an absolute measurement, but in an arbitrary unit which is
scaled by the total number of reads, and therefore unique to the
particular lane of the particular run of the sequencer.
RPKM is OK for measuring relative abundance of transcripts within
one experiment (e.g. one lane of Illumina sequencer):
RNA spikes in Illumina:
• 300 and 1500nt
(arabidopsis) and 10000nt (-
phage)
• 104, 105, …, 109 transcripts
per 100ng mRNA
– data from Mortazavi et al.
... but is no use for comparing between experiments (e.g. for
detecting differential expression) because it is compositional data
Three genes, two are unchanged, the other is down-regulated:
350
250
150 150
100 100
Condition A Condition B
Three genes, two are unchanged, the other is down-regulated:
350
250
150 150
100 100
Condition A Condition B
Scale by total number of counts:
350/600
250/500
150/500
100/500
150/600
100/600
Condition A Condition B
... then the first two genes (incorrectly!) register as being up-regulated
Second idea: Why not use spike-ins?
Opinions differ on whether this could be made to work.
In M.D. Robinson and A. Oshlack “A scaling normalisation method for
differential expression analysis of RNA-seq data”, Genome Biology 11, R25 (2010),
it is claimed that
“In order to use spike-in controls for normalisation, the ratio of
the concentration of the spike to the sample must be kept
constant throughout the experiment. In practice this is difficult
to achieve and small variations will lead to biased estimation of
the normalisation factor.”
based on analysing the spike-ins of Mortazavi et al. of 1.2 × 104, …,
1.2 × 109 transcripts per 100ng mRNA
Third idea: Use the distribution of counts over the entire
genome to normalise between samples (implicitly
assumes most genes are not differentially expressed):
1. Upper quartile normalisation,
J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)
2. TMM (“trimmed mean of M values”) normalisation,
M.D. Robinson and A. Oshlack, Genome Biology 11 R25
(2010)
3. Median count ratio,
S. Anders and W. Huber, Genome Biology 11 R106
(2010)
1. Upper quartile normalisation,
J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)
density
Sample A
log(counts)
density
Sample B
log(counts)
1. Upper quartile normalisation,
J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)
density
Sample A
log(counts)
Upper quartiles
density
Sample B
log(counts)
1. Upper quartile normalisation,
J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)
density
Sample A
log(counts)
Upper quartiles Relative scale log (λUQ)
density
Sample B
log(counts)
2. TMM (“trimmed mean of M values”) normalisation,
M.D. Robinson and A. Oshlack, Genome Biology 11 R25
(2010)
Yig = read counts for gene g in sample i = 1, 2
Ni = total read counts for sample i = 1, 2
Y1g Y2g 1 Y1g
Y2g
M log log A log
, log
N1 N 2 2 N1
N 2
M-A plot
M = log(Y1/N1) – log(Y2/N2)
A = ½{log(Y1/N1) – log(Y2/N2)}
A few strongy expressed,
M-A plot differentially expressed
genes which shift the
quantiles ...
M = log(Y1/N1) – log(Y2/N2)
A = ½{log(Y1/N1) – log(Y2/N2)}
A few strongy expressed,
M-A plot differentially expressed
genes which shift the
quantiles ...
M = log(Y1/N1) – log(Y2/N2) ... trim them off!
A = ½{log(Y1/N1) – log(Y2/N2)}
Then ,from the trimmed subset of genes,
calculate a relative scaling factor from a weighted
average of M –values:
w g Mg
TMM
genes, g
w g
genes, g
where
1 1 1 1 1
w g
Y N1 Y2g N 2 asymptotic variance
1g
A few strongy expressed,
M-A plot differentially expressed
genes which shift the
quantiles ...
M = log(Y1/N1) – log(Y2/N2) ... trim them off!
log λTMM
A = ½{log(Y1/N1) – log(Y2/N2)}
3. Median count ratio,
S. Anders and W. Huber, Genome Biology 11 R106 (2010)
Yjg = read counts for gene g in sample j = 1, ..., m
Scaling factor for gene j
Y jg
s j median
ˆ
genes, g m m
1/
Yg
1
Relative scaling between samples 1 and 2 is
ˆ
s2
MCR
ˆ
s1
Probability distribution of read counts
Normalisation factors set a relative scale for each experiment for
the expected number of read counts for each gene, but does not
tell us about the distribution of read counts about the mean
Simplest hypothesis: Poisson Distribution
Recall Poisson is the limit of binomial as the number of ‘trials’ gets
big but the probability of ‘success’ gets small
binn, p Pois() as n , p 0, np
bin(50, 0.1) Pois(0.5)
> 0.1 g of RNA fragmented to 200 nt in Illumina flow cell
6 1023 Da/g
0.1106 g ~ 1012 RNA fragments
200nt 300Da/nt in flow cell
Total number of reads n ~ 107 RNA solution is not depleted
i.e., probability that a given read corresponds to a given expressed
gene g depends only on the relative abundance of that gene:
g
number of transcripts from gene in flow cell
pg
total number of transcripts in flow cell
typically ~ 10-6 -10-3
Then number of reads of gene g is Poisson with mean
npg ~ 10 1 10 4
Compare with 2 technical replicates of RNA-seq counts in RPKM
from a real data set of mouse transcript reads with artificially
generated Poisson data
Artificially generated data: 100,000
From Mortazavi et al., Nature pairs of Poison random numbers with
Methods 5, 621-628 (2008) means drawn from a log-uniform
distribution
Marioni et al., Genome Res. 18 1509-1517 (2008) have confirmed
that for technical replicates read counts are Poisson random
variables
However, in general there are further sources of variance from
biological replicates and library preparation:
From Anders and Huber,
Genome Biology, Over-dispersion:
11:R106 (2010), variance > mean
2 biological replicates of
RNA-seq data from fly
embryos
Poisson:
variance = mean
Overdispersion modelled with Negative Binomial random
variable:
read counts for gene g, sample i: Yig ∼ NB(μig, σig2)
Mean μig = cig lg λi scaling factor
for this sample
‘true’ transcript
concentration length
M.D.Robinson, et al., Bioinformatics
Variance σig2 = μig(1 + μigφg) 26: 139-140 (2010)
Software: edgeR
S. Anders and W. Huber, Genome
or σig 2=μ
ig + λi2 v(cig lg) Biology, 11: R106(2010)
Software: DESeq
fitted variance function
Detecting differential expression with RNA-seq
-suffers from ‘length bias’: longer transcripts have more
reads ⇒ more information ⇒ more power to detect DE!
A. Oshlack and M.J. Wakefield, Transcript length bias in RNA-seq data confounds
systems biology, Biology Direct, 4: 14 (2009)
Percentage DE
Percentage DE
Transcript length (bp)
(Oshlack and Wakefield use D = (Y1g – Y2g)/lg as a test statistic and a
t-test. A better test is to use R = Y1g /Y2g and test against ratio of two
Poissons with rate ratio λ1/λ2. But the qualitative result persists.
The following numerical simulation illustrates what’s going on
• 5,000 virtual ‘transcripts’ (c.f. ∼ 20,000 genes in real genome)
• length log-normally distributed 103.3 ± 0.3 s.d. bp.
• copy numbers log-normally distributed 103.0 ± 0.6 s.d.
• fragment transcripts at random points
• size select to 180 ≤ L ≤ 220 bp.
• produces a ‘library’ of ∼ 106-107 fragments (c.f. ∼ 1012 in real world)
• select sample of 50,000 fragments in virtual ‘sequencer’
(c.f. ∼ 107 in real world)
• map each one back to produce a table of counts for each gene
...
...
...
...
...
...
Differentially expressed
...
...
...
...
...
...
Not differentially expressed
...
...
...
...
...
...
Differentially expressed
...
...
...
...
...
...
Not differentially expressed
Test for differential expression (using Poisson rate ratio test
with relative normalisation λ1/λ2 determined from total read
counts for non-differentially expressed transcripts) shows
length bias:
Test for differential expression (using Poisson rate ratio test
with relative normalisation λ1/λ2 determined from total read
counts for non-differentially expressed transcripts) shows
length bias:
Why it’s happening:
Why it’s happening:
The length bias wil cause problems with compiling ranked lists of differentially
expressed genes. What can be done about it?
• Use a fixed window size instead of the whole transcript?
- this amounts to throwing away data ... not very satisfactory
• Incorporate a ‘weighting’ (i.e. fudge) factor into the statistic, e.g. Bullard et al.,
BMC Bioinf. 11:94 (2010): multiply t-statistic by √(transcript length)
- a bit ad hoc ... not very satisfactory
• Sorry, looks like we might be stuck with it!
Acknowledgements
Sue Wilson, Mathematical Sciences Institute, Australian
National University and School of Mathematics and Statistics,
University of New South Wales
Jen Taylor, Division of Plant Industry, CSIRO, Australia
Sumaira Qureshi, Mathematical Sciences Institute, Australian
National University
Jose Robles, Division of Plant Industry, CSIRO, Australia
Get documents about "