Australian National University

Shared by: hcj
Categories
Tags
-
Stats
views:
0
posted:
12/4/2012
language:
Unknown
pages:
56
Document Sample
scope of work template
							 Coming to grips with next
  generation sequencing:
RNA-seq and Gene expression
         analysis

            Conrad Burden
    Mathematical Sciences Institute
    Australian National University
               Canberra
Next Generation sequencing technologies




                                             Applied Biosystems SOLiD 4
Roche GS-FLX itanium




                       Illumina HiSeq 2000
 Next Generation sequencing technologies

1.Roche (previously 454):    Emulsion PCR (polymerase
  chain reaction amplification on beads in water droplets
  in oil suspension) followed by pyrosequencing in wells

454 GS20       100bp reads, ~20Mbp/run

Roche GS-FLX    250bp reads, ~100 Mbp/run (7.5 hours)

Roche GS-FLX Titanium
                400bp reads, ~400 Mbp/run (10 hours)

    (“long reads”: suitable for genome assembly)
 Next Generation sequencing technologies
2. Illumina (previously Solexa): Bridge amplification in
  flow cell and in situ sequencing

Solexa 1G          18bp reads, ~1 Gbp/run

Illumina GA        36bp reads, ~3 Gbp/run

Illumina GAII      75bp paired reads, ~10 Gbp/run (8 days)

Illumina GAIIx     75bp paired reads, ~40 Gbp/run (8 days)

Illumina HiSeq 100bp paired reads, ~200 Gbp/run (8 days)

     (“short reads” – functional assays)
 Next Generation sequencing technologies

3. Applied biosytems SOLiD (previously Agencourt):
  emulsion PCR amplification, beads depositec onto
  surface and sequences by ligation

SOLiD 3        50bp paired reads, ~50 Gbp /run (12 days)

SOLiD 4        50bp paired reads, ~100 Gbp/run (12 days)

SOLiD 4hq      75bp paired reads, ~ 300Gbp/run (14 days)

          (“short reads” – functional assays)
•Next Generation sequencing technologies

They all have drawbacks:

• Roche (454): Homopolymer errors, i.e. has trouble
sequencing the length of long runs of the same letter

• Illumina (Solexa): Low quality at 3’ end of longer reads

• SOLiD: Double encoding – pairs of letters are read in
‘colour space’, and the sequence needs to be decoded
back into a nucleotide sequence
One application is RNA-seq:

Sequencing RNA to study the ‘Transcriptome’

• gene expression profiling
• expression of non-coding RNA
• determine isoforms: alternative splicing, transcription
              start sites, end sites, ...
• allele specific expression
• ...
“Central dogma” of genetics:

DNA           RNA              protein

 -C            G-                │
 -T            A-               glu
 -T            A-                │
 -G            C-                │
 -C            G-               ala
 -A            U-                │
 -C            G-                │
 -G            C-               arg
 -A            U-                │
“Central dogma” of genetics:
                         “Expression profiling”
                                  means testing
                                which genes are
                              expressed in a cell by
DNA              RNA              protein
                              detecting which RNA
                                    is present
 -C               G-                  │
 -T               A-                 glu
 -T               A-                  │
 -G               C-                  │
 -C               G-                 ala
 -A               U-                  │
 -C               G-                  │
 -G               C-                 arg
 -A               U-                  │
Procedure: (RNA-seq for expression profiling on Illumina)

• Start with total RNA from cell
• Poly-(A) pull-down to extract mRNA
• Fragment mRNA (from 300-10,000nt transcripts) to ~200nt
  with, e.g., sonification or Mg catalysed hydrolysis
• Reverse transcribe to cDNA with random hexamer primers
• Ligate sequencing adapter
• Size select on gel
• PCR enrichment to produce a “cDNA library”
• Feed it through the sequencer to get short reads 25 to
  100nt (see next slide)
• Map onto known exons (and worry about multiple
  mappings, splices across exons, read errors, …)
Illumina: how it works
Alignment to a reference genome:


reference:
...ACCTGTGACCTGTCAATCGGGAATCTGAACTGTGAATCTC...


read:          CTGTCAATCGGGAAT
Alignment to a reference genome:


reference:
...ACCTGTGACCTGTCAATCGGGAATCTGAACTGTGAATCTC...


read:          CTGTCAATCGGGAAT
Alignment to a reference genome:

  1. Burrows-Wheeler transform:
     •   First creates an index for reference genome
         (computationally expensive but reusable and with small
         memory footprint)
     •   Default mode of Bowtie, for example, is fast but only
         finds the first exact match
     •   Returning multiple matches or allowing mismatches
         slows the performance considerably

     Implementations

     •   Bowtie
     •   BWA
     •   Soap2
Alignment to a reference genome:

  2. Hash tables:
     •   First creates a hash table index for reference genome
         (computationally expensive but reusable and has a large
         memory footprint)
     •   BFAST uses index to find a candidate alignment and then
         Smith-Waterman algorithm
     •   Slower than Burrows-Wheeler but finds multiple
         alignments and can handle insertions and deletions

     Implementations

     •   PerM
     •   SHRIMP
     •   BFAST
     •   ELAND
Alignment to a reference genome:

Example: A. Mortazavi et al. Nature Methods 5, 621-628 (2008)

mRNA from mouse liver and muscle, sequenced by Illumina,
mapped using ERANGE (= ELAND + multi-reads distributed in proportion to
similar nearby reads)




  Distribution of multiple reads     Distribution of uniquely mappable
                                     reads (liver sample)
Compexity of the transcriptome:
Compexity of the transcriptome:
Dealing with splice junctions
TopHat: Trapnell et al., Bioinformatics 25, 1105-1111 (2009)
   • Fast splice junction mapper for RNA-seq reads built
   on top of the BowTie aligner
   • Doesn’t rely on known splice sites
   • Takes into account known donor-acceptor sites


   • 2.2 million reads per CPU hour “an entire RNA-seq
   experiment in less than a day on a standard desktop
   computer”
Suppose all the counts have been mapped, the exons identified
and the results summarised into a nice table of counts mapped
onto each gene:
                                  Sample A         Sample B       ... etc.
                   Gene
                                 Rep 1   Rep 2 Rep 1 Rep 2
               ENSG00000209432      4        6      35    45
               ENSG00000209432      0        0       2        1
               ENSG00000209432    110        96   177    203
  typically
 10,000 to     ENSG00000209432   12685 10897      9246   9873
20,000 genes
               ENSG00000212678    148     201     112     93

                   ... etc.


 This is not the end of the story. We still need to normalise,
 estimate the uncertainties, detect differential expression ...
Normalisation
Ideally, one would like to convert counts to a measure of
absolute transcript abundance in the original sample or cell
(transcripts/cell or a molar concentration). Not so easy! ...
Normalisation
First idea: RPKM
(Reads per kilobase of exon model per million mapped reads)
 - Mortazavi et al., Nature Methods, 5, 621-628

Assume that the number of reads from a given transcript is
proportional to (1) molar concentration and (2) length of transcript

                          # reads for this gene  10 6
 Abundance in RPKM =
                     total # reads  transcript length in kb

...not an absolute measurement, but in an arbitrary unit which is
scaled by the total number of reads, and therefore unique to the
    particular lane of the particular run of the sequencer.
RPKM is OK for measuring relative abundance of transcripts within
one experiment (e.g. one lane of Illumina sequencer):




                                         RNA spikes in Illumina:
                                         • 300 and 1500nt
                                         (arabidopsis) and 10000nt (-
                                         phage)
                                         • 104, 105, …, 109 transcripts
                                         per 100ng mRNA

                                         – data from Mortazavi et al.




   ... but is no use for comparing between experiments (e.g. for
detecting differential expression) because it is compositional data
Three genes, two are unchanged, the other is down-regulated:
                                  350

                                                               250
                      150                           150
              100                          100



                    Condition A                  Condition B
Three genes, two are unchanged, the other is down-regulated:
                                   350

                                                                   250
                        150                             150
               100                             100



                     Condition A                     Condition B

Scale by total number of counts:
                              350/600
                                                               250/500
                                                        150/500
                                              100/500
                        150/600
              100/600


                     Condition A                     Condition B
... then the first two genes (incorrectly!) register as being up-regulated
Second idea: Why not use spike-ins?

Opinions differ on whether this could be made to work.
In M.D. Robinson and A. Oshlack “A scaling normalisation method for
differential expression analysis of RNA-seq data”, Genome Biology 11, R25 (2010),
it is claimed that

    “In order to use spike-in controls for normalisation, the ratio of
    the concentration of the spike to the sample must be kept
    constant throughout the experiment. In practice this is difficult
    to achieve and small variations will lead to biased estimation of
    the normalisation factor.”

based on analysing the spike-ins of Mortazavi et al. of 1.2 × 104, …,
1.2 × 109 transcripts per 100ng mRNA
Third idea: Use the distribution of counts over the entire
genome to normalise between samples (implicitly
assumes most genes are not differentially expressed):

1. Upper quartile normalisation,
   J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)

2. TMM (“trimmed mean of M values”) normalisation,
   M.D. Robinson and A. Oshlack, Genome Biology 11 R25
   (2010)

3. Median count ratio,
   S. Anders and W. Huber, Genome Biology 11 R106
   (2010)
1. Upper quartile normalisation,
   J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)




                 density
      Sample A

                                      log(counts)
                 density




      Sample B
                                      log(counts)
1. Upper quartile normalisation,
   J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)




                        density
      Sample A

                                      log(counts)
      Upper quartiles
                        density




      Sample B
                                      log(counts)
1. Upper quartile normalisation,
   J.H. Bullard et al., BMC Bioinformatics 11, 94 (2010)




                        density
      Sample A

                                      log(counts)
      Upper quartiles             Relative scale log (λUQ)
                        density




      Sample B
                                      log(counts)
2. TMM (“trimmed mean of M values”) normalisation,
   M.D. Robinson and A. Oshlack, Genome Biology 11 R25
   (2010)


     Yig = read counts for gene g in sample i = 1, 2

     Ni = total read counts for sample i = 1, 2

         Y1g   Y2g  1  Y1g         
                                          Y2g 
  M  log  log  A     log 
                          ,    log               
         N1    N 2  2  N1 
                                               
                                          N 2 
                                                 
M-A plot

    M = log(Y1/N1) – log(Y2/N2)




                                  A = ½{log(Y1/N1) – log(Y2/N2)}
                                                                   A few strongy expressed,
M-A plot                                                           differentially expressed
                                                                   genes which shift the
                                                                   quantiles ...
    M = log(Y1/N1) – log(Y2/N2)




                                  A = ½{log(Y1/N1) – log(Y2/N2)}
                                                                   A few strongy expressed,
M-A plot                                                           differentially expressed
                                                                   genes which shift the
                                                                   quantiles ...
    M = log(Y1/N1) – log(Y2/N2)                                      ... trim them off!




                                  A = ½{log(Y1/N1) – log(Y2/N2)}
Then ,from the trimmed subset of genes,
calculate a relative scaling factor from a weighted
average of M –values:

                             w        g   Mg
                 TMM
                        
                            genes, g

                                w         g
                              genes, g



where
          1  1   1   1            1
            
   w g  
         Y   N1  Y2g N 2  asymptotic variance
          1g             
                                                                   A few strongy expressed,
M-A plot                                                           differentially expressed
                                                                   genes which shift the
                                                                   quantiles ...
    M = log(Y1/N1) – log(Y2/N2)                                      ... trim them off!




                                                                          log λTMM




                                  A = ½{log(Y1/N1) – log(Y2/N2)}
3. Median count ratio,
   S. Anders and W. Huber, Genome Biology 11 R106 (2010)

   Yjg = read counts for gene g in sample j = 1, ..., m

 Scaling factor for gene j
                                        Y jg
                   s j  median
                   ˆ
                         genes, g    m    m
                                            1/

                                     
                                     Yg 
                                     1 
 Relative scaling between samples 1 and 2 is

                           ˆ
                             s2
                    MCR
                           
                             ˆ
                             s1
Probability distribution of read counts
Normalisation factors set a relative scale for each experiment for
the expected number of read counts for each gene, but does not
tell us about the distribution of read counts about the mean

Simplest hypothesis: Poisson Distribution
Recall Poisson is the limit of binomial as the number of ‘trials’ gets
big but the probability of ‘success’ gets small
     binn, p  Pois() as n  , p  0, np  

                  bin(50, 0.1)                          Pois(0.5)
> 0.1 g of   RNA fragmented to 200 nt in Illumina flow cell 

                 6 1023 Da/g
 0.1106 g                    ~ 1012             RNA fragments
               200nt  300Da/nt                         in flow cell

Total number of reads n ~ 107  RNA solution is not depleted
i.e., probability that a given read corresponds to a given expressed
gene g depends only on the relative abundance of that gene:


                                         g
         number of transcripts from gene in flow cell
    pg 
            total number of transcripts in flow cell
        typically ~ 10-6 -10-3
Then number of reads of gene g is Poisson with mean
                         npg ~ 10 1 10 4
Compare with 2 technical replicates of RNA-seq counts in RPKM
from a real data set of mouse transcript reads with artificially
generated Poisson data




                                      Artificially generated data: 100,000
   From Mortazavi et al., Nature      pairs of Poison random numbers with
   Methods 5, 621-628 (2008)          means drawn from a log-uniform
                                      distribution
Marioni et al., Genome Res. 18 1509-1517 (2008) have confirmed
that for technical replicates read counts are Poisson random
variables

However, in general there are further sources of variance from
biological replicates and library preparation:

From Anders and Huber,
Genome Biology,                                       Over-dispersion:
11:R106 (2010),                                       variance > mean
2 biological replicates of
RNA-seq data from fly
embryos

                                                      Poisson:
                                                      variance = mean
Overdispersion modelled with Negative Binomial random
variable:
read counts for gene g, sample i: Yig ∼ NB(μig, σig2)

    Mean             μig = cig lg λi               scaling factor
                                                  for this sample

               ‘true’              transcript
           concentration             length


                                                   M.D.Robinson, et al., Bioinformatics
Variance       σig2 = μig(1 + μigφg)               26: 139-140 (2010)
                                                   Software: edgeR

                                                   S. Anders and W. Huber, Genome
      or       σig   2=μ
                           ig +   λi2 v(cig lg)    Biology, 11: R106(2010)
                                                   Software: DESeq
                     fitted variance function
     Detecting differential expression with RNA-seq
     -suffers from ‘length bias’: longer transcripts have more
     reads ⇒ more information ⇒ more power to detect DE!
     A. Oshlack and M.J. Wakefield, Transcript length bias in RNA-seq data confounds
     systems biology, Biology Direct, 4: 14 (2009)
Percentage DE




                                              Percentage DE




                Transcript length (bp)
(Oshlack and Wakefield use D = (Y1g – Y2g)/lg as a test statistic and a
t-test. A better test is to use R = Y1g /Y2g and test against ratio of two
Poissons with rate ratio λ1/λ2. But the qualitative result persists.


The following numerical simulation illustrates what’s going on

• 5,000 virtual ‘transcripts’ (c.f. ∼ 20,000 genes in real genome)
• length log-normally distributed 103.3 ± 0.3 s.d. bp.
• copy numbers log-normally distributed 103.0 ± 0.6 s.d.
• fragment transcripts at random points
• size select to 180 ≤ L ≤ 220 bp.
• produces a ‘library’ of ∼ 106-107 fragments (c.f. ∼ 1012 in real world)
• select sample of 50,000 fragments in virtual ‘sequencer’
                                 (c.f. ∼ 107 in real world)
• map each one back to produce a table of counts for each gene
                  ...
...




                        ...
      ...

            ...




                              ...
                                    Differentially expressed
...

      ...

            ...


                  ...



                        ...

                              ...
                                    Not differentially expressed
                  ...
...




                        ...
      ...

            ...




                              ...
                                    Differentially expressed
...

      ...

            ...


                  ...



                        ...

                              ...
                                    Not differentially expressed
Test for differential expression (using Poisson rate ratio test
with relative normalisation λ1/λ2 determined from total read
counts for non-differentially expressed transcripts) shows
length bias:
Test for differential expression (using Poisson rate ratio test
with relative normalisation λ1/λ2 determined from total read
counts for non-differentially expressed transcripts) shows
length bias:
Why it’s happening:
Why it’s happening:
The length bias wil cause problems with compiling ranked lists of differentially
expressed genes. What can be done about it?
• Use a fixed window size instead of the whole transcript?
   - this amounts to throwing away data ... not very satisfactory
• Incorporate a ‘weighting’ (i.e. fudge) factor into the statistic, e.g. Bullard et al.,
  BMC Bioinf. 11:94 (2010): multiply t-statistic by √(transcript length)




         - a bit ad hoc ... not very satisfactory
• Sorry, looks like we might be stuck with it!
                   Acknowledgements

Sue Wilson, Mathematical Sciences Institute, Australian
National University and School of Mathematics and Statistics,
University of New South Wales

Jen Taylor, Division of Plant Industry, CSIRO, Australia

Sumaira Qureshi, Mathematical Sciences Institute, Australian
National University

Jose Robles, Division of Plant Industry, CSIRO, Australia

						
Related docs
Other docs by hcj