RNA-seq data analysis workshop
Gabriel Kolle, PhD
Bioinformatics Support Scientist
    Overview



     Introduction to RNA-seq
     Experimental design
     Data analysis
      –   Read alignment
      –   Splicing
      –   Expression calculation and differential expression
      –   De novo transcriptome assembly




     Measuring RNA expression




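     [Figure: one gene with three transcription start sites (TSS1, TSS2, TSS3) and two
      polyadenylation sites (PAS1, PAS2, each marked pA); the resulting transcripts, each
      drawn from its ATG to its AAAAAAA tail; and the RNAseq signal below them]
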
    RNA-seq applications




      • mRNA-Seq
      • miRNA-Seq
      • ncRNA
      • NET-Seq
      • Ribo-Seq
      • Alt Splicing
      • Gene Fusions

    Figure modified from Nature Reviews Genetics 8, 413-423 (June 2007)

    Workflow – library prep and sequencing



       RNA extraction
         • Removal of abundant RNAs? (polyA+, ribosomal depletion, DSN etc.)

       cDNA synthesis
         • Stranded or not?

       Library production and sequencing
         • Sequence depth?
         • Paired end?

    RNA extraction and preparation



       Preparation methods
         • polyA+
         • Ribo-depleted
         • DSN
         • RNA-Chip
         • Other

       Considerations
         • Contaminant sequence removal
         • Non-coding?
         • Background toleration?

    Maintaining the strand?

       Strand non-specific
         • Random primed cDNA synthesis
         • Strand resolved bioinformatically for most genes in well-annotated genomes

       Strand specific
         • RNA ligation/dUTP
         • Important when trying to resolve overlapping genes
         • Resolve strand for single exon and non-coding genes (when annotation not
           available)

    Sequence depth and type

     GEx: 36-50bp single end
       – Gene level expression
       – Sample classification
     Typical GEx +: single or paired end (2x50)
       – Gene, exon expression
       – Alt splicing (med-high expressed genes)
     Transcript complexity: paired-end (2x50, 2x100)
       – Alt-splicing
       – SNP/Indel identification
     Discovery: paired-end (2x50, 2x100)
       – Transcriptome de-novo
       – High resolution SNP/Indel

     Read depth scale: 10, 20, 30, 40, 50, 60 million

     Sensitivity increases with the number of reads
     Longer reads and paired-end reads give better mapping and better resolution of splicing

    Paired reads



     [Figure: paired reads at either end of a cDNA insert]

     Better mapping rate (increased chance of resolving between homologous genes)
       – Higher quality SNP calling
     Resolve exon-exon usage (more effective coverage across splice junctions)
       – Better resolution for fusion events
     Essential for de novo assembly
       – Overcome any local dips in coverage

     Replication
      [Figures: "High Reproducibility with Low Inputs"; "R2 Correlation Between Replicates"]

      Technical variability is typically very low, so technical replication is generally not
      required
      Biological replication (2+ samples) is highly recommended


      http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf
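
      The "R2 correlation between replicates" mentioned on this slide can be illustrated
      with a short calculation. The sketch below is only an illustration: the counts are
      made up, the function name is hypothetical, and the log2(count + 1) transform is an
      assumption (a common way to stop a few highly expressed genes dominating the
      correlation), not something the slide specifies.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return (cov * cov) / (var_x * var_y)

# Hypothetical per-gene read counts for two replicates of the same sample.
rep1_counts = [10, 200, 35, 0, 1500]
rep2_counts = [12, 180, 40, 1, 1450]

# Assumption: compare on a log2(count + 1) scale.
log_rep1 = [math.log2(c + 1) for c in rep1_counts]
log_rep2 = [math.log2(c + 1) for c in rep2_counts]

r2 = r_squared(log_rep1, log_rep2)
print(f"R^2 between replicates: {r2:.3f}")
```

      Values close to 1 indicate that the two replicates report very similar expression
      levels across genes.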
      RNA-seq – data, reads and coverage




     [Figure: the same gene model as before (TSS1, TSS2, TSS3; PAS1, PAS2, each marked pA)
      with its transcripts drawn from ATG to AAAAAAA, and below them a "Reads" track and
      a "Coverage" track]
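
     The relationship between aligned reads and coverage can be sketched in a few lines.
     This is a toy illustration rather than what any particular tool does: reads are
     assumed to be simple (start, end) intervals with 0-based, end-exclusive coordinates,
     spliced alignments are ignored, and the function name is hypothetical.

```python
def per_base_coverage(reads, region_start, region_end):
    """Count how many reads cover each base of a region.

    reads: iterable of (start, end) tuples, 0-based, end-exclusive.
    Returns a list with one count per base from region_start to region_end - 1.
    """
    coverage = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip the read to the region so out-of-range bases are ignored.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            coverage[pos - region_start] += 1
    return coverage

# Three hypothetical overlapping reads in a 10 bp window.
reads = [(0, 5), (3, 8), (4, 9)]
coverage = per_base_coverage(reads, 0, 10)
print(coverage)  # [1, 1, 1, 2, 3, 2, 2, 2, 1, 0]
```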


     Typical data analysis workflow

     1. Align single-ended reads to:
         o Reference sequence (chromosomes)
         o Known splice junctions
         o Abundant sequences (rRNA); reads aligning here are discarded

     2. Optional (find splicing junctions de novo)

     3. Post alignment (starting from the BAM files)
         o Determine expression levels of genes, exons and splice junctions
           (expression counting)
         o Combine samples and determine differential expression

     RNA-Seq tools



      Alignment
        • TopHat (Bowtie)
        • BWA

      Expression counting
        • Cufflinks
        • HTSeq

      Differential expression
        • Cufflinks
        • R tools (DESeq, edgeR)

     Outputs


      BAM files
        – Store alignment and read information (can be used to archive experiment)
        – Compatible with many browsers (coverage, read and base level visualisation)

      Tables of gene, exon and splicing data
        – Raw counts
        – RPKM (normalised for length and sequence depth)
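
      RPKM stands for reads per kilobase of transcript per million mapped reads. The
      sketch below shows how a raw count is normalised for gene length and sequence depth;
      the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2 kb gene in a library of 20 million mapped reads.
example = rpkm(500, 2000, 20_000_000)
print(example)  # 12.5
```

      Doubling either the gene length or the library size while holding the read count
      fixed halves the RPKM, which is what makes values comparable across genes and
      samples.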



     RNA-Seq Tools: TopHat / Cufflinks



     Features
      Supports paired-end reads
      Discovers novel splice junctions
      Transcript quantification of individual gene isoforms
      Discovers novel transcripts and isoforms
      Calculates differential expression

     TopHat / Cufflinks workflow


      Workflow   Task                                  Software tool   Description

      Align      Map reads                             Bowtie          Aligns reads to genome assembly
                 (Genome + Annotations)

      Annotate   Identify transcripts and junctions    TopHat          PE or SR reads, identifies splice
                                                                       junctions and exons (no gene
                                                                       models needed)

      Analyze    Expression levels,                    Cufflinks       Uses TopHat output to assemble
                 Differential expression                               transcripts and estimate abundance
                 Visualize                             Broad IGV

     Differential Expression Workflow for 2 Samples
     with Existing Gene Annotation
     Splice Junction Discovery and Alignment
       Fastq 1 + R A -> TopHat -> BAM, BEDs
       Fastq 2 + R A -> TopHat -> BAM, BEDs

     Transcript Quantification and Differential Expression
       BAM (sample 1) + BAM (sample 2) -> CuffDiff ->
         – Normalized expression (FPKM)
         – Differential Expression, Splicing, CDS, Promoters

     R A = Reference Sequence and Gene Annotation
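
     The two-sample workflow on this slide could be scripted roughly as follows. This is
     a sketch under stated assumptions, not the presenter's pipeline: all file and
     directory names are hypothetical, the commands are only built and printed (not
     run), and only TopHat's -G (annotation GTF) and -o (output directory) options and
     Cuffdiff's -o option are used. TopHat writes its alignments to accepted_hits.bam
     inside its output directory.

```python
def tophat_command(index_prefix, fastq, gtf, out_dir):
    """Build a TopHat call: align one FASTQ file using a gene annotation."""
    return ["tophat", "-G", gtf, "-o", out_dir, index_prefix, fastq]

def cuffdiff_command(gtf, bam_1, bam_2, out_dir):
    """Build a Cuffdiff call comparing two aligned samples."""
    return ["cuffdiff", "-o", out_dir, gtf, bam_1, bam_2]

# Hypothetical inputs: a Bowtie index prefix, an annotation file and two samples.
index_prefix = "genome_index"
gtf = "genes.gtf"
samples = {"sample1": "sample1.fastq", "sample2": "sample2.fastq"}

commands = []
bam_files = []
for name, fastq in samples.items():
    out_dir = f"tophat_{name}"
    commands.append(tophat_command(index_prefix, fastq, gtf, out_dir))
    # TopHat's alignment output file within its output directory.
    bam_files.append(f"{out_dir}/accepted_hits.bam")

commands.append(cuffdiff_command(gtf, bam_files[0], bam_files[1], "cuffdiff_out"))

for cmd in commands:
    print(" ".join(cmd))
```

     Each list could be passed to subprocess.run once the tools are installed and the
     input files exist.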




     Visualization of TopHat / Cuffdiff Output in IGV




     [Figure: IGV screenshots with Liver and Skeletal tracks, showing differential read
      coverage in IGV and spliced read alignments in IGV]

     Visualising Junction usage in IGV




     [Figure: IGV screenshots of TopHat junctions, showing an alt splice event and
      differential junction usage]

     Differential Expression: CuffDiff output
     reports for genes, isoforms, coding seqs, and promoters




      Pair-wise differential expression
        – Columns: FPKMs, log fold-change, FDR-adjusted p-value

      FPKM for all samples
        – Columns: FPKM 1, FPKM 2, …
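
      The log fold-change and FDR-adjusted p-value columns can be illustrated with a
      small calculation. This sketch is not Cuffdiff's implementation: the pseudocount
      is an assumption added to avoid dividing by zero, the Benjamini-Hochberg procedure
      is used as a standard example of FDR adjustment, and the input values are made up.

```python
import math

def log2_fold_change(fpkm_1, fpkm_2, pseudocount=1.0):
    """log2 of (sample 2 / sample 1); the pseudocount is an assumption."""
    return math.log2((fpkm_2 + pseudocount) / (fpkm_1 + pseudocount))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

fold_change = log2_fold_change(3.0, 7.0)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(fold_change)                        # 1.0
print([round(q, 4) for q in adjusted])    # [0.04, 0.0533, 0.0533, 0.2]
```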


     De Novo Assembly



      De novo transcriptome assembly is a powerful method for finding genes in species
      without a reference
      De novo methods need to take into account:
       – Dramatic differences in coverage
       – Alternatively spliced transcripts
       – Overlapping transcripts

     Transcriptome de-novo - Trinity




          • Inchworm assembles the RNA-Seq data into the unique sequences of
            transcripts
          • Chrysalis groups the Inchworm contigs into clusters and constructs
            complete de Bruijn graphs for each cluster
          • Butterfly processes the individual graphs, ultimately reporting full-length
            transcripts for alternatively spliced isoforms



                           http://trinityrnaseq.sourceforge.net/
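
      The de Bruijn graphs mentioned above can be illustrated with a toy example. This
      is a minimal sketch, not Trinity's implementation: reads are split into k-mers,
      each k-mer becomes an edge from its first k-1 bases to its last k-1 bases, and a
      node with more than one outgoing edge marks a point where sequences diverge (as
      they would at an alternative splicing event). The reads and the function name
      are hypothetical.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Three hypothetical reads; the last two share a prefix and then diverge.
reads = ["ATGGCG", "GGCGTA", "GGCGAA"]
graph = build_de_bruijn(reads, 4)

# Nodes with more than one successor are branch points in the graph.
branching = {node: sorted(nexts) for node, nexts in graph.items() if len(nexts) > 1}
print(branching)  # {'GCG': ['CGA', 'CGT']}
```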


     RNAseq publications



        ENCODE project, RNAseq standards document (June 2011)

          http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf


        Garber et al., Nat Methods (review of computational methods for RNA-seq)

          http://www.nature.com/nmeth/journal/v8/n6/full/nmeth.1613.html

     RNAseq Software



        Software
          CASAVA:      available from Illumina iCom (myillumina.com)
          Bowtie:      http://bowtie-bio.sourceforge.net/index.shtml
          TopHat:      http://tophat.cbcb.umd.edu/
          Cufflinks:   http://cufflinks.cbcb.umd.edu/
          IGV:         http://www.broadinstitute.org/igv/
          Trinity:     http://trinityrnaseq.sourceforge.net/

        Genomes:
          iGenomes:    https://icom.illumina.com/message/iGenome




     What you learned today


      RNA-seq basics
      Important experimental design considerations
      Alignment of RNA-seq data
      Calculating expression level
      Visualising RNA-seq data
      RNA-seq without a reference




     Questions?



