Slide 1 - Informatics Institute

Prediction of RNA secondary structure
•    Features of RNA secondary structure,
•    Dot matrix analysis, Global Energy
     Minimization Methods, Mfold, VIENNA
     RNA package
•    Dynamic Programming methods and
     Sequence Covariation methods

Computational analysis of microRNAs
 •   Small RNA world
 •   Computational identification of miRNAs
 •   Prediction of miRNA targets
Prediction of RNA
secondary structure
RNA follows the same basic rules of base-pairing as DNA,
but short single-stranded RNA molecules can take a
variety of 3D shapes (tRNA, ribozymes, splicing etc…)

•Information for self-assembly

•the genetic code specifying the order of AA

•control of the beginning and ends of coding sequences

•splicing signals

•determination of the stability and its relative transcriptional

•regulation of gene expression
    What is RNA secondary structure?

• RNA secondary structure is similar to an alignment of
protein and nucleic acid sequences, except that the sequence
folds back on itself and “complementary bases” pair rather
than identical or similar bases.

• Also, an alignment of 2 or more biosequences is a statement
about an inferred evolutionary history. In contrast, “not
necessarily the sequence but structure conservation is most
important with RNA”

Main Points
•   RNA structure is dynamic in solution, i.e. constantly
    fluctuating between different folded states

•   There are many alternative structures that are nearly
    identical in energy (both predicted and actual)

•   Highly sensitive to solution conditions, e.g. salt and

•   Highly sensitive to protein binding

•   Tertiary structure (e.g. pseudoknots) is important

•   Biologically important structure may not have lowest
    predicted free energy, but it should be one of the lower
    ones - must look at sub-optimal structures
•   Three dimensional structure difficult to determine due
    to flexibility of molecule
•   Most analysis of correctness must therefore rely on
    phylogenetically determined models
•   Phylogenetic models look for invariant base pairs, but
    may not identify all unique structures
•   Structural information also comes from nuclease
    digestion studies and sometimes crosslinking
The complementary bases, C-G and A-U form stable base pairs with each
other through the creation of hydrogen bonds between donor and acceptor
sites on the bases. These are called Watson-Crick (W-C) base pairs.
In addition, we consider the weaker G-U wobble pair, where the bases bond
in a skewed fashion. All of these are called canonical base pairs.
Other base pairs occur, some of which are stable. They are called
non-canonical base pairs.
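
The pair classes described above can be written down as a small lookup. This is only an illustrative sketch; the names classify_pair and is_canonical are hypothetical and do not come from any RNA package.

```python
# Watson-Crick and wobble pairs as described above; anything else is non-canonical.
WATSON_CRICK = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_pair(a, b):
    """Return 'watson-crick', 'wobble' or 'non-canonical' for two RNA bases."""
    pair = (a.upper(), b.upper())
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    return "non-canonical"


def is_canonical(a, b):
    """Canonical pairs are the Watson-Crick pairs plus the G-U wobble pair."""
    return classify_pair(a, b) != "non-canonical"
```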
• Most common
• Biologically informative
• Difficult to compare
A computer predicted folding
of Bacillus subtilis RNase P RNA

  A circular representation of the
  B. subtilis folding.
  The nucleotides are stretched out uniformly
  along the circumference of a circle and the
  base pairs are represented by circular arcs
  that link paired bases and meet the circle at
  right angles.

  The triangular image in Figure
  is referred to as an RNA
  structure dot plot
  • Plot sequence vs. reverse complement

  • Possible stems run perpendicular to axis of symmetry
• Less common
• Is used in RNA literature
•Much easier to see similarity than “squiggles”
• Good for revealing pattern of nested stems
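
A minimal sketch of the dot plot idea, assuming canonical pairs only (G-C, A-U, G-U): position (i, j) gets a dot when base i could pair with base j, which is equivalent to plotting the sequence against its reverse complement. The function names are hypothetical. Potential stems show up as runs of dots along anti-diagonals, i.e. perpendicular to the main diagonal.

```python
# Canonical pairs only; this choice is an assumption of the sketch.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def pairing_dot_plot(seq):
    """Return an n x n boolean matrix: True where seq[i] can pair with seq[j]."""
    n = len(seq)
    return [[(seq[i], seq[j]) in PAIRS for j in range(n)] for i in range(n)]


def render(seq):
    """Render the dot plot as text, one row per base."""
    plot = pairing_dot_plot(seq)
    lines = ["  " + seq]
    for i, row in enumerate(plot):
        lines.append(seq[i] + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("GGGAAACCC"))
```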


Stem and loop/hairpin loop
Bulge loop
Interior loop
Junctions or multi-loops

Interactions Among Secondary Structures
   Kissing Hairpins
   Hairpin-bulge contacts
RNA structure prediction methods:

•   self-complementary regions (Dot Plot Analysis)

•   most energy stable molecules:

    Base-Pair Maximization
    Free Energy Methods

•   conserved patterns of base-pairing during

       Covariance Models
-    Energy minimization
     – dynamic programming approach
     – does not require prior sequence alignment
     – require estimation of energy terms contributing to
       secondary structure
     Comparative sequence analysis
     – Using sequence alignment to find conserved
       residues and covariant base pairs.
     – most trusted
            Development of RNA prediction

•Tinoco et al. 1971 – extrapolation from studies on small

•Pipas and McMahon, 1975 – computer programs
 estimating all possible structures in tRNAs

•Nussinov and Jacobson, 1980 – precise and efficient
 algorithm for structure predictions (two scoring matrices

•Zuker and Stiegler, 1981 – dynamic programming algorithm

•Jaeger et al., 1989, 1990, Zuker, 1994 – MFOLD

    •Set of possible structures within a given energy range
    •Indication of reliability
    •Uses covariance information

•Wuchty et al., 1999 – the partition function method
   •Vienna RNA group
Self-complementary regions in RNA
Dot matrix method - search for self-complementary regions
                          (long window, many matches)

                               Possible stems run perpendicular
                               To axis of symmetry
1.   StemLoop

StemLoop finds stems (inverted repeats) within a sequence. You specify the
minimum stem length, minimum and maximum loop sizes, and the minimum
number of bonds per stem. All stems or only the best stems can be displayed on
your screen or written into a file.

2. DotPlot

DotPlot makes a dot-plot with the
output file from Compare or StemLoop.

Calculates score over a window
Finds stems over a threshold
Minimum/maximum loopsize
Sort by position or score
Inverted repeats only
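
The kind of search described above can be sketched as follows. This is not the StemLoop program itself; it is a simplified stand-in that reports only perfect stems (no bulges or internal loops), and the function name and parameter names (min_stem, min_loop, max_loop) are assumptions chosen to mirror the options listed above.

```python
# Canonical pairs only; this choice is an assumption of the sketch.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def find_stems(seq, min_stem=3, min_loop=3, max_loop=8):
    """Return (i, j, stem_length, loop_size) tuples for perfect stems.

    (i, j) is the outermost pair (0-based); the stem extends inward.
    Results are ordered by position.
    """
    n = len(seq)
    stems = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that merely continue a stem starting further out.
            if i > 0 and j < n - 1 and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Extend inward while the loop stays at least min_loop long.
            while (j - i - 2 * length - 1 >= min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            loop = j - i - 2 * length + 1
            if length >= min_stem and loop <= max_loop:
                stems.append((i, j, length, loop))
    return stems
```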
RNA Folding by Energy Minimization

The quickest and easiest route to RNA structure prediction is through the use of
simple energy rules. One way is to assign an energy to each base pair in a secondary
structure. That is, there is a function e such that e(ri, rj) is the energy of a base pair.
The energy, E(S), of the entire structure S is then given by:

    E(S) = sum of e(ri, rj) over all base pairs (i, j) in S

Reasonable values of e at 37 °C are -3, -2 and -1 kcal/mole for GC, AU and GU base pairs,
respectively. Unfortunately, such simple minded rules are insufficient to capture
the destabilizing effects of various loops, or the nearest neighbor interactions in helices
 and loops. More sophistication is required.
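
As a worked illustration of the simple rule, the sketch below adds up -3, -2 and -1 kcal/mole for each GC, AU and GU pair in a structure given as a list of 0-based (i, j) index pairs. The function name structure_energy and the input format are assumptions made for the example.

```python
# Simple per-pair energies (kcal/mole) quoted in the text above.
PAIR_ENERGY = {
    frozenset("GC"): -3,
    frozenset("AU"): -2,
    frozenset("GU"): -1,
}


def structure_energy(seq, pairs):
    """Sum the pair energies over all (i, j) base pairs of a structure."""
    total = 0
    for i, j in pairs:
        key = frozenset((seq[i], seq[j]))
        if key not in PAIR_ENERGY:
            raise ValueError(f"{seq[i]}-{seq[j]} is not a GC, AU or GU pair")
        total += PAIR_ENERGY[key]
    return total
```

For a hairpin with three G-C pairs this gives -9 kcal/mole, which ignores the destabilizing loop entirely; that is the shortcoming noted above.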

 1.     The energy associated with any position in the structure
        is influenced only by local sequence and structure

 2.    The structure is assumed to be formed by folding that
       does not produce knots
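
Under these two assumptions, and using only the simple per-pair energies (-3, -2, -1), a minimum-energy nested structure can be found by dynamic programming. The sketch below is a Nussinov-style recursion, not the Zuker algorithm: it has no loop or stacking terms. The function name and the min_loop parameter (minimum number of unpaired bases in a hairpin) are assumptions.

```python
# Simple per-pair energies (kcal/mole); an assumption of this sketch.
PAIR_ENERGY = {"GC": -3, "CG": -3, "AU": -2, "UA": -2, "GU": -1, "UG": -1}


def min_energy_fold(seq, min_loop=3):
    """Return (minimum energy, sorted list of 0-based pairs) without knots."""
    n = len(seq)
    if n == 0:
        return 0, []
    # W[i][j] holds the best energy of the subsequence i..j.
    W = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = W[i][j - 1]  # case: base j is unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with k
                e = PAIR_ENERGY.get(seq[k] + seq[j])
                if e is None:
                    continue
                left = W[i][k - 1] if k > i else 0
                best = min(best, left + W[k + 1][j - 1] + e)
            W[i][j] = best
    # Trace back through W to recover one optimal set of pairs.
    pairs = []
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if W[i][j] == W[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            e = PAIR_ENERGY.get(seq[k] + seq[j])
            if e is None:
                continue
            left = W[i][k - 1] if k > i else 0
            if W[i][j] == left + W[k + 1][j - 1] + e:
                pairs.append((k, j))
                if k > i:
                    stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return W[0][n - 1], sorted(pairs)
```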
GLOBAL Energy-Minimization Methods
(Minimum-Free Energy; Maximum Enthalpy)
      Stabilization Energy (kcal/mole)
      Total Free Energy optimality criterion
      Boltzmann function-based optimality criteria
Loops/bulges introduce positive free energy and are destabilizing.

How is Stability Measured?
(A) Base pairs: Stability Introduced by Double-Stranded Regions
          Energy of paired bases are stored in a look-up table; these
vary with temperature
          Energy required by all base-pairs in a structure are summed;
this sum is the cost of the structure
(B) Stacking energies (DH Turner, Rochester)
          Stacking energies are energies added by surrounding bases.
(C) Loops (Loop Destabilizing Energies) - Instability Introduced by
Single-Stranded (Unpaired) Loops- all enthalpic
(D) Branches+ Multibranches
-each base is compared to every other base (similar to dot matrix)
-energy is estimated by nearest-neighbor rule
-complementary regions are evaluated by dynamic programming algorithm
    Energies are determined empirically

                   Energy scoring: base pairing (kcal/mol)

                                      G-C     -3
                                      A-U     -2
                                      G-U     -1

                 Energy scoring: loop penalties (kcal/mol)

                        Size   Internal   Bulge   Hairpin
                         1        -       +3.9       -
                         2      +4.1      +3.1       -
                         3      +4.5      +3.5     +4.5
                         4      +4.9      +4.2     +5.5
                         5      +5.3      +4.8     +4.9
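
To show how the two tables combine, here is a rough sketch that scores a single hairpin: the pair energies are summed and the hairpin loop penalty for the loop size is added. Only the hairpin column (sizes 3 to 5) is used; the function name and the convention that the first and last stem_len bases form the stem are assumptions of the example.

```python
# Values copied from the tables above (kcal/mol).
PAIR_ENERGY = {"GC": -3.0, "CG": -3.0, "AU": -2.0, "UA": -2.0, "GU": -1.0, "UG": -1.0}
HAIRPIN_PENALTY = {3: 4.5, 4: 5.5, 5: 4.9}


def hairpin_energy(seq, stem_len):
    """Energy of a hairpin whose first and last stem_len bases are paired."""
    loop_size = len(seq) - 2 * stem_len
    if loop_size not in HAIRPIN_PENALTY:
        raise ValueError(f"no hairpin penalty tabulated for loop size {loop_size}")
    energy = HAIRPIN_PENALTY[loop_size]  # destabilizing loop term
    for k in range(stem_len):
        pair = seq[k] + seq[-1 - k]
        if pair not in PAIR_ENERGY:
            raise ValueError(f"{pair[0]}-{pair[1]} is not a tabulated base pair")
        energy += PAIR_ENERGY[pair]  # stabilizing pair term
    return energy
```

Three G-C pairs closing a 3-base loop come out at about -9 + 4.5 = -4.5 kcal/mol under these numbers.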

                  Stacking energies for base pairs
Base-pair stacking

• Favorable energies come from base-pair
stacking NOT from formation of base-pairs

• Un-paired bases make hydrogen bonds
with water therefore there is no net change
when they pair

• Favorable interactions come from
electronic interactions between stacked

• Base-pair stacking is the ONLY favorable
energy term in RNA folding
Base comparisons                     Free energy calculations

5’   A       C       G       U       5’      A       C      G      U
A                                    A
C                                    C
G                                    G
U                                    U
-                                    -
-                                    -
G            C/G             U/G     G                             -6.4
C                    G/C             C                      -5.2
G            C/G             U/G     G               -1.8
U    A/U     C/U     G/U             U

           Stacking energies for base pairs (kcal/mole; 37 °C)
• Get some favorable energy even if not hydrogen bonded due to stacking, for
instance for a mismatch next to an A:U

5' AX 3’
3' UY 5'
• RNA folding is implicitly an N^4 algorithm
· N^2 dynamic programming to find the stems
· N^2 dynamic programming to find the best combination

• Zuker algorithm is N^3 due to
approximations in searching for lopsided
internal loops
· Note that very asymmetric internal loops will not be found
with the default settings
    Dynamic Programming Methods
    [Figure: energy matrix W, rows and columns numbered 1 to n, with entries marked for positions i, i+1 and j]


        Use Trace-Back methods

    •Applied by Zuker + Stiegler using Energetics as the

    •StemLoop Program calculates the optimal energies of
      local stems + loops independently; based on inverted
      repeats (ignores internal loops + bulges but mFOLD
      does not).

    •Potential improvement: Determine stacking energies
       in a sliding window (e.g., 15 bp) for all possible
       15-mer ribonucleotide sequences + apply these
      (should account for local interactions)
        Zuker Algorithm
        •Calculation proceeds from center towards edges
        • Includes stacking, bulge, internal, and hairpin loop terms
        • Start from center because the center line is location of hairpin

   Limited number of alternative structures !!!!!!!!!!!!

                Vienna RNA Package

                      -alternative choices

Which regions are more/less predictive?
Reliability of secondary structure prediction

Pnum - total number of energy dots in the i-th row and
          i-th column of the energy dot plot
        - represents the number of base pairs that the i-th
          base can form with all other bases within the
          defined energy range

        - the lower this value-the more well defined the local

Hnum – the sum of Pnum (i) and Pnum (j) less 1 and is the
          total number of dots in the i-th row and j-th column

        - the lower this number-the more well determined
          the double-stranded region

Ssum – the number of foldings in which base i is single-
          stranded divided by m, the number of foldings
        - represents the probability that base i is single-stranded

        ~1-probably single stranded
        ~0-probably not
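
The three measures can be illustrated on a small set of foldings. The sketch below assumes the foldings are given as lists of 0-based (i, j) pairs and treats the union of those pairs as the energy dot plot; a real energy dot plot may contain more dots than a sampled set of foldings. All function names are hypothetical.

```python
def pnum(foldings, n):
    """Pnum(i): number of distinct partners base i has across the foldings."""
    partners = [set() for _ in range(n)]
    for pairs in foldings:
        for i, j in pairs:
            partners[i].add(j)
            partners[j].add(i)
    return [len(p) for p in partners]


def hnum(pnum_values, i, j):
    """Hnum(i, j) = Pnum(i) + Pnum(j) - 1 for a base pair (i, j)."""
    return pnum_values[i] + pnum_values[j] - 1


def ssum(foldings, n):
    """Ssum(i): fraction of the m foldings in which base i is single-stranded."""
    m = len(foldings)
    counts = [0] * n
    for pairs in foldings:
        paired = {x for pair in pairs for x in pair}
        for i in range(n):
            if i not in paired:
                counts[i] += 1
    return [c / m for c in counts]
```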
 MFold : Mfold+PlotFold
 predicts optimal and suboptimal secondary structures for an RNA or
 DNA molecule using the most recent energy minimization method of
    MFold calculates energy matrices that determine all optimal and
    suboptimal secondary structures for an RNA or DNA molecule.

     The program writes these energy matrices to an output file. A
     companion program, PlotFold, reads this output file and displays
     a representative set of optimal and suboptimal secondary
     structures for the molecule within any increment of the computed
     minimum free energy you choose.

    You can choose any of several different graphic representations for
    displaying the secondary structures in PlotFold.

    P-Num Plot
This plot shows the amount of variability in pairing
 at each position in the sequence in all predicted
foldings within the increment of the optimal folding
energy you specify.
                                                           Squiggles Plot
                                  The squiggles plot is a representation similar to what you
                                   might draw by hand; that is, bonds formed between bases are
                                   drawn as chords. Bases are shown participating in stems, as
                                   well as in hairpin, bulge, interior, and multibranched loops.
Lower left to upper right diagonals; free energy encoded by colors (dark is most optimal).
Note that some short-cut algorithms will not explore all possible structures but instead will ignore the
'blank' areas in the biplot.

                                                  Once structures are predicted they can be compared
                                                  using Structure Dot Plot:

                                                        Structure plots summarize the Commonalities
                                                        between two Predicted Structures (in this case
                                                        the top two structures).

        • do not compute all the structures within a given energy range of
          the minimum free-energy structure
  Vienna RNA Package 1.4

• three kinds of dynamic programming algorithms for structure

    1-the minimum free energy algorithm of (Zuker & Stiegler 1981)
          which yields a single optimal structure
    2-the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities
          in the thermodynamic ensemble
    3-suboptimal folding algorithm of (Wuchty 1999) which generates
          all suboptimal structures within a given energy range of the
          optimal energy.

• For secondary structure comparison, the package contains several
measures of distance (dissimilarities) using either string alignment or tree-
editing (Shapiro & Zhang 1990).
• Finally, an algorithm is provided to design sequences with a predefined
structure (inverse folding).

    RNAfold -- predict minimum energy secondary structures and pair
    RNAeval -- evaluate energy of RNA secondary structures
    RNAheat -- calculate the specific heat (melting curve) of an RNA
    RNAinverse -- inverse fold (design) sequences with predefined
    RNAdistance -- compare secondary structures
    RNApdist -- compare base pair probabilities
    RNAsubopt -- complete suboptimal folding
Minimum free energy structure and base pair
probabilities for the Sarcin loop of 23S ribosomal
RNA, as predicted by the RNAfold program.
•Biological RNAs (with important structure) are
difficult to distinguish from random RNAs

· Same number and length of stems and loops
· Same GC content of stems
· Same predicted free energy

•Biologically important structures are exceptional in
lacking competing structures

· this ensures that the structure will be present regardless of
the net ΔG

• PNUM plot shows number of alternative
structures within energy increment
• Agrees well with phylogenetic predictions, but most
effective for large molecules
Sequence Covariation Methods (non-independent changes)
determined by comparing sequences among species. Joint substitutions that are
seen may reflect sites paired in the structure. Improves structure prediction by
Dynamic Programming Methods

•   for double-stranded regions in RNA molecules, sequence changes that take place
    in evolution should maintain the base pairing

•   sequence changes in loops and single-stranded regions should not have such a

You are looking for sequence positions at which covariation

                maintains the base-pairing properties

Input-group of related sequences
          Seq 1----------------G-------------C---------

          Seq 2----------------C-------------G---------

          Seq 3----------------A-------------C---------

          Seq 4----------------A-------------T---------

      GC                 CG                 AC                AT
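
The example above can be checked mechanically. The sketch below takes two alignment columns and reports which sequences keep a canonical pair and how many different pair types occur; T is treated as U so that DNA-style input such as Seq 4 works. The function name and return format are assumptions.

```python
# Canonical pairs written as two-letter strings; an assumption of this sketch.
CAN_PAIR = {"GC", "CG", "AU", "UA", "GU", "UG"}


def column_pair_support(col_i, col_j):
    """Return (per-sequence consistency flags, set of pair types that can pair)."""
    pairs = [(a + b).upper().replace("T", "U") for a, b in zip(col_i, col_j)]
    consistent = [p in CAN_PAIR for p in pairs]
    pair_types = {p for p, ok in zip(pairs, consistent) if ok}
    return consistent, pair_types


if __name__ == "__main__":
    # Columns taken from Seq 1 to Seq 4 above: G/C, C/G, A/C, A/T.
    flags, types = column_pair_support("GCAA", "CGCT")
    print(flags, sorted(types))
```

Several different pair types at the same two positions (here GC, CG and AU) are the covariation signal; the A/C combination in Seq 3 does not support the pair.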


• secondary structure prediction in RNA takes
into account conserved patterns of basepairing

• Positions of covariance are conserved
matches, since they maintain the secondary

• computationally challenging
   Eddy & Durbin (1994) – formal covariance model

           • slow
           •unsuitable for searching through large genomes
           •usually use information from already existing RNA
           secondary structure
           •How to discover this information??????

           Construct a more general model
           Train the model
           Discover the most likely base-paired regions

Similarity with HMMs

Mutual information content M superimposed on the information content of
each sequence position in an RNA alignment
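
For reference, the mutual information between two alignment columns i and j is commonly computed as M(i,j) = sum over x,y of f_ij(x,y) * log2( f_ij(x,y) / (f_i(x) * f_j(y)) ), where the f are observed frequencies. A minimal sketch, assuming gap-free columns of equal length (the function name is hypothetical):

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (x, y), count in f_ij.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return mi
```

Two columns that vary together perfectly over all four bases give the maximum of 2 bits, while a fully conserved pair of columns gives 0, which is why conserved stems show no covariation signal.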

Phylogeny based prediction

• Inference of structure from covariance or mutual
information depends on having the correct alignment
• Correct alignment depends on knowing the correct
• Can only find common structures, not structures
        unique to a molecule
•Can, in principle, detect pseudoknots
Interaction among base pairs   versus Context-free grammar
                        Stochastic context-free grammars


  Terminal symbols ACGU
  Nonterminal symbols S0, S1, S2, S3,……..

COVE is an implementation of stochastic context free
grammar methods for RNA sequence/structure
                RNA world


       RNA Secondary Structure Prediction at
          Belozersky Institute, Moscow
       RNA-specifying genes


-identifies transfer RNA genes in genomic DNA or RNA sequences.
-specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994)

- speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991)

- implementation of an algorithm described by Pavesi and colleagues (1994) which
searches for eukaryotic pol III tRNA promoters (our implementation referred to as

- tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence.
These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA
prediction. In this way, tRNAscan-SE attains the best of both worlds:

- a false positive rate of less than one per 15 billion nucleotides of random sequence

- the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs)

- search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3
(tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 650-fold increase in speed, and a
fast C implementation of the Pavesi et al. algorithm).

published in Lowe & Eddy, Nucleic Acids Research 25: 955-964 (1997).

                                                                                             NCBI CP000030
       Automatic detection of conserved RNA structure elements
             in complete RNA virus genomes
                                                                          Nucleic Acids Research, 1998, Vol. 26, No. 16
a new method for detecting conserved RNA secondary structures in a family of related RNA sequences.
Method is based on a combination of thermodynamic structure prediction and phylogenetic comparison.
In contrast to purely phylogenetic methods, our algorithm can be used for small data sets
of ~10 sequences, efficiently exploiting the information contained in the sequence variability.

(i)       Distant groups of RNA viruses have very little or no detectable sequence homology and often very
          different genomic organization
(ii)      RNA viruses show an extremely high mutation rate, of the order of 10^-5 to 10^-3 mutations per
          nucleotide and replication.
(iii)     Due to the high sequence variation, the application of classical methods of sequence analysis
          is, therefore, difficult or outright impossible.
(iv)      The high mutation rate of RNA viruses also explains their short genomes, of less than ~20 000
          nt. A large number of complete genomic sequences is available in databases. The non-coding
          regions are most likely functionally important, since the high selection pressure acting on viral
          replication rates makes 'junk RNA' very unlikely.

      RNA secondary structures are predicted as minimum energy structures by means of dynamic
      programming techniques. An efficient implementation of this algorithm is part of the Vienna RNA
Sequences are aligned using a standard multiple alignment procedure. Secondary structures for each sequence
are predicted and gaps are inserted based on the sequence alignment. The resulting aligned structures can be
represented as aligned mountain plots. From the aligned structures consistently predicted base pairs are
identified. The alignment is used to identify compensatory mutations that support base pairs and inconsistent
 mutants that contradict pairs. This information is used to rank proposed base pairs by their credibility and to
filter the original list of predicted pairs.
 Aligned mountain representations m(k) of the RNA secondary structure of 13 complete HCV genomes.
 Peaks and plateaux in the mountain representation correspond to hairpins and unpaired regions in
 the secondary structure.
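
A mountain representation can be computed directly from a list of base pairs. The sketch below uses one common convention, taken here as an assumption: m(k) is the number of base pairs (i, j) that strictly enclose position k (i < k < j). The function name and the 0-based indexing are choices made for the example.

```python
def mountain(n, pairs):
    """Return m(k) for k = 0..n-1: the number of pairs (i, j) with i < k < j."""
    # Difference array: +1 just after each opening base, -1 at each closing base.
    diff = [0] * (n + 1)
    for i, j in pairs:
        diff[i + 1] += 1
        diff[j] -= 1
    heights = []
    level = 0
    for k in range(n):
        level += diff[k]
        heights.append(level)
    return heights


if __name__ == "__main__":
    # Hairpin with three pairs closing a three-base loop.
    print(mountain(9, [(0, 8), (1, 7), (2, 6)]))
```

Stems appear as slopes, and runs of unpaired bases appear as flat stretches at the height set by the pairs that enclose them.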

Colors indicate the number of consistent
mutations: red 1, yellow 2 and green 3 different
types of base pairs. These saturated colors indicate
that there are only compatible sequences. Decreasing
saturation of the colors indicates an increasing number
of non-compatible sequences:

Comparison of predicted minimum energy structures in region A (around position 8000) of the HCV genome.
The lower left part of the plot shows a conventional picture of the predicted structure. Base pairs marked in
green have non-consistent mutations, circles indicate compensatory mutations. The extended outer stem
contains a number of compensatory mutations supporting its existence.
The TAR structure of HIV-1. Almost all predicted base pairs are consistent with all 13 sequences, most
of them are predicted in at least 11 sequences. A large number of compensatory mutations supports the
thermodynamic predictions. Our computed consensus structure (lower left) matches the structure
determined by probing and phylogenetic reconstruction (4). We display here the consensus dot plot,
the classical secondary structure and a mountain representation. The latter is a convenient alternative
 to dot plots for larger structural motifs. Base pairs are represented by slabs connecting the two sequence
positions. The width and color of a slab corresponds to size and color of the corresponding dot plot entry.
Consensus structures of the HIV-1 RRE region from a set of 13 sequences and from the 21 sequences
Primary Structure of RNA

e.g., Human tRNAgene for Methionine
>gi|1181147|emb|Z69292.1|HSC6TRNAM H.sapiens tRNA-Met gene

