Docstoc

Molecular Evolution of Pasteurella multocida During Vaccine

Document Sample
Molecular Evolution of Pasteurella multocida During Vaccine Powered By Docstoc
					Mathematics and computation behind
      BLAST and FASTA




               Xuhua Xia
            xxia@uottawa.ca
       http://dambe.bio.uottawa.ca
              Bioinformatics-enabled research
Sequence variation:
                             Difference in             Difference in
UUCUCAACCAACCAUAAAGAUAU
UUCUCUACAAACCACAAAGACAU      1. coding sequences       1. protein abundance
UUCUCAACCAACCAUAAAGAUAU      2. Regulatory sequences   2. protein structure
UUCUCAACCAACCACAAAGACAU      3. transcription          3. cellular localization
UUCUCCACGAACCACAAAGAUAU
                             4. splicing               4. protein interaction
UUCUCUACAAACCACAAAGAUAU
                             5. translation               partners
UUCUCAACCAACCACAAAGACAU
                             6. translated sequences
UUCUCUACUAACCACAAAGACAU



                           Difference in               Difference in biochemical
                                                             function
                           1. susceptibility to
                              diseases
                           2. response to medicine
 Personalized medicine                                   Difference in phenotype
                           3. Fitness (survival and
 Conservation strategies      reproductive success)      1. morphological
 Evolutionary mechanisms                                 2. physiological
 ...                               Nurturing             3. behavioural
                                   environment
                                                                       Slide 2
             Bioinformatics research workflow
Accumulation of nucleotide and                                                                Functional genomics
amino acid sequences:                                                                         Systems biology
UUCUCAACCAACCAUAAAGAUAU                             Storage and annotation of                 Digital cells
UUCUCUACAAACCACAAAGACAU                             the sequences
UUCUCAACCAACCAUAAAGAUAU                             1.   Structural annotation
UUCUCAACCAACCACAAAGACAU
                                                         with homology search
                                                                                              Species-specific gene
                                                         and de novo gene
UUCUCCACGAACCACAAAGAUAU                                                                       dictionaries, e.g.,
                                                         prediction
UUCUCUACAAACCACAAAGAUAU
                                                                                              yeastgenome.org
                                                    2.   Functional annotation
UUCUCAACCAACCACAAAGACAU                                  with gene ontologies
UUCUCUACUAACCACAAAGACAU




                                                                                 1.   Gene/Protein families (e.g.,
                                 1.   Comparative genomics (the                       Pfam)
                                      origin of new genes, new                   2.   Cluster of orthologous genes
 Mutation                             features and new species)                       (e.g., COG)
 Selection                       2.   Phylogenetics (cladogenic                  3.   Supermatrix of gene
                                      process, dating of speciation                   presence/absence
 Adaptation                                                                      4.   Genome-based pair-wise
                                      and gene duplication events)
                                                                                      distance distributions
                                 3.   Phylogeny-based inference.
                                                                                                      Slide 3
                Why string matching?
• Efficient search against large sequence databases
• Practical significance from early applications
   – Sequence similarity between an oncogene (genes in viruses that cause
     a cancer-like transformation of the infected cells), v-sis, and the
     platelet-derived growth factor (PDGF)
       • M. D. Waterfield et al. 1983. Nature 304:35-39
       • R. F. Doolittle et al. 1983. Science 221:275-227
   – Contig assembly
   – Functional annotation by homology search

• Fast computational methods in string matching
   – FASTA
   – BLAST
   – Local pair-wise alignment by dynamic programming


                                                                   Slide 4
       Basic stats in string matching
• Given PA, PC, PG, PT in a target (database) sequence, the
  probability of a query sequence, say, ATTGCC, having a
  perfect match of the target sequence is:
  prob = PAPTPT PGPCPC = PA (PC)2 PG (PT)2

• Let M be the target sequence length and N be the query
  sequence length, the “matching operation” can be performed
  (M – N +1) times, e.g.,
  Query: ATG
  Target CGATTGCCCG

• The probability distribution of the number of matches follows
  (approximately) a binomial distribution with p = prob and n =
  (M – N +1)

                                                         Slide 5
         Basic stats in string matching
• Probability of having a sequence match: p
• Probability of having no match: q = 1-p
• Binomial distribution:
                           n!                          n!
   ( p  q) n  p n               p n 1q  ...               p n  x q x  ...  q n
                        (n  1)!1!                 (n  x)! x !
• When np > 50, the binomial distribution can be approximated
  by the normal distribution with the mean = np and variance =
  npq                    
                           ( x )               2
                              1
              P( x)             e        2 2

                             2
• When np < 1 and n is very large, binomial distribution can be
  approximated by the Poisson distribution with mean and
  variance equal to np (i.e.,  = 2 = np).
           e   x
   P( x) 
              x!
                                                                                          Slide 6
                 From Binomial to Poisson
                        n!                          n!                             n!
( p  q) n  p n               p n 1q  ...               p n  x q x ...               p x q n  x  ...  q n
                     (n  1)!1!                 (n  x)! x !                   (n  x)! x !


P ( n)  p n                                    P( x) 
                                                             n!
                                                                      p x q n x
                                                        (n  x)! x !
P (n  1)  np n 1q
                                                      n!
                  n!                                          p xq xqn
P(n  x)                  p n x q x             (n  x)! x !
             (n  x)! x !                                                                    x
                                                  n(n  1)(n  2)...(n  x  1)  p 
             n!                                                                    (1  p ) n
P( x)                p x q n x                               x!               q
         (n  x)! x !
                                                  n x x  np n x p x  np (np ) x  np         
                                                                                                   x

P (0)  q n                                         pe              e             e e
                                                  x!             x!             x!               x!



                                                                                                    Slide 7
 Matching two sequences without gap
• Assuming equal nucleotide frequencies, the probability of a
  nucleotide site in the query sequence matching a site in the
  target sequence is p = 0.25.
• The probability of finding an exact match of L letters is a = pL
  = 0.25L = 2-2L = 2-S, where S is called the bit score in BLAST.
• M: query length; N: target length, e.g., M = 8, N = 5, L = 3
  AACGGTTC
  CGGTT
• A sequence of length L can move at (M – L +1) distinct sites
  along the query and (N – L +1) distinct sites along the target.
• m = (M-L+1) and n = (N-L+1) are called effective lengths of
  the two sequences.
• The expected number of matches with length L is mn2-S,
  which is called E-value in ungapped BLAST.
• S is calculated differently in the gapped BLAST
                                                            Slide 8
                  Blast Output (Nuc. Seq.)
BLASTN 2.2.4 [Aug-26-2002]
...
Query= Seq1 38
Database: MgCDS
           480 sequences; 526,317 total letters
                                                       Score    E
Sequences producing significant alignments:            (bits) Value
MG001 1095 bases                                        34   7e-004
 Score = 34.2 bits (17), Expect = 7e-004
 Identities = 35/40 (87%), Gaps = 2/40 (5%)
                                                                                 Typically one would
Query: 1   atgaataacg--attatttccaacgacaaaacaaaaccac 38
                                                                                 count only 1 GE here.
           |||||||||| ||||||||||| |||||| ||||||||
Sbjct: 1   atgaataacgttattatttccaataacaaaataaaaccac 40
                                              Matches: 35*1 = 35
Lambda        K        H
                                              Mismatches: 3*(-3) = -9
  1.37    0.711     1.31
                                              Gap Open: 1*5 = 5
Matrix: blastn matrix:1 -3
                                              Gap extension: 2*2 =4
Gap Penalties: Existence: 5, Extension: 2
                                              R = 35 - 9 - 5 - 4 = 17
…
                                              S = [λR – ln(K)]/ln(2) =[1.37*17-ln(0.711)]/ln(2) = 34
effective length of query: 26
                                              E = mn2-S = 26 * 520557 * 2-34 = 7.878E-04
effective length of database: 520,557
                                              x             p(x)
                       e E ( E ) x           0             0.999265217
              p ( x)                         1             0.000734513
                           x!                 2             0.000000270
                                                                                      Slide 9
                                              3             0.000000000
              E-Value in BLAST
• The e-value is the expected number of random
  matches that is equally good or better than the
  reported match. It can be a number near zero or much
  larger than 1.
• It is NOT the probability of finding the reported
  match.
• Only when the e-value is extremely small can it be
  interpreted as the probability of finding 1 match that
  is as good as the reported one (see next slide).


                                                   Slide 10
                            E-value and P(1)
                      e E ( E ) x
             p ( x) 
                          x!
              p(1)  e E E  E ( when E  0)

        1
       0.9
       0.8
       0.7
       0.6
P(1)




       0.5
       0.4
       0.3
       0.2
       0.1
         0
          0.00               0.20    0.40             0.60   0.80   1.00
                                            E-value


                                                                    Slide 11
                    Gapped BLAST
• Adapted from Crane & Raymer 2003
• Input sequence: AILVPTVIGCTVPT
• Algorithm:
   – Break the query sequence into words:
     AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT,
     GCTV, CTVP, TVPT
   – Discard common words (i.e., words made entirely of common amino
     acids)
   – Search for matches against database sequences, assess significance and
     decide whether to discard to continue with extension using dynamic
     programming:
                          AILVPTVIGCTVPT
     MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC


                                                                   Slide 12
                      BLAST Programs
Program     Database     Query        Typical Uses
BLASTN/ME   Nucleotide   Nucleotide   MEGABLAST has longer word size than
GABLAST                               BLASTN
BLASTP      Protein      Protein      Query a protein/peptide against a protein
                                      database.
BLASTX      Protein      Nucleotide   Translate a nuc sequence into a “protein”
                                      in six frames and search against a protein
                                      database
TBLASTN     Nucleotide   Protein      Unannotated nuc sequences (e.g., ESTs)
                                      are translated in six frames against which
                                      the query protein is searched
TBLASTX     Nucleotide   Nucleotide   6-frame translation of both query and
                                      database
PHI-BLAST   Protein      Protein      Pattern-hit iterated BLAST
PSI-BLAST   Protein      Protein      Position-specific iterated BLAST
RPS-BLAST   Protein      Protein      Reverse PSI-BLAST
                                                                     Slide 13
                      FASTA
• Another commonly used family of alignment and
  search tools
• Generally considered to be more sensitive than
  BLAST.
• Illustration with two fictitious sequences used in the
  Contig Assembly lecture:
  Seq1: ACCGCGATGACGAATA
  Seq2: GAATACGACTGACGATGGA

  Seq1: ACCGCGATGACGAATA
  Seq2:            GAATACGACTGACGATGGA

                                                   Slide 14
                      String Match in FASTA
          1   2   3   4     5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  Left   Right
Query     A   C   C   G     C G A T G A C G A A T A                 Move N Move N
Target    G   A   A   T     A C G A C T G A C G A T G G A              -1 3    1 6
                                                                       -2 5    2 7
          A C G    T                                                   -3 1    3 3
          1  2  4  8                                                   -4 3    4 3
          7  3  6 15                                                   -5 7    5 6
         10  5  9                                                      -6 1    6 3
         13 11 12                                                      -7 1    7 3
         14                                                            -8 4    8 5
         16                                                            -9 1    9 2
                                                                      -10 1   10 2
           1   2   3  4     5 6 7 8 9 10 11 12 13 14 15 16 17 18 19   -11 5   11 3
           G A A      T     A C G A C T G A C G A T G G A             -12 1   12 2
          -3   1   2 -4     4 4 3 7 7 2 7 11 11 10 14 8 13 14 18      -13 1   13 1
          -5 -5 -4 -11     -2 3 1 1 6 -5 5 5 10 8 8 1 11 12 12        -14 1   14 2
          -8 -8 -7         -5 1 -2 -2 4    2 2 8 5 5        8 9 9     -15 0   15 0
         -11 -11 -10       -8 -5 -5 -5 -2 -1 -1 2 2 2       5 6 6             16 0
             -12 -11       -9       -6       -2       1           5           17 0
             -14 -13      -11       -8       -4      -1           3           18 1
                  Left and Right: -n means moving the query left by n
                  sites and n means moving the query right by n sites.    Slide 15
         Alternative Matched Strings
 Forw.  Back
Move N Move N   Query: ACCGCGATGACGAATA
   -1 3   1 6   Target:GAATACGACTGACGATGGA
   -2 5   2 7
   -3 1   3 3   From lecture on contig assembly:
   -4 3   4 3                                         One of the
   -5 7   5 6                                         three 3rd best
                Query: ACCGCGATGACGAATA
   -6 1   6 3   Target:           GAATACGACTGACGATGGA
   -7 1   7 3
   -8 4   8 5
                From FASTA algorithm:
   -9 1   9 2
  -10 1  10 2                                      Best
                Query:    ACCGCGATGACGAATA
  -11 5  11 3
  -12 1  12 2
                Target: GAATACGACTGACGATGGA
  -13 1  13 1                                             2nd best
  -14 1  14 2   Query: ACCGCGATGACGAATA
  -15 0  15 0   Target:     GAATACGACTGACGATGGA
         16 0
         17 0   Which one is best based on YOUR judgment?
         18 1
                                                           Slide 16
                              Word length of 2
          1   2   3   4   5   6   7   8   9   10   11 12   13   14 15 16   17   18  Left        Right
Query     A   C   C   G   C   G   A   T   G    A    C G     A    A T A             Move    N   Move N
Target    G   A   A   T   A   C   G   A   C    T    G A     C    G A T       G G A    -1   1       1 3
                                                                                      -2   2       2 5
         AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT                              -3   0       3 1
         13  1     7     2   3     6 4       15     8                                 -4   1       4 1
            10    14         5     9                                                  -5   4       5 2
                            11    12                                                  -6   0       6 1
                                                                                      -7   0       7 1
                                                                                      -8   1       8 4
                                                                                      -9   0       9 1
                                                                                     -10   0      10 1
           1   2   3  4  5  6  7  8 9 10 11 12 13 14 15 16                  17 18    -11   4      11 1
         GA AA AT TA AC CG GA AC CT TG GA AC CG GA AT TG                   GG GA     -12   0      12 1
          -5 -11 -4 -11  4  3  1  7    2  5 11 10  8 8 8                       12    -13   0      13 0
          -8     -11    -5  1 -2 -2       2  2  8  5 1                          9    -14   0      14 0
         -11               -5 -5         -1     2  2                            6                 15 0
                                                                                                  16 0
                                                                                                  17 0
Query: ACCGCGATGACGAATA
                                                                      One of the
Target:           GAATACGACTGACGATGGA
                                                                      three 2nd best
Query:    ACCGCGATGACGAATA
Target: GAATACGACTGACGATGGA                        Best
                                                                                       Slide 17
   Comparison: BLAST and FASTA
• BLAST starts with exact string matching, while
  FASTA starts with inexact string matching (or exact
  string matching with a shorter words). BLAST is
  faster than FASTA.
• For the examples given, both BLAST and FASTA
  will find the same best match, i.e., shifting the query
  sequence by 2 sites to the right.
• Both perform dynamic programming for extending
  the match after the initial match.


                                                    Slide 18

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:30
posted:8/19/2012
language:English
pages:18