Sequence Alignments

Alignments, Matrices & Markov Models
Chris Bailey
Bacterial Pathogenesis & Genomics Unit
cmb036@bham.ac.uk
The Matrix
 • When analysing nucleotide sequences:
   – Nucleotides match (G=G) …
   – Or they don’t (G≠C)
 • One nucleotide substitution is no more
   relevant than another
 • (Except in the light of what that nucleotide
   sequence ends up coding for)
The Matrix
 • But in proteins:
   – Some amino acids are readily substitutable (e.g.
     one hydrophobic residue for another)
   – And others are badly tolerated (e.g. a
     hydrophobic residue for a charged residue)
 • Assuming you want the 2º & 3º structure to
   be the same
Accommodating Change
 • So we adapt our scoring function (s)
 • So that scores for matches and mismatches
   take account of amino acid substitutability
 • We do this using protein substitution
   matrices
What’s a Matrix
 • Lookup grid (20×20), with each row/column
   containing an amino acid

                 Cys  Gly  Pro  Ser  …   Trp
           Cys
           Gly
           Pro
           Ser
           …
           Trp
What’s a Matrix
 • Values in the matrix are scores of one
   amino acid vs another
 • For example:
   – Big positive score: amino acid 1 is readily
     substitutable for amino acid 2
   – Big negative score: amino acid 1 is not
     substitutable for amino acid 2
Part of a Matrix

              Cys   Gly   Pro   Ser   Ala   Trp
        Cys   12    -3    -3    0     -2    -8
        Gly   -3    5     -1    1     1     -7
        Pro   -3    -1    6     1     1     -6
        Ser   0     1     1     2     1     -2
        Ala   -2    1     1     1     2     -6
        Trp   -8    -7    -6    -2    -6    17
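In code, a substitution matrix is just a lookup table. A minimal sketch, storing a few values from the excerpt above in a Python dict (the `score` helper is hypothetical, not part of any alignment library):

```python
# A few entries from the PAM250 excerpt above, stored as a lookup table.
# Only one orientation of each pair is stored, because the matrix is symmetric.
SUBMAT = {
    ("Cys", "Cys"): 12, ("Cys", "Gly"): -3, ("Cys", "Pro"): -3,
    ("Gly", "Gly"): 5,  ("Gly", "Pro"): -1,
    ("Pro", "Pro"): 6,  ("Trp", "Trp"): 17,
}

def score(a, b, matrix=SUBMAT):
    # Try (a, b) first, then fall back to the symmetric entry (b, a).
    return matrix.get((a, b), matrix.get((b, a)))
```

Looking up `score("Gly", "Cys")` finds the stored `("Cys", "Gly")` entry via the symmetric fallback.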
How do we fill the table?
 • Obvious what the values should represent
   (protein similarity)
 • But how do we calculate the actual numbers?
 • Since we are looking for homology, an
   evolutionary model is a good place to start
Why look at evolution?
 • Homology is a binary property
 • Similarity ≠ Homology
 • Homology indicates two proteins had a
   common ancestor
The PAM Matrix
 • PAM = Point Accepted Mutation
 • Construct 71 phylogenetic trees of protein
   families
 • Observe amino acid substitutions on each
   branch of tree
 • Also need probability of occurrence for each
   amino acid (pa)
The PAM Matrix
 • Using the substitution data, calculate fab, the
   observed frequency of the mutation a ↔ b
 • Also note that fab = fba

 • Using this information, calculate fa, the total
   number of mutations in which a is involved:

                 fa = Σ(b≠a) fab
The PAM Matrix
 • And also calculate f, the total occurrences of
   amino acid substitutions:

                 f = Σa fa

 • From here we go on to calculate relative
   mutability:

                 ma = fa / (100 · f · pa)
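These counts are straightforward to compute. A minimal sketch with a toy three-letter alphabet; the pair counts and background frequencies are invented for illustration, not Dayhoff's data:

```python
# Hypothetical substitution counts f_ab observed across tree branches.
# f_ab = f_ba, so each unordered pair is stored once.
f_pairs = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}
p = {"A": 0.4, "B": 0.35, "C": 0.25}    # assumed background frequencies p_a

def f_of(a):
    # f_a: total number of mutations in which a is involved.
    return sum(n for pair, n in f_pairs.items() if a in pair)

# f: total occurrences of amino acid substitutions, f = sum_a f_a.
f_total = sum(f_of(a) for a in p)

def mutability(a):
    # m_a = f_a / (100 * f * p_a), the relative mutability.
    return f_of(a) / (100 * f_total * p[a])
```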
The PAM Matrix
 • Relative mutability: Probability that a given
   amino acid will change in the evolutionary
   period of interest
 • Now we calculate the matrix …
The PAM Matrix
 • 20 x 20 Matrix where Mab is the probability
   of amino acid a changing into amino acid b
 • Maa = 1 – ma
 • Mab is more complicated & requires
   conditional probability
   – E.g. P(A and B) = P(A)∙P(B|A)
The PAM Matrix
 • In this case:

         Mab = P(a → b | a changed) · P(a changed)

 • Or:

         Mab = (fab / fa) · ma
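Putting the two formulas together, a sketch of the M matrix on a toy three-letter alphabet; the counts and mutabilities are invented for illustration:

```python
# Hypothetical substitution counts and relative mutabilities.
f_pairs = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}
m = {"A": 0.01, "B": 0.02, "C": 0.015}   # assumed m_a values

def f_of(a):
    # f_a: total mutations involving a (pairs are symmetric).
    return sum(n for pair, n in f_pairs.items() if a in pair)

def M(a, b):
    # Diagonal: M_aa = 1 - m_a (probability a does not change).
    if a == b:
        return 1 - m[a]
    # Off-diagonal: M_ab = (f_ab / f_a) * m_a
    f_ab = f_pairs.get((a, b), f_pairs.get((b, a), 0))
    return (f_ab / f_of(a)) * m[a]
```

Note that each row of M sums to 1, as it must for a transition matrix.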
The PAM Matrix
 • These equations allow us to calculate a
   PAM1 matrix
 • The number after PAM is the number of
   amino acid substitutions per 100 residues:
   – PAM40 – 40 substitutions per 100 residues
   – PAM250 – 250 substitutions per 100 residues


 • All higher matrices are calculated by repeated
   multiplication of the PAM1 matrix
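The extrapolation is just matrix self-multiplication: PAMn is PAM1 raised to the nth power. A sketch with a toy 2-state matrix standing in for the real 20×20 PAM1:

```python
def matmul(X, Y):
    # Plain nested-loop matrix product (fine for a 20x20 matrix).
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n, pam1):
    # PAMn = PAM1 multiplied by itself n times.
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result

PAM1 = [[0.99, 0.01],    # toy 2-state stand-in, not real PAM1 values
        [0.02, 0.98]]
PAM2 = pam(2, PAM1)
```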
The PAM matrix
 • The final scores in a PAM matrix are
   expressed as a lod (logarithm of odds) score
 • Compare the probability of mutation vs the
   probability of random occurrence
 • Gives the odds ratio: Mab / pb
 • The scoring matrix S is calculated by:

                 Sab = 10 log10(Mab / pb)
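The scoring step can be sketched as a one-liner, rounding to the nearest integer as published PAM tables do:

```python
import math

def lod_score(M_ab, p_b):
    # S_ab = 10 * log10(M_ab / p_b): positive when the substitution is
    # more likely than chance, negative when it is less likely.
    return round(10 * math.log10(M_ab / p_b))
```

For example, a substitution four times more likely than chance scores 10·log10(4), which rounds to 6.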
The full PAM 250 matrix
   C   12
   S    0   2
   T   -2   1    3
   P   -3   1    0    6
   A   -2   1    1    1    2
   G   -3   1    0    -1   1    5
   N   -4   1    0    -1   0    0    2
   D   -5   0    0    -1   0    1    2    4
   E   -5   0    0    -1   0    0    1    3    4
   Q   -5   -1   -1   0    0    -1   1    2    2    4
   H   -3   -1   -1   0    -1   -2   2    1    1    3    6
   R   -4   0    -1   0    -2   -3   0    -1   -1   2    2    8
    K   -5   0    0    -1   -1   -2   1    0    0    2    0    3    5
   M   -5   -2   -1   -2   -1   -3   -2   -4   -2   -2   -2   0    0    6
   I   -2   -1   0    -2   -1   -3   -2   -3   -2   -3   -2   -2   -2   2    5
   L   -8   -3   -2   -3   -2   -4   -3   -4   -3   -3   -2   -3   -3   4    2    8
   V   -2   -1   0    -1   0    -1   -2   -3   -2   -3   -2   -2   -2   2    4    2    4
   F   -4   -3   -3   -5   -4   -5   -4   -6   -5   -5   -2   -4   -5   0    1    2    -1   9
   Y    0   -3   -3   -5   -3   -5   -2   -4   -4   -4   0    -4   -4   -2   -1   -1   -2   7   10
   W   -8   -2   -5   -6   -6   -7   -4   -7   -7   -5   -3   2    -3   -4   -5   -2   -6   0    0   17
       C    S    T    P    A    G    N    D    E    Q    H    R    K    M     I   L    V    F   Y    W
The BLOSUM Matrix
 • Derived using similar mathematical
   principles as a PAM matrix
 • However substitution data is derived in a
   different manner
The BLOSUM Matrix
 • Substitution data comes from multiple
   alignments.
 • Straightforward count of the number of
   substitutions in the alignment
 • The number after BLOSUM (e.g. 62) denotes
   the minimum level (as a percentage) of similarity
   between sequences within the alignments.
PAM Vs BLOSUM
•   PAM100 ↔ BLOSUM90
•   PAM120 ↔ BLOSUM80
•   PAM160 ↔ BLOSUM60
•   PAM200 ↔ BLOSUM52
•   PAM250 ↔ BLOSUM45
BLAST
• The Basic Local Alignment Search Tool
• It is a heuristic (i.e. it does not guarantee
  optimal results)
• Why …
BLAST
• GenBank currently stores about 10¹⁰
  residues of protein data.
• Trying to form alignments against such a
  huge database is unfeasible (even with vast
  computing power)
• So we need a shortcut
How BLAST Works
• 3 steps:
  – Compile a list of high-scoring words
  – Search for hits – hits give seeds
  – Extend seeds
Step 1 - Preprocessing
 • BLAST creates a series of words, of length
   W, where:
 • W is 2–4 for proteins (typically 3)
 • W is >10 for DNA (typically 11)
 • These words are based on subsequences
   of the query (Q)
Step 1 - Preprocessing
 • Get each subsequence of Q, that is of
   length W
 • E.g.
           LVNRKPVVP
           LVN
            VNR
             NRK
              RKP
               etc…
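The step above is a simple sliding window; a sketch with W = 3, as is typical for proteins:

```python
def words(query, W=3):
    # Every subsequence of length W, one starting at each position.
    return [query[i:i + W] for i in range(len(query) - W + 1)]
```

`words("LVNRKPVVP")` yields `["LVN", "VNR", "NRK", "RKP", ...]` — seven words for a nine-residue query.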
Step 1 - Preprocessing
 •   For each of these words, find similar words
 •   How does BLAST define similarity?
 •   It uses scoring matrices to score 2 words
 •   E.g.
        W1  R  K  P
        W2  R  R  P
            9 -1  7    Total = 15
Step 1 - Preprocessing
 • Words are similar if their score is greater
   than a value T (T = 12 usually)
 • For RKP the following are examples of high
   scoring words
   – QKP, KKP, RQP, REP, RRP, RKP
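A sketch of this neighbourhood step. The pair scores below are chosen to reproduce the RKP vs RRP example (9, -1, 7) and are not taken from a real matrix; unlisted pairs default to -2:

```python
# Hypothetical pair scores (symmetric); unknown pairs default to -2.
SCORES = {("R", "R"): 9, ("K", "K"): 8, ("K", "R"): -1,
          ("P", "P"): 7, ("K", "Q"): 1}

def pair_score(a, b):
    return SCORES.get((a, b), SCORES.get((b, a), -2))

def word_score(w1, w2):
    # Sum the column-by-column matrix scores, as in the example above.
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def neighbours(word, candidates, T=12):
    # Keep only candidate words scoring at least T against the query word.
    return [c for c in candidates if word_score(word, c) >= T]
```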
Step 2 – Looking for Hits
 • formatdb creates a hash lookup table
 • E.g. ‘KKP’ → 12054345, 23451635,
   23452152
 • Maps each word to an entry in the database
   of proteins
 • Allows us to retrieve sequences which have a
   word match in constant time
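A sketch of such an index as an in-memory dict over a toy database; the sequence names and contents are invented, and real formatdb builds an on-disk structure:

```python
from collections import defaultdict

def build_index(database, W=3):
    # Map every word of length W to the (sequence id, offset) pairs
    # where it occurs, so lookups take constant time.
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - W + 1):
            index[seq[i:i + W]].append((seq_id, i))
    return index

db = {"prot1": "MKKPLLV", "prot2": "AKKPW"}   # hypothetical entries
idx = build_index(db)
```

`idx["KKP"]` then lists every database position containing that word.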
Step 3 – Extending the matches
 • Starting at the word match, extend the
   alignment in both directions
 • Alignment scored using an adapted Smith-
   Waterman algorithm
 • Alignment stopped once score has dropped
   below the specified threshold
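A sketch of the rightward half of ungapped extension with a drop-off rule (the leftward direction is symmetric; the drop-off threshold `X` and caller-supplied scoring function are simplifications of what BLAST actually does):

```python
def extend_right(q, s, q_start, s_start, score_fn, X=5):
    # Extend an ungapped alignment to the right of a seed match,
    # stopping once the running score drops more than X below the best.
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(q) and s_start + i < len(s):
        running += score_fn(q[q_start + i], s[s_start + i])
        if running > best:
            best, best_len = running, i + 1
        if best - running > X:
            break
        i += 1
    return best, best_len
```

It returns the best score found and how far the extension reached before the score decayed.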
Multiple alignments
 • The problem:
 • Pairwise alignment requires n² time
 • Multiple alignment requires nˣ, where x is the
   number of sequences to align
Progressive alignment
 • Pairwise alignment of each combination of
   sequence pairs (requires x²n² time)
 • Use the alignment scores to produce a dendrogram
   using the neighbour-joining method
 • Align sequences sequentially, using
   relationships from tree
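The first stage can be counted directly: x sequences give x(x-1)/2 pairwise alignments, which is where the x²n² cost comes from. A quick sketch:

```python
from itertools import combinations

seqs = ["seq1", "seq2", "seq3", "seq4", "seq5"]   # five sequence labels
# Every unordered pair: 1v2, 1v3, ..., 4v5.
pairs = list(combinations(seqs, 2))
```

For five sequences this yields ten pairwise alignments.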
Progressive alignment
 • For 5 sequences: align 1v2, 1v3 … 4v5
 • Make a tree:
   (figure: guide tree with leaves ordered 1, 3, 4, 5, 2)
Progressive alignment
 1. Align sequences 1 & 3
 2. Align sequences 4 & 5
 3. Align the 1+3 group with the 4+5 group
 4. Add sequence 2 to the alignment
Hidden Markov models
 • Prediction tool
 • Two related things X and Y
 • Using information about X and Y to build
   model
 • Predict Y using X or vice versa
Markov Models
 • Defines a series of states and the
   probability of moving to another state
 • E.g. the weather
   (figure: three states, Sun, Cloud and Rain,
    with transitions between each pair)
Markov Models
 • Example probabilities (each row sums to 1):

                            Weather Today
                            Sun     Cloud   Rain
                 Sun        0.5     0.375   0.125
  Weather
  Yesterday     Cloud       0.25    0.125   0.625

                 Rain       0.25    0.375   0.375

 • This is a state transition matrix (A)
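One step of the chain is a vector-matrix product. A sketch using row-stochastic transition probabilities in the spirit of the table above (each row sums to 1):

```python
STATES = ["Sun", "Cloud", "Rain"]
# A[i][j] = P(today = j | yesterday = i); each row sums to 1.
A = [[0.5,  0.375, 0.125],
     [0.25, 0.125, 0.625],
     [0.25, 0.375, 0.375]]

def step(dist):
    # P(today = j) = sum_i P(yesterday = i) * A[i][j]
    return [sum(dist[i] * A[i][j] for i in range(3)) for j in range(3)]

tomorrow = step([1.0, 0.0, 0.0])   # it was certainly sunny yesterday
```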
Markov Models
 • Initialise the system using a vector of initial
   probabilities (the π vector)
 • In this case, the type of weather on day 1.
Hidden Markov Models
 • Use something observable to predict
   something hidden
 • E.g. Barometric pressure and the weather (if
   you couldn’t see outside)
 • Every observable state is connected to
   every hidden state
Hidden Markov Models


(figure: observable states 28” Hg, 29” Hg, 30” Hg and 31” Hg,
 each connected to the hidden states Sun, Cloud and Rain)
Hidden Markov Models
 • Also need a probability matrix for connections
   between observable & hidden states (confusion
   matrix) (B)

                        Observation (pressure)

                     28”      29”     30”      31”
              Sun   0.60     0.20    0.15     0.05
   Hidden    Cloud  0.25     0.25    0.25     0.25
   state
              Rain  0.05     0.10    0.35     0.50
Hidden Markov Models
 • You can do many different things using
   HMMs:
   – Match a series of observations to a HMM
     (Evaluation)
   – Determine the hidden sequence most likely to
     have generated a sequence of observations
     (Decoding)
   – Determine the parameters (π, A and B) most
     likely to have generated a sequence of
     observations (Learning)
Hidden Markov models
 • There are a series of algorithms that you
   can use to:
   – Evaluate (Forward algorithm)
   – Decode (Viterbi algorithm)
   – Learn (Forward-Backward algorithm)
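A sketch of the decoding step: a small Viterbi implementation over the weather example. The π vector and matrix values here are illustrative (chosen to be row-stochastic), not canonical:

```python
STATES = ["Sun", "Cloud", "Rain"]
OBS = ["28", "29", "30", "31"]                # pressure in inches Hg
pi = [0.6, 0.2, 0.2]                          # assumed initial distribution
A = [[0.5,  0.375, 0.125],                    # transition matrix (rows sum to 1)
     [0.25, 0.125, 0.625],
     [0.25, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],                # emission (confusion) matrix
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]

def viterbi(observations):
    obs = [OBS.index(o) for o in observations]
    # delta[s]: probability of the best path ending in state s so far.
    delta = [pi[s] * B[s][obs[0]] for s in range(3)]
    back = []
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for s in range(3):
            # Best previous state to have transitioned into s.
            p, best = max((prev[r] * A[r][s], r) for r in range(3))
            pointers.append(best)
            delta.append(p * B[s][o])
        back.append(pointers)
    # Trace back the most likely hidden path.
    path = [max(range(3), key=lambda s: delta[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [STATES[s] for s in reversed(path)]
```

With these numbers, a single low-pressure reading decodes to Rain, and repeated high-pressure readings decode to Sun.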
Hidden Markov Models