Alignments, Matrices & Markov Models
Chris Bailey
Bacterial Pathogenesis & Genomics Unit
cmb036@bham.ac.uk

The Matrix
• When analysing nucleotide sequences:
  – Nucleotides match (G=G) …
  – Or they don't (G≠C)
• One nucleotide substitution is no more relevant than another
• (Except in the light of what that nucleotide sequence ends up coding for)

The Matrix
• But in proteins:
  – Some amino acids are readily substitutable (e.g. one hydrophobic residue for another)
  – Others are badly tolerated (e.g. a hydrophobic residue for a charged residue)
• Assuming you want the 2º & 3º structure to stay the same

Accommodating for Change
• So we adapt our scoring function (s)
• So that scores for matches and mismatches take account of amino acid substitutability
• We do this using protein substitution matrices

What's a Matrix?
• A lookup grid (20×20), with each row/column corresponding to an amino acid:

        Cys  Gly  Pro  Ser  Trp
  Cys
  Gly
  Pro
  Ser
  Trp

What's a Matrix?
• Values in the matrix are scores of one amino acid vs another
• For example:
  – Big positive score: the two amino acids are readily substitutable
  – Big negative score: the two amino acids are not substitutable

Part of a Matrix

        Cys  Gly  Pro  Ser  Ala  Trp
  Cys    12   -3   -3    0   -2   -8
  Gly    -3    5   -1    1    1   -7
  Pro    -3   -1    6    1    1   -6
  Ser     0    1    1    2    1   -2
  Ala    -2    1    1    1    2   -6
  Trp    -8   -7   -6   -2   -6   17

How Do We Fill the Table?
• It is obvious what the values should represent (protein similarity)
• But how do we calculate the actual numbers?
• Since we are looking for homology, an evolutionary model is a good place to start

Why Look at Evolution?
• Homology is a binary property
• Similarity ≠ Homology
• Homology indicates that two proteins had a common ancestor

The PAM Matrix
• PAM = Point Accepted Mutation
• Construct 71 phylogenetic trees of protein families
• Observe the amino acid substitutions on each branch of the trees
• We also need the probability of occurrence of each amino acid (p_a)

The PAM Matrix
• Using the substitution data, calculate f_ab, the observed frequency of the mutation a ↔ b
• Note that f_ab = f_ba
• Using this information, calculate f_a, the total number of mutations in which a is involved:

  f_a = \sum_{b \neq a} f_{ab}

The PAM Matrix
• And also calculate f, the total number of occurrences of amino acid substitutions:

  f = \sum_a f_a

• From here we go on to calculate relative mutability:

  m_a = \frac{f_a}{100\,f\,p_a}

The PAM Matrix
• Relative mutability: the probability that a given amino acid will change in the evolutionary period of interest
• Now we calculate the matrix …
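As a concrete illustration of the counting stage above, here is a minimal Python sketch. The pair counts and background frequencies are invented toy numbers over three amino acids, not values from the slides, and the names `pair_counts` and `background` are assumptions made for the example:

```python
from collections import defaultdict

# Toy substitution counts f_ab observed across tree branches
# (hypothetical numbers over three amino acids only; the real PAM
# derivation uses all 20 amino acids and the 71 protein-family trees).
pair_counts = {("A", "S"): 30, ("A", "T"): 10, ("S", "T"): 25}

# Background probabilities of occurrence p_a (also hypothetical).
background = {"A": 0.40, "S": 0.35, "T": 0.25}

# f_ab is symmetric (f_ab = f_ba), so each unordered pair counts
# towards the totals f_a of both residues involved.
f_a = defaultdict(float)
for (a, b), count in pair_counts.items():
    f_a[a] += count
    f_a[b] += count

f_total = sum(f_a.values())  # f, the total over all amino acids

# Relative mutability: m_a = f_a / (100 * f * p_a)
mutability = {a: f_a[a] / (100 * f_total * background[a]) for a in background}

for a, m in sorted(mutability.items()):
    print(f"m_{a} = {m:.5f}")
```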
The PAM Matrix
• A 20×20 matrix, where M_ab is the probability of amino acid a changing into amino acid b
• M_aa = 1 − m_a
• M_ab is more complicated and requires conditional probability, e.g.:

  P(A \text{ and } B) = P(A) \cdot P(B \mid A)

The PAM Matrix
• In this case:

  M_{ab} = P(a \to b \mid a\ \text{changed}) \cdot P(a\ \text{changed})

• Or:

  M_{ab} = m_a \cdot \frac{f_{ab}}{f_a}

The PAM Matrix
• These equations allow us to calculate a PAM1 matrix
• The number after PAM is the number of amino acid substitutions per 100 residues:
  – PAM40 – 40 substitutions per 100 residues
  – PAM250 – 250 substitutions per 100 residues
• All other PAM matrices are calculated by repeated multiplication of the PAM1 matrix

The PAM Matrix
• The final scores in a PAM matrix are expressed as lod (logarithm of odds) scores
• These compare the probability of mutation vs the probability of random occurrence
• This gives the odds ratio M_ab / p_b
• The scoring matrix S is calculated by:

  S_{ab} = 10 \log_{10} \left( \frac{M_{ab}}{p_b} \right)

The Full PAM250 Matrix

  C   12
  S    0    2
  T   -2    1    3
  P   -3    1    0    6
  A   -2    1    1    1    2
  G   -3    1    0   -1    1    5
  N   -4    1    0   -1    0    0    2
  D   -5    0    0   -1    0    1    2    4
  E   -5    0    0   -1    0    0    1    3    4
  Q   -5   -1   -1    0    0   -1    1    2    2    4
  H   -3   -1   -1    0   -1   -2    2    1    1    3    6
  R   -4    0   -1    0   -2   -3    0   -1   -1    2    2    8
  K   -5    0    0   -1   -1   -2    1    0    0    2    0    3    5
  M   -5   -2   -1   -2   -1   -3   -2   -4   -2   -2   -2    0    0    6
  I   -2   -1    0   -2   -1   -3   -2   -3   -2   -3   -2   -2   -2    2    5
  L   -8   -3   -2   -3   -2   -4   -3   -4   -3   -3   -2   -3   -3    4    2    8
  V   -2   -1    0   -1    0   -1   -2   -3   -2   -3   -2   -2   -2    2    4    2    4
  F   -4   -3   -3   -5   -4   -5   -4   -6   -5   -5   -2   -4   -5    0    1    2   -1    9
  Y    0   -3   -3   -5   -3   -5   -2   -4   -4   -4    0   -4   -4   -2   -1   -1   -2    7   10
  W   -8   -2   -5   -6   -6   -7   -4   -7   -7   -5   -3    2   -3   -4   -5   -2   -6    0    0   17
       C    S    T    P    A    G    N    D    E    Q    H    R    K    M    I    L    V    F    Y    W

The BLOSUM Matrix
• Derived using similar mathematical principles to the PAM matrix
• However, the substitution data is derived in a different manner

The BLOSUM Matrix
• The substitution data comes from multiple alignments
• It is a straightforward count of the number of substitutions in the alignment
• The number after BLOSUM (e.g. 62) denotes the minimum level of similarity (as a percentage) between the sequences within the alignments

PAM vs BLOSUM
• Roughly equivalent matrices:
  – PAM100 ↔ BLOSUM90
  – PAM120 ↔ BLOSUM80
  – PAM160 ↔ BLOSUM60
  – PAM200 ↔ BLOSUM52
  – PAM250 ↔ BLOSUM45

BLAST
• The Basic Local Alignment Search Tool
• It is a heuristic (i.e. it does not guarantee optimal results)
• Why?

BLAST
• GenBank currently stores about 10^10 residues of protein data
• Trying to form full alignments against such a huge database is unfeasible (even with vast computing power)
• So we need a shortcut

How BLAST Works
• 3 steps:
  – Compile a list of high-scoring words
  – Search for hits – hits give seeds
  – Extend the seeds

Step 1 – Preprocessing
• BLAST creates a series of words of length W, where:
  – W is 2–4 for proteins
  – W is >10 for DNA
• These words are based on subsequences of the query (Q)

Step 1 – Preprocessing
• Get each subsequence of Q that is of length W
• E.g. for LVNRKPVVP (W = 3): LVN, VNR, NRK, RKP, etc.
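Enumerating the query's words is a simple sliding window. A minimal Python sketch using the slide's example query (the function name `query_words` is invented for illustration):

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every contiguous subsequence (word) of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

print(query_words("LVNRKPVVP"))
# ['LVN', 'VNR', 'NRK', 'RKP', 'KPV', 'PVV', 'VVP']
```

As the next slides describe, BLAST does not keep just these exact words: it expands each one into a neighbourhood of similar words scored with a substitution matrix.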
Step 1 – Preprocessing
• For each of these words, find similar words
• How does BLAST define similarity?
• It uses the scoring matrices to score two words against each other, e.g.:

  W1:  R   K   P
  W2:  R   R   P
       9  -1   7    → Total = 15

Step 1 – Preprocessing
• Two words are similar if their score is greater than a threshold value T (usually T = 12)
• For RKP, the following are examples of high-scoring words:
  – QKP, KKP, RQP, REP, RRP, RKP

Step 2 – Looking for Hits
• formatdb creates a hash lookup table
• E.g. 'KKP' → 12054345, 23451635, 23452152
• This maps each word to the entries in the database of proteins
• It allows us to retrieve the sequences which have a word match in constant time

Step 3 – Extending the Matches
• Starting at the word match, extend the alignment in both directions
• The alignment is scored using an adapted Smith–Waterman algorithm
• Extension stops once the score has dropped below the specified threshold

Multiple Alignments
• The problem:
  – Pairwise alignment requires n² time
  – Multiple alignment requires n^x time, where x is the number of sequences to align

Progressive Alignment
• Perform a pairwise alignment of each combination of sequence pairs (x(x−1)/2 alignments, so on the order of x²n² time)
• Use the alignment scores to produce a dendrogram using the neighbour-joining method
• Align the sequences sequentially, using the relationships from the tree

Progressive Alignment
• For 5 sequences: align 1v2, 1v3, …, 4v5
• Make a tree:
  [Slide figure: a guide tree grouping sequences 1, 3, 4 and 5, with sequence 2 as the outlier]

Progressive Alignment
  [Slide figure: four numbered steps showing the sequences being aligned progressively, in the order given by the guide tree]

Hidden Markov Models
• A prediction tool
• Take two related things, X and Y
• Use information about X and Y to build a model
• Then predict Y using X, or vice versa

Markov Models
• Define a series of states and the probability of moving to another state
• E.g. the weather: Sun, Cloud, Rain

Markov Models
• Example probabilities:

                       Weather today
                       Sun     Cloud   Rain
  Weather    Sun       0.5     0.25    0.25
  yesterday  Cloud     0.375   0.125   0.375
             Rain      0.125   0.625   0.375

• This is the state transition matrix (A)

Markov Models
• Initialise the system using a vector of initial probabilities (the π vector)
• In this case, the type of weather on day 1

Hidden Markov Models
• Use something observable to predict something hidden
• E.g. barometric pressure and the weather (if you couldn't see outside)
• Every observable state is connected to every hidden state:

  Observable:  28" Hg   29" Hg   30" Hg   31" Hg
  Hidden:      Sun      Cloud    Rain

Hidden Markov Models
• We also need a probability matrix for the connections between observable and hidden states – the confusion matrix (B):

           28"    29"    30"    31"
  Sun      0.60   0.20   0.15   0.05
  Cloud    0.25   0.25   0.25   0.25
  Rain     0.05   0.10   0.35   0.50

Hidden Markov Models
• You can do many different things using HMMs:
  – Match a series of observations to an HMM (evaluation)
  – Determine the hidden sequence most likely to have generated a sequence of observations (decoding)
  – Determine the parameters (π, A and B) most likely to have generated a sequence of observations (learning)

Hidden Markov Models
• There is a series of algorithms that you can use to:
  – Evaluate (Forward algorithm)
  – Decode (Viterbi algorithm – see the sketch below)
  – Learn (Forward–Backward algorithm)
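To make the decoding step concrete, here is a minimal Viterbi sketch in Python using the weather example above. Two assumptions to flag: the slides mention the π vector but give no values, so a uniform π is assumed; and the slide's transition table appears transposed (its columns, not its rows, sum to 1), so the values below are the transpose, with rows indexed by yesterday's weather. The `viterbi` function is a generic textbook implementation, not code from the lecture:

```python
states = ["Sun", "Cloud", "Rain"]
observations = ["28", "29", "30", "31"]  # barometric pressure readings (inches Hg)

pi = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform initial probabilities (not on the slides)

# Transition matrix A: A[i][j] = P(weather today = j | weather yesterday = i).
# Transposed from the slide's table so that each row sums to 1.
A = [
    [0.500, 0.375, 0.125],  # yesterday Sun
    [0.250, 0.125, 0.625],  # yesterday Cloud
    [0.250, 0.375, 0.375],  # yesterday Rain
]

# Confusion (emission) matrix B from the slides: B[i][k] = P(observe k | state i).
B = [
    [0.60, 0.20, 0.15, 0.05],  # Sun
    [0.25, 0.25, 0.25, 0.25],  # Cloud
    [0.05, 0.10, 0.35, 0.50],  # Rain
]

def viterbi(obs_seq):
    """Return the most likely hidden state path for a sequence of observations."""
    obs = [observations.index(o) for o in obs_seq]
    n = len(states)
    # Probability of the best path ending in each state, plus backpointers.
    prob = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for k in obs[1:]:
        step = [max((prob[i] * A[i][j], i) for i in range(n)) for j in range(n)]
        back.append([i for _, i in step])
        prob = [p * B[j][k] for j, (p, _) in enumerate(step)]
    # Trace the best path backwards from the most probable final state.
    best = max(range(n), key=lambda j: prob[j])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi(["30", "31", "28"]))  # ['Cloud', 'Rain', 'Sun']
```

The Forward algorithm for the evaluation task has exactly the same recurrence, with the `max` over previous states replaced by a sum.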