VIEWS: 52 PAGES: 66 POSTED ON: 11/8/2009 Public Domain
Project Biol 602 Hidden Markov Models and multiple sequence alignment Mummals • Multiple sequence alignment improved by using hidden Markov Models with local structural information • By Jimin Pei and Nick V. Grishin • Originally published online August 26 2006 in Nucleic Acids Research Introduction to Hidden Markov Models • Demonstrate Hidden Markov Models by using the Fair Bet Casino Problem • Use a fair coin and a biased coin to represent two states. • The observations will be the results of the coin tosses. • Demonstrate a threshold as to when the observation is the result of using a biased coin or a fair coin Find the probability of the Casino problem • The state will consist of the type of coin used (Fair or Biased) and the observations will be the result of the coin tosses (Heads or tails) • The path is the sequence of states. • Apply basic probability to find the probability of the observations given the type of coin used. Decoding Problem Find the optimal hidden path of states given the observations Input: sequence of observations x1,x2,…xn generated by a hidden markov model given the alphabet (Heads or Tails), the probability of moving between states (Fair to Biased) and the emission probabilities (P(head)/biased). Output • A path that maximizes Prob (observed sequences/Path) over all possible paths. • Use the Viterbi Algorithm that finds the most probable path through the hidden States • Use the Forward and Backward algorithm to estimate what a state is at a given point Motivation • Take a sequence and compare it to a set of states maybe two several states of DNA that could possibly represent families of genes or types of genes. • Determine which family the sequence best matches up with. • Determine if a sequence is associated with a gene region or an intergenic regions Viterbi Algorithm • Given sequence of observations and an HMM • Use the Viterbi Algorithm to determine the likely path through the states of the underlying Markov Chain Hidden Markov Model • • • • • • • Consists of States Ex B,G1,G2,G3,E State Transition Probabilities Ex P(G1->G2) Alphabet Ex Q{A.B} … observations Emission Probabilities of the observations of from each state • • • • • • • • • • • • • • • • • • P(B->G1)=.2 P(B->G2)=.3 P(B->G3)=.5 P(G1->E)=.1 P(G2->E)=.2 P(G3->E)=.4 P(G1->G1)=.3 P(G1->G3)=.3 P(G2->G2)=.4 P(G2->G3)=.4 P(G3->G2)=.3 Q={A,B} q1(A)=.5 q1(B)=.5 q2(A)=.1 q2(B)=.9 q3(A)=.9 q3(B)=.1 X=BAB Hidden Markov Model G1 G2 B E G3 Calculations (.2)(.5) (.01)(.3)(.5)=.0015 (.27)(0)(.5)=0 (.05)(0)(.5)=0 (.01)(.3)(.1)=.0003 (.27)(.4)(.1)=.0108 (.05)(.3)(.1)=.0015 ( .0015)(.3)(.5)=.00025 (.0108)(0)(.5)=0 (.0972)(0)(.5)=0 (.0015)(.3)(.9)=.00045 (.0108)(.4).9)=.003888 (.0972)(.3)(.9)=.026244 2.25E-4 (.3)(.9)=.27 .0052488 (.5)(.1)=.05 (.01)(.3)(.9)=.0027 (.0015)(.3)(.1)=4.5E-5 (.27)(.4)(.9)=.0972 (,0108)(.4)(.1)=4,32E-4 (.05)(.3)(.9)=.0135 (.0972)(.3)(.1)=.00296 .0011664 Conclusion: tracing back the dynamic programming routine gives the most probable path as : G2G3G2 Notice that there are three links to each state but only the maximum is used Formal definition • A set Q of states • An alphabet • Transition probability ast for every two states s,t • Emission probability ek(b) for every alphabet symbol b and state k (the probability of “emitting” symbol b in state k) Viterbi Algorithm • Given the Hidden Markov Model • From the observations determine the what the most probable state path is underlying the observations. • This is a dynamic programming algorithm Find the probability of the sequence • Use the forward algorithm to compute the total probability • Use the backward algorithm to compute the total probability Forward Algorithm • Given the previous HMM • What is the probability that the sequence • X=AAB is generated? Find the probabilities of the first given observation f1(1) represents the probability of the first observation ending at state 1 f1(1)=p(01)q1(x1)=(.2)(.5)=.10 f2(1) represents the probability of the first observation ending at state 2 f2(1)=p(02)q2(x1)=(.3)(.1)=.03 f3(1) represents the probability of the first observation ending at state 3 f3(1)=p(03)q3(x1)=(.5)(.9)=.45 Calculate the probability of the first 2 observations ending at state 1 • f1(2) • =p(01)q1(x1)p11*q1(x2) + p(02)q2(x1)p(21)*q1(x2) + p(03)q3(x1)p(31)*q1(x2) • • • • • • =q1(x2)[p(01)q1(x1)p11 + p(02)q2(x1)p(21) +p(03)q3(x1)p(31)] =.5[.2*.5*.3+.3*.1*0+.5*.9*0] =.5[.10*.3 + .03*0 + .45 * 0] =.015 Notice that f1(2)=q1(x2)[f1(1)p(11) + f2(1)p21 + f3(1) p31] Calculate the probability of the first 2 observations ending at state 2 • f2(2) • =p(01)q1(x1)p12*q2(x2) + p(02)q2(x1)p(22)*q2(x2) + p(03)q3(x1)p(32)*q2(x2) • • • • • • • =q2(x2)[p(01)q1(x1)p12 + p(02)q2(x1)p(22) +p(03)q3(x1)p(32)] =.1[.2*.5*.3 + .3*.1*.4 +.5*.9*.3] =.1[.10*.3 + .03 * .4 + .45 * .3] =.1[ .03 + .012 + .135] = .0177 Notice that f2(2)=q2(x2)[f1(1)p(12) + f2(1)p22 + f3(1) p32] Calculate the probability of the first 2 observations ending at state 3 • f3(2) • =p(01)q1(x1)p13*q3(x2) + p(02)q2(x1)p(23)*q3(x2) + p(03)q3(x1)p(33)*q3(x2) • • • • • • • =q3(x2)[p(01)q1(x1)p13 + p(02)q2(x1)p(23) +p(03)q3(x1)p(33)] =.9[.2*.5*.3 + .3*.1*.4 +.5*.9*.3] =. 9[.10*.3 + .03 * .4 + .45 * .3] =.9[ .03 + .012 + .135] = .1593 Notice that f3(2)=q3(x2)[f1(1)p(13) + f2(1)p23 + f3(1) p33] Calculate the probability of all three observations ending at State 1 • The following paths represent the total possible paths ending at state 1. After arriving at state 1 then you must multiply the result by q1(x3) which in our case is q1(„B‟). The observations are „AAB‟. This is given. • f1(2)*p11 f2(2)*p21 f3(2)*p31 • 01 11 11 01 12 21 01 13 31 • 02 21 11 02 22 21 02 13 31 • 03 31 11 03 32 21 03 13 31 • All results multiplied by q1(X3) • f1(3) • =q1(x3)[ f1(2)p11 + f2(2)p21 + f3(2)p31] • =.5[(.015)*(.3) + (.0177)(0) + (.1593)(0)]=.00225 Calculate the probability of all three observations ending at State2 • The following paths represent the total possible paths ending at state 2. After arriving at state 2 then you must multiply the result by q2(x3) which in our case is q2(„B‟). The observations are „AAB‟. This is given. • f1(2)*p12 f2(2)*p22 f3(2)*p32 • 01 11 12 01 12 22 01 13 32 • 02 21 12 02 22 22 02 13 31 • 03 31 12 03 32 22 03 13 32 • All results multiplied by q2(X3) • f2(3) • =q2(x3)[ f1(2)p12 + f2(2)p22 + f3(2)p32] • =.9[(.015)*(.3) + (.0177)(.4) + (.1593)(.3)]=.053433 Calculate the probability of all three observations ending at State 3 • The following paths represent the total possible paths ending at state 3. After arriving at state 1 then you must multiply the result by q3(x3) which in our case is q 3(„B‟). The observations are „AAB‟. This is given. • f1(2)*p13 f2(2)*p23 f3(2)*p33 • 01 11 13 01 12 23 01 13 33 • 02 21 13 02 22 23 02 13 33 • 03 31 13 03 32 23 03 13 33 • All results multiplied by q3(X3) • f3(3) • =q3(x3)[ f1(2)p13 + f2(2)p23 + f3(2)p33] • =.1[(.015)*(.3) + (.0177)(.4) + (.1593)(.3)]=.005937 Final Probability P(x)=P(AAB)=.00225*.1 + .053433*.2 + .005937*.4=.1032864 Summary • • • • • • • • • • • f1(1)=(.5)(.2)=.10 f2(1)=(.1)(.3)=.03 f3(1)=(.9)(.5)=.45 f1(2)=(.5)[(.10)(.3) + (.03)(0) + (.45)(0)] =.015 f2(2)=.1[ (.10)(.3) + (.03)(.4) + (.45)(.3)] = .0177 f3(2)=.9[(.10)(.3) + (.03)(.4) + (.45)(.3)] = .1593 f1(3)=.5[(.015)(.3) + (.0177)(0) + (.1593)(0) ]=.00225 f2(3) =.9[(.015)(.3) + (.0177)(.4) + (.1593)(.3)] =.053433 f3(3) = .1[(.015)(.3) + (.0177)(.4) + (.1593)(.3)] = .005937 P(x)= (.00225)(.1) + (.053433)(.2) + (.005937)(.4) =.0132864 • The previous calculation was the forward algorithm. There is a similar algorithm called the backward algorithm except it reads the sequence backwards. There is a decoding procedure that is based on the Viterbi Algorithm that requires quantities calculated by both the forward and the backward algorithm. These procedures are included in the computations in the program MUMMALS. Backward Algorithm using the same path AAB • • • • • • • • • • • • • • • • b1(3)=p10=.1 represents going from state G1 (and last char) to the very end b2(3)=p20=.2 represents going from state G2 (and last char) to the very end b3(3)=p30=.4 represents going from state G3 (and last char) to the very end b1(2)=p11q1(x3)p10 + p12q2(x3)p20 + p13q3(x3)p30 =p11q1(x3)b1(3) + p12q2(x3)b2(3) + p13q3(x3)b3(3) =p11q1(B)b1(3) + p12q2(B)b2(3) + p13q3(x3)b3(3) = (.3)(.5)(.1) + (.3)(.9)(.2) + (.3)(.1)(.4) =.015 + .054 + .012=.081 b2(2)=p21q1(x3)p10 + p22q2(x3)p20 + p23q3(x3)p30 =p21q1(B)b1(3) + p22q2(B)b2(3) + p23q3(B)b3(3) =(0)(.5)(.1) + (.4)(.9)(.2) +(.4)(.1)(.4) = 0 + .072 + .016 = .088 b3(2)=p31q1(x3)p10 + p32q2(x3)p20 + p33q3(x3)p30 =p31q1(B)b1(3) p32q2(B)b2(3) + p33q3(B)b3(3) =(0)(.5)(.1) + (.3)(.9)(.2) + (.3)(.1)(.4) =0+ .054 + .012=.066 Continue backward algorithm • • • • • • • • • • • • • • • • • • b1(1)=p11q1(x2)[p11q1(x3)p10 + p12q2(x3)p20 + p13q3(x3)p30] + p12q2(X2)[p21q1(x3)p10 + p22q2(x3)p20 + p23q3(x3)p30] + p13q3(x2)[p31q1(x3)p10 + p32q2(x3)p20 + p33q3(x3)p30] = p11q1(x2)*b1(2) + p12q2(X2)b2(2) + p13q3(x2)b3(2) =(.3)(.5)(.081) + (.3)(.1)(.088) + (.3)(.9)(.066) =.03261 b2(1)=p21q1(x2)[p11q1(x3)p10 + p12q2(x3)p20 + p13q3(x3)p30] + p22q2(X2)[p21q1(x3)p10 + p22q2(x3)p20 + p23q3(x3)p30] + p23q3(x2)[p31q1(x3)p10 + p32q2(x3)p20 + p33q3(x3)p30] = p21q1(x2)*b1(2) + p22q2(X2)b2(2) + p23q3(x2)b3(2) =(0)(.5)(.081) + (.4)(.1)(.088) + (.4)(.9)(.066) =.02728 b3(1)=p31q1(x2)[p11q1(x3)p10 + p12q2(x3)p20 + p13q3(x3)p30] + p32q2(X2)[p21q1(x3)p10 + p22q2(x3)p20 + p23q3(x3)p30] + p33q3(x2)[p31q1(x3)p10 + p32q2(x3)p20 + p33q3(x3)p30] = p31q1(x2)*b1(2) + p32q2(X2)b2(2) + p33q3(x2)b3(2) =(0)(.5)(.081) + (.3)(.1)(.088) + (.3)(.9)(.066) =.02046 Continue backward algorithm • • • • • • • • • b1(3)=.1 b2(3)=.2 b3(3)=.4 b1(2)=(.3)(.5)(.1) + (.3)(.9)(.2) + (.3)(.1)(.4)=.081 b2(2)=(0)(.5)(.1) + (.4)(.9)(.2) +(.4)(.1)(.4)=.088 b3(2)=(0)(.5)(.1) + (.3)(.9)(.2) + (.3)(.1)(.4)=.066 b1(1)=(.3)(.5)(.081) + (.3)(.1)(.088) + (.3)(.9)(.066)=.03261 b2(1)=(0)(.5)(.081) + (.4)(.1)(.088) + (.4)(.9)(.066)=.02728 b3(1)=(0)(.5)(.081) + (.3)(.1)(.088) + (.3)(.9)(.066)=.02046 Uses of the forward and backward algorithm • • • • • • • Let i = 1,2, …. , l where x is a sequence = x1,x2,x3, xL Given an HMM model with k states (G1,G2,…Gk E(I,k) = the event where xi is emitted by Gk E(x) is the event of the sequence x being generated It can be shown that P(E(I,k)/E(x))=[fk(i)*bk(i)]/P(x) This is called the posterior probability of state Gk at observation I given x. We can therefore use the forward and backward algorithm to determine the probable states to find the most probable path through the HMM by looking for the maximum value. This can be verified by the Viterbi Algorithm Continue backward algorithm • • • • • Finally P(x)=p01q1(x1)*b1(1) + p02q2(x1)*b2(1) + p03q3(x1)*b3(1) =(.2)(.5)(.03261) + (.3)(.1)(.02728) + (.5)(.9)(.02046) = .0132864 This agrees with the same answer as the forward algorithm. So we found that the probability of generating AAB over this HMM is .0132864 • Build a profile HMM D1 D2 D3 I0 I1 I2 I3 B M1 M2 M3 E Profile HMM • • • • Mj and Ij are called non silent states and Dj is called a silent state The connectivity shown is of length 3 which is equal to the number of match states In general the length is derived from is derived from the given multiple alignment. One simple rule is to set the length equal to the number of columns in a multiple alignment where gaps occupy no more than half of the positions These columns will be called match states. All other columns will be called insert columns. Rules: if a letter is found in the match column we assign a match state to it. If a letter is found in an insert column , we assign an insert state. If a gap is found in a match column we assign a delete state to it. If a gap is found in an insert column we skip and assign nothing. • • Q= {A,B}; F={BAAB,AABA,BBB,AA} • • • • • • • Suppose we 1 2 3 X1: _ B A X2: A _ A X3: _ _ B X4: _ _ A are given a multiple alignment 4 5 6 7 _ A B _ B _ A _ _ B _ B _ _ A _ Using the rule of match columns have not more than 50 % gaps. The match columns are 3, 5 and 6. The other columns are insert columns. Determine the paths • Path1: 0 I0 M1 M2 M3 0 • • • • Path 2: 0 I0 M1 I1 D2 M3 0 Path 3: 0 M1 M2 D3 I3 0 Path 4: 0 M1 D2 M3 0 We can now determine the transitions probabilities from state to state Transition Probabilities • • • • • • • • • • p(0 I0)=.5 p(M1 I1)=.25 p(M2 I2)=0 p(M3 I3)=0 p(I0 I0)=0 p(I1 I1)=0 p(I3 I3)=0 p(D2 I2)=0 p(D3 I3)=1 p(0 D1)=0 p(M1 D2)=.25 p(M2 D3)=.5 p(M3 0)=1 p(I0 D1)=0 p(I1 D2)=1 p(I3 0)=1 p(D2 M3)=1 p(D3 0=0 p(0 M1=.5) p(M1 M2)=.5 p(M2 M3)=.5 p(I0 M1)=1 p(I1 M2)=0 p(D2 D3)=0 There is no transition probabilities starting with I2 because I2 does not appear in the paths. It is a common problem when developing HMM so adding pseudocounts is frequently used. Emission probabilities • • • • • • • • qM1(A)= .75 qM2(A)=.5 qM3(A)=2/3 qI0(A)=.5 qI1(A)=0 qI3(A)=0 qM1(B)=.25 qM2(B)=,5 qM3(B)=1/3 qI0(B)=.5 qI1(B)=1 qI3(B)=1 No emission probabilities for I2 since it is not contained in the data This was an example of building a HMM from a given alignment. Basic Problems with HMMs • 1) Given multiple sequence alignment, how to build a profile HMM. Create Profile HMMs that model multiple alignment. Set up connectivity. Determine transition and emission probabilites • 2) How to evaluate the probability of sequence given a Profile HMM. Use it to detect potential membership of a query sequence in a sequence family. Use the forward or backward algorithm to determine the probability. • 3)How to find the most probable path between a sequence and a profile HMM. Use Viterbi‟s algorithm Multiple sequence alignment by Profile HMM • • • • • • • • • Consider the Family F given by the alphabet {A,C,G,T} and F = { x1= ACCG, x2 = TCCG, x3 = AT, x4 = CG , X5=ATC, X6 = TTG} Since the average length of the sequences is 3 we will use the previous model with 3 Match States to determine the multiple alignment. Use the Baum-Welch training to determine the transition probabilities and the emission probabilities. The next step is to use the Viterbi Algorithm for each sequence to determine the most likely paths through the resulting model. The paths are Path1 0 M1 M2 M3 I3 0 Path2 0 I0 D1 I1 M2 D3 I3 0 Path3 0 D1 M2 M3 0 Path4 0 M1 M2 D3 0 Path5 0 D1 M2 D3 I3 I3 0 Path 6 0 D1 D2 M3 I3 I3 0 Multiple sequence alignment by Profile HMM • • • • • • • • • • • • • • • • • • • • These paths lead to three possible multiple alignments Using the match state criteria of at least two nucleotides are needed for a match state and not attempting to align I3 putting those nucleotides in lower case. We get X1: _ _ A C C g X2: T G _ C _ g X3: _ _ _ A T _ _ X4: _ _ C G _ _ _ X5: _ _ _ A _ t c X6: _ _ _ _ T t g X1: X2: X3: X4: X5: X6: X1: X2: X3: X4: X5: X6: _ T _ _ _ _ A _ _ C _ _ A_ _G _ _ C_ _ _ _ _ C C A G A _ Cg _g T_ __ _ t T t _ _ c g _ _ CCg TG C_ g _ _ A T_ _ _ _ G__ _ __ A_ t c __ _ T tg Example: Two dices1 • One dice is fair, producing {1,2,3,4,5,6} with equal probability. The other one is loaded and produce number „6‟ with higher probability than the rest numbers. • Each time only one dice is rolled and produces numbers in the set {1,2,3,4,5,6} which can be observed. But we do not know which of the two dices are rolled. • The transitions between the fair dice and loaded dice follow the Markov chain properties. 6 4 5 3 1 3 2 6 5 1 4 5 6 3 6 6 6 4 6 3 1 6 5 6 6 6 2 4 5 3 … F F F F F F F F F F F F L L L L L L L L L L L L L L F F F F … S1 0.9 S2 0.1 0.9 fair 0.1 loaded 1: 1/6 2: 1/6 3: 1/6 4: 1/6 5: 1/6 6: 1/6 1: 1/10 2: 1/10 3: 1/10 4: 1/10 5: 1/10 6: 1/2 Modeling CpG islands • We can treat the CpG island and non-CpG island DNA sequences as two distinct hidden states, each of which emits {A, T, C, G} with its own probabilities. S1 S2 0.1 NonCpG 0.9 0.9 CpG 0.1 A: 1/6 T: 1/6 C: 1/3 G: 1/3 A: 1/4 T: 1/4 C: 1/4 G: 1/4 Elements of a hidden markov model An HMM is characterized by the follow elements: 1. 2. 3. N, the number of hidden states. {S1, S2, … SN} ( N = 2: {Fair dice, loaded dice} ) M, the number of distinct observations. {V1, V2, … VM} (M = 6: {1, 2, 3, 4, 5, 6}) Transition probabilities between the hidden states: aij = P(Sj | Si is the previous state) for each i, j from 1 to N. ( P(F|F) = 0.9, P(F|L)=0.1, P(L|L) = 0.9, P(L|F) = 0.1 ) Emission probability of each observation for each hidden state: bi(k) = P(Vk|Si) (bFair(k) = 1/6 for k = 1,2,3,4,5,6; bLoaded(k)=1/10 for k = 1,2,3,4,5 and bLoaded(6) = 1/2) Initial probabilities: probability of each state being the first state. 4. 5. Modeling CpG islands • Number of hidden states: 2 (CpG state and non-CpG state) • Number of observations: 4 (A, T, C, G) • Transition probabilities P(S1S2) = 0.1 P(S1S1) = 0.9 P(S2S1) = 0.1 P(S2S2) = 0.9 • Emission probabilities: S1: P(A)=P(T) = 1/6 P(C)=P(G)= 1/3 S2: P(A)=P(T)=P(C)=P(G)=1/4 S1 S2 0.1 NonCpG 0.9 0.9 CpG 0.1 A: 1/6 T: 1/6 C: 1/3 G: 1/3 A: 1/4 T: 1/4 C: 1/4 G: 1/4 Pairwise alignment HMM Hidden states: M state: emit an aligned amino acid pair X state: only emit an amino acid in sequence 1 Y state: only emit an amino acid in sequence 2 Observations: amino acid sequences 1 and 2 Sequence 1: EGKNERTYCVHG---GER Sequence 2: EA--DKTFVVEGATLGEP Hidden state path: MMXXMMMMMMMMYYYMMM 1-2*d e X 1-e d M 1-e d Y e Question: how many observations? Variations of profile HMMs • BLOCKS1-type: one motif no insertions or deletions allowed. Begin M M M end Variations of profile HMMs • BLOCKS-type: one motif no insertions or deletions allowed. Begin M M M end • META-MEME2: multiple motifs, no insertions or deletions allowed in motifs I I I Begin M M M M M M end 2. Grundy et al. http://www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html Variations of profile HMMs • BLOCKS-type: one motif no insertions or deletions allowed. Begin M M M end • META-MEME: multiple motifs, no insertions or deletions allowed in motifs I I I Begin M M M M M M end C-terminal unaligned • HMMER23 in pfam N-terminal unaligned 3. Eddy S. http://hmmer.wustl.edu/ Allow multi-domain match Model construction: which columns correspond to match states? gi|48849187 gi|52011825 gi|13476651 gi|13477050 gi|16264316 gi|48862556 gi|52012332 gi|22956654 gi|6714967 gi|46578620 gi|46581270 gi|48763048 gi|23115751 gi|21242355 gi|34540133 ---------------MKALTPAKVAP-LYKAWYWDKVGGDDLPA-PIALC ----LRRRQ-LSSRSVKLITMEEVKA-IYRSQYWDKVGGDELPG-GLDYC ----RKGKG-LATRSVKGITTDELNA-IYDRQYWDAVRGDDLPA-GVDYV ----RKGKG-LATRSVKSITTAELNE-IYDRQYWDAVQGDALPA-GVDYV ----------MSADQVRAMSREEAAD-IYRRSYWPQCGVDLLPP-GLDYA ----L-GRP-ASIADVKAMDIETAKE-IYLANYYYKPRINGLPD-EIQPL --DLT-GDE-VDARDVQALSRKEAVE-IFLREYYRRPRLHLLPA-PLQPS --DTT-GDG-VQVEDVRALTHAQAES-IFIEHYFRRPGLGNLPE-PLQPG --DLN-RDG-VSARDLRLVSPDQAAE-IYVSQYFKAPKLDQLPA-VVQAP EGDVD-GDGDVDADDIRGLTPEQATA-LYHRHFWLRPGLQSLPQ-GIAIC EGDVD-GDGDVDADDIRGLTPEQATA-LYHRHFWLRPGLQSLPQ-GIAIC --DLD-GDGDTDKDDIVLVTRDRAAA-LYRRDFFYGPRLHALPC-SLWPV --ALL-GVAP-TLENLQALTEPQAAA-LYKALYWDRIDGDAIASQLLAEI --DLL-GIEP-SLANLRALTDAQAGV-IYKAQYWDRLRADQIALQPLANI --DKD-GDGDIDVEDLKLLTDDDVLNRVLKPFYWDRWKADLIESQKVANI *** *** **************** ***************** ***** One simple way to determine the match states is to use a gap threshold, e.g. 50%. There are also ways to find the optimal model by probability estimation methods. Nucleic Acids Research "> Pei, J. et al. Nucl. Acids Res. 2006 34:4364-4374; doi:10.1093/nar/gkl514 Copyright restrictions may apply. Software to experiment with • Mummals: multiple sequence alignment improved by using hidden Markov models with local structural information • This program constructs multiple protein sequence alignments using probabilistic consistency. • Mummals improves alignment with multiple match states that describe local structural information Nucleic Acids Research "> Pei, J. et al. Nucl. Acids Res. 2006 34:4364-4374; doi:10.1093/nar/gkl514 Copyright restrictions may apply. Description Sequences 1 and 2, uppercase and lowercase letters represent aligned core blocks and unaligned core blocks respectively. If 2 corresponding unaligned regions bounded by the same two core blocks are of different length we split the shorter one into two pieces and introduce contigious gaps in the middle. For both N- and C- terminus ends, the shorter unaligned region is pushed toward the core blocks. Secondary structure (SS)(h helix; e strand; c coil) is shown for the first sequence. The hidden paths are shown below the amino acid sequences. First Model • Model Structure HMM_1_1_0 • Residue pairs in the unaligned regions using the same match state as those in the aligned blocks . Insertions in the first and secons sequence are modeled by using states x and y. Second Model • HMM_1_1_1 • Residues in the unaligned region are modeled using a different match state ( U) that the ,match states in the core blocks(M) Third Model • HMM_1_3_1 • Residue pairs in the aligned core blocks are modeled using 3 match states H ( Helix) S (Strand) and C for Coil according to the three secondary structures of the first sequence. Some methods for Constructing multiple alignments • Method1 – guided by a tree that reflects the similarities among sequences, a progressive alignment makes a series of pairwise alignments for neighboring sequences or pre-aligned groups. In this way, similar sequences are aligned prior to divergent sequences. Method2 – To correct or minimize errors in progressive alignment an iterative refinement of the alignment after progressive steps is performed by repeatedly dividing the aligned sequences into two groups and then realigning the groups. Method 3 – makes a consistency measure among a set of pairwise a sequence alignment before the progressive alignment steps. Method 4 – another approach to constructing a pairwise refinement is to use simple HMM for global pairwise alignment. Aligned residue pairs are modeled by a hidden match state while insertions and deletions are modeled by two hidden states that generate unmatched residues in either of the two sequences. • • • MUMMALS improves alignment by 1) More complex pairwise alignment HMM incorporates local structural alignment 2) Better estimation of HMM parameters from a large set of structural alignments of domain pairs from the SCOP database MUMMALS • As mentioned the novelty of these HMMs is the introduction of more match sites based on DaliLite Structural alignments, which have aligned core blocks (structurally superimposable) shown in uppercase letters and unaligned regions (structurally not superimposable) in lower case letters. Letters aligned in the core blocks are modeled by a match state (M). If two unaligned regions in between two core blocks have different and non-zero lengths they can be modeled by a different match state(U) or X and Y if there is a deletion in one of the sequences. This is the model HMM_1_1_1 which has two match states M and U. • H_1_3_1 incorporates the match states Helix, Strand and Coil. The three match states have different sets of emission probabilities for the 400 residue pairs. Pairwise alignment with optimal match probabilities • • • • forward and backward algorithms can be applied to align two sequences x={x1,x2,…..,xm} and y={y1,y2,…..,yn} Given an HMM and the forward algorithm calculates the probability of observing two subsequences {x1,x2,….,xi} and {y1,y2,…….,yj} with the last two positions been in a Hidden State. (i, j)=K where k is a hidden state form the set {„H‟, „S‟, „C‟ , „U‟, „X‟ , „Y‟} The backward algorithm calculates the probability of generating two subsequences {xi+1,……, xm} and {yj+1,…….,yn} given that the previous position of the two subsequences is of a certain type namely that the state at (i,j) is K. The alignment path is then found that maximizes the sum of match probabilities of the residue pairs. Procedure • First an initial tree is built form the sequences • An initial alignment is then built progressively guided by the tree with a simple sum-of pairs scoring method. • A second tree is built using the UPGMA method based on sequence identities calculated form the initial alignment. • For each sequence pair the match probabilities are calculated using one of the HMMs. • Finally MUMMALS progressively aligns the sequences guided by the second tree using a consistency based scoring function. Comparison of other tools • Evaluate and compare statistically accuracy of Mummals versus other tools. • Several leading aligners such as ProbCons, MAFFT,Clustalw and MUSCLE will be used to test the statistical accurracy versus Mummals Comparison • Mummals out performed the aforementioned multiple alignment techniques according to their Q - scores and the wilcoxin signed rank test with a p value less than 0.01 according to their testing data. • A Q-score is the fraction of the number of the number of correctly aligned residue pairs in the test alignment to the total number of pairs in the reference alignment. Analysis of Time Complexity • Trade off for improvement of alignment quality with more complex model structures is more time for computation Running time • The rate limiting steps are the computation of the match probabilities using the forward and backward algorithms. • The time order is (N^2)(L^2)(K^2) • N is the number of sequences • L is the average length • K is the number of hidden states Example of running time • The medium cpu time of MUMMALS with model HMM_1_3_1 on 1785 SCOP40 pairs is 174 s per alignment as compared to 2.5s per alignment for MAFFT • MUMMALS is much slower than MAFFT, MUSCLE and ClustalW Supplementary • PROMALS constructs multiple protein sequence alignments using information from database searches and secondary structure prediction. PROMALS may be an improvement over MUMMALS