# Sequence Alignments

```
Alignments, Matrices & Markov Models
Chris Bailey
Bacterial Pathogenesis & Genomics Unit
cmb036@bham.ac.uk
The Matrix
• When analysing nucleotide sequences:
– Nucleotides match (G=G) …
– Or they don’t (G≠C)
• One nucleotide substitution is no more
relevant than another
• (Except in the light of what that nucleotide
sequence ends up coding for)
The Matrix
• But in proteins:
– Some amino acids are readily substitutable (e.g.
one hydrophobic residue for another)
– And others are badly tolerated (e.g. a
hydrophobic residue for a charged residue)
• Assuming you want the 2º & 3º structure to
be the same
Accommodating for Change
• So we adapt our scoring function (s)
• So that scores for matches and mismatches
take account of amino acid substitutability
• We do this using protein substitution
matrices
What’s a Matrix
• Lookup grid (20x20), with each row/column
containing an amino acid

      Cys  Gly  Pro  Ser  ...  Trp
Cys
Gly
Pro
Ser
...
Trp
What’s a Matrix
• Values in the matrix are scores of one
amino acid vs another
• For example:
– Big positive score: amino acid 1 is readily
substitutable for amino acid 2
– Big negative score: amino acid 1 is not readily
substitutable for amino acid 2
Part of a Matrix

      Cys   Gly   Pro   Ser   Ala   Trp
Cys    12    -3    -3     0    -2    -8
Gly    -3     5    -1     1     1    -7
Pro    -3    -1     6     1     1    -6
Ser     0     1     1     2     1    -2
Ala    -2     1     1     1     2    -6
Trp    -8    -7    -6    -2    -6    17
How do we fill the table?
• Obvious what the values should represent
(protein similarity)
• But how do we calculate the actual numbers?
• Since we are looking for homology, an
evolutionary model is a good place to start
Why look at evolution?
• Homology is a binary property
• Similarity ≠ Homology
• Homology indicates two proteins had a
common ancestor
The PAM Matrix
• PAM = Point Accepted Mutation
• Construct 71 phylogenetic trees of protein
families
• Observe amino acid substitutions on each
branch of tree
• Also need probability of occurrence for each
amino acid (pa)
The PAM Matrix
• Using substitution data calculate fab, the
observed frequency of the mutation a ↔ b
• Also note that fab = fba
• Using this information calculate fa, the total
number of mutations in which a is involved:

fa = Σ(b≠a) fab
The PAM Matrix
• And also calculate f, the total occurrences of
amino acid substitutions:

f = Σa fa

• From here we go on to calculate relative
mutability:

ma = fa / (100 · f · pa)
The PAM Matrix
• Relative mutability: Probability that a given
amino acid will change in the evolutionary
period of interest
• Now we calculate the matrix …
The PAM Matrix
• 20 x 20 Matrix where Mab is the probability
of amino acid a changing into amino acid b
• Maa = 1 – ma
• Mab is more complicated & requires
conditional probability
– E.g. P(A and B) = P(A)∙P(B|A)
The PAM Matrix
• In this case:

Mab = P(a → b | a changed) · P(a changed)

• Or:

Mab = ma · fab / fa
The PAM Matrix
• These equations allow us to calculate a
PAM1 matrix
• The number after PAM is the number of
amino acid substitutions per 100 residues:
– PAM40 – 40 substitutions per 100 residues
– PAM250 – 250 substitutions per 100 residues

• All higher PAM matrices are calculated by
repeated multiplication of the PAM1 matrix
The PAM Matrix
• The final scores in a PAM matrix are
expressed as a lod (logarithm of odds) score
• Compare probability of mutation vs probability
of random occurrence
• Gives the odds ratio: Mab / pb
• Scoring Matrix S is calculated by:

Sab = 10 · log10(Mab / pb)
The full PAM 250 matrix
C   12
S    0   2
T   -2   1    3
P   -3   1    0    6
A   -2   1    1    1    2
G   -3   1    0    -1   1    5
N   -4   1    0    -1   0    0    2
D   -5   0    0    -1   0    1    2    4
E   -5   0    0    -1   0    0    1    3    4
Q   -5   -1   -1   0    0    -1   1    2    2    4
H   -3   -1   -1   0    -1   -2   2    1    1    3    6
R   -4   0    -1   0    -2   -3   0    -1   -1   2    2    8
K   -5   0    0    -1   -1   -2   1    0    0    2    0    3    5
M   -5   -2   -1   -2   -1   -3   -2   -4   -2   -2   -2   0    0    6
I   -2   -1   0    -2   -1   -3   -2   -3   -2   -3   -2   -2   -2   2    5
L   -8   -3   -2   -3   -2   -4   -3   -4   -3   -3   -2   -3   -3   4    2    8
V   -2   -1   0    -1   0    -1   -2   -3   -2   -3   -2   -2   -2   2    4    2    4
F   -4   -3   -3   -5   -4   -5   -4   -6   -5   -5   -2   -4   -5   0    1    2    -1   9
Y    0   -3   -3   -5   -3   -5   -2   -4   -4   -4   0    -4   -4   -2   -1   -1   -2   7   10
W   -8   -2   -5   -6   -6   -7   -4   -7   -7   -5   -3   2    -3   -4   -5   -2   -6   0    0   17
C    S    T    P    A    G    N    D    E    Q    H    R    K    M     I   L    V    F   Y    W
The BLOSUM Matrix
• Derived using similar mathematical
principles as a PAM matrix
• However substitution data is derived in a
different manner
The BLOSUM Matrix
• Substitution data comes from multiple
alignments.
• Straightforward count of the number of
substitutions in the alignment
• The number after BLOSUM (e.g. 62) denotes
the minimum level (as a percentage) of similarity
between sequences within the alignments.
PAM Vs BLOSUM
•   PAM100 ↔ BLOSUM90
•   PAM120 ↔ BLOSUM80
•   PAM160 ↔ BLOSUM60
•   PAM200 ↔ BLOSUM52
•   PAM250 ↔ BLOSUM45
BLAST
• The Basic Local Alignment Search Tool
• It is a heuristic (i.e. it does not guarantee
optimal results)
• Why …
BLAST
• GenBank currently stores about 10^10
residues of protein data.
• Trying to form alignments against such a
huge database is infeasible (even with vast
computing power)
• So we need a shortcut
How BLAST Works
• 3 steps:
– Compile a list of high-scoring words
– Search for hits – hits give seeds
– Extend seeds
Step 1 - Preprocessing
• BLAST creates a series of words, of length
W where
• W is 2..4 for proteins
• W is >10 for DNA
• These words are based on subsequences
of the query (Q)
Step 1 - Preprocessing
• Get each subsequence of Q, that is of
length W
• E.g.
LVNRKPVVP
LVN
VNR
NRK
RKP
etc…
Step 1 - Preprocessing
•   For each of these words, find similar words
•   How does BLAST define similarity?
•   It uses scoring matrices to score 2 words
•   E.g.

W1:  R   K   P
W2:  R   R   P
     9  -1   7    Total = 15
Step 1 - Preprocessing
• Words are similar if their score is greater
than a value T (T = 12 usually)
• For RKP the following are examples of high
scoring words
– QKP, KKP, RQP, REP, RRP, RKP
Step 2 – Looking for Hits
• formatdb creates a hash lookup table
• E.g. ‘KKP’ → 12054345, 23451635,
23452152
• Maps each word to an entry in the database
of proteins
• Allows us to retrieve sequences which have a
word match in constant time
Step 3 – Extending the matches
• Starting at the word match, extend the
alignment in both directions
• Alignment scored using an adapted Smith-
Waterman algorithm
• Alignment stopped once score has dropped
below the specified threshold
Multiple alignments
• The problem:
• Pairwise alignment requires n^2 time
• Multiple alignment requires n^x where x is the
number of sequences to align
Progressive alignment
• Pairwise alignment of each combination of
sequence pairs (requires x·n^2 time)
• Use alignment scores to produce a dendrogram
using the neighbour-joining method
• Align sequences sequentially, using
relationships from tree
Progressive alignment
• For 5 sequences: align 1v2, 1v3 … 4v5
• Make a tree:

[figure: guide tree (dendrogram) relating sequences 1-5]
Progressive alignment

[figure: sequences 1-5 joined into the alignment step by step, following the guide tree]
Hidden Markov models
• Prediction tool
• Two related things X and Y
• Use information about X and Y to build a
model
• Predict Y using X or vice versa
Markov Models
• Defines a series of states and the
probability of moving to another state
• E.g. the weather, with states Sun, Cloud
and Rain
Markov Models
• Example probabilities:

                      Weather Today
                   Sun     Cloud   Rain
Weather    Sun     0.5     0.25    0.25
Yesterday  Cloud   0.375   0.125   0.375
           Rain    0.125   0.625   0.375

• This is a state transition matrix (A)
Markov Models
• Initialise the system using a vector of initial
probabilities (the π vector)
• In this case the type of weather on day 1.
Hidden Markov Models
• Use something observable to predict
something hidden
• E.g. Barometric pressure and the weather (if
you couldn’t see outside)
• Every observable state is connected to
every hidden state
Hidden Markov Models

Observable:  28” Hg   29” Hg   30” Hg   31” Hg

Hidden:      Sun      Cloud    Rain
Hidden Markov Models
• Also need a probability matrix for connections
between observable & hidden states (confusion
matrix) (B)

                    Observation
                 28”     29”     30”     31”
Hidden   Sun     0.60    0.20    0.15    0.05
state    Cloud   0.25    0.25    0.25    0.25
         Rain    0.05    0.10    0.35    0.50
Hidden Markov Models
• You can do many different things using
HMMs
– Match a series of observation to a HMM
(Evaluation)
– Determine the hidden sequence most likely to
have generated a sequence of observations
(Decoding)
– Determining the parameters (π, A and B) most
likely to have generated a sequence of
observations (Learning)
Hidden Markov models
• There are a series of algorithms that you
can use to
– Evaluate (Forward algorithm)
– Decode (Viterbi algorithm)
– Learn (Forward-Backward algorithm)
```
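The PAM arithmetic in the slides (fab, fa, f, relative mutability ma, transition probabilities Mab and the lod score Sab) can be sketched numerically. This is a minimal sketch using an invented three-letter alphabet with made-up counts and frequencies, not real Dayhoff data:

```python
import math

# Invented substitution counts f_ab and background probabilities p_a
counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}
p = {"A": 0.4, "B": 0.35, "C": 0.25}

def f_ab(a, b):
    # f_ab = f_ba: an a <-> b mutation counts in both directions
    return counts.get((a, b), counts.get((b, a), 0))

def f_a(a):
    # f_a: total number of mutations in which a is involved
    return sum(f_ab(a, b) for b in p if b != a)

# f: total occurrences of amino acid substitutions
f = sum(f_a(a) for a in p)

def m(a):
    # relative mutability: m_a = f_a / (100 * f * p_a)
    return f_a(a) / (100 * f * p[a])

def M(a, b):
    # transition probability: M_aa = 1 - m_a, M_ab = m_a * f_ab / f_a
    if a == b:
        return 1 - m(a)
    return m(a) * f_ab(a, b) / f_a(a)

def S(a, b):
    # lod score: S_ab = 10 * log10(M_ab / p_b)
    return 10 * math.log10(M(a, b) / p[b])
```

A useful sanity check is that each row of M sums to 1, since M_aa absorbs exactly the probability mass that the off-diagonal entries spend.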
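BLAST's step 1 (building the query word list, then the high-scoring neighbourhood) can be sketched as below. The five-letter alphabet and score table are invented stand-ins for a real substitution matrix; only the word RKP and the threshold T = 12 come from the slides:

```python
from itertools import product

# Invented alphabet and symmetric score table (illustration only)
ALPHABET = "RKPQE"
SCORES = {
    ("R", "R"): 9, ("K", "K"): 8, ("P", "P"): 7, ("Q", "Q"): 6, ("E", "E"): 6,
    ("R", "K"): 3, ("R", "Q"): 2, ("K", "Q"): 2, ("K", "E"): 1, ("Q", "E"): 2,
}

def score(a, b):
    # symmetric lookup; unlisted pairs get a mildly negative score
    return SCORES.get((a, b), SCORES.get((b, a), -1))

def query_words(query, W=3):
    # step 1a: every subsequence of the query of length W
    return [query[i:i + W] for i in range(len(query) - W + 1)]

def neighbourhood(word, T=12):
    # step 1b: every word of the same length scoring at least T against `word`
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        if sum(score(a, b) for a, b in zip(word, cand)) >= T:
            hits.append("".join(cand))
    return hits
```

With this toy table, the slide's examples (QKP, KKP, RQP, REP, RRP and RKP itself) all clear the threshold for RKP.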
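Step 2's hash lookup (the formatdb-style word index) can be sketched as a plain dictionary; the two protein fragments here are invented:

```python
from collections import defaultdict

def index_database(sequences, W=3):
    # formatdb-style index: map every length-W word to (sequence id, offset)
    table = defaultdict(list)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - W + 1):
            table[seq[i:i + W]].append((seq_id, i))
    return table

# Two invented protein fragments standing in for a real database
db = {"prot1": "AARKPVVP", "prot2": "KKPRKP"}
index = index_database(db)
```

Once built, `index["RKP"]` returns [("prot1", 2), ("prot2", 3)]: every hit location in constant time, independent of database size.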
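Step 3's extension can be sketched as a simple ungapped extension with a drop-off rule, stopping once the running score falls a fixed amount below the best score seen so far. This is a simplification of the adapted Smith-Waterman scoring the slides mention, and the match/mismatch/drop-off values are invented:

```python
def extend_right(query, subject, qi, si, match=1, mismatch=-2, dropoff=3):
    # Extend a word hit rightwards from query[qi] / subject[si],
    # stopping when the running score drops `dropoff` below the best seen.
    best_score = running = 0
    best_len = length = 0
    while qi + length < len(query) and si + length < len(subject):
        running += match if query[qi + length] == subject[si + length] else mismatch
        length += 1
        if running > best_score:
            best_score, best_len = running, length
        if best_score - running > dropoff:
            break  # score has fallen too far below the peak: stop extending
    return best_score, best_len
```

A full implementation would extend in both directions from the seed and report the highest-scoring span.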
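The weather Markov chain can be sketched directly from the transition matrix A given in the slides; the initial-probability (π) vector is an assumption, since the slides do not give one:

```python
# State transition matrix A from the slides: A[yesterday][today]
A = {
    "Sun":   {"Sun": 0.5,   "Cloud": 0.25,  "Rain": 0.25},
    "Cloud": {"Sun": 0.375, "Cloud": 0.125, "Rain": 0.375},
    "Rain":  {"Sun": 0.125, "Cloud": 0.625, "Rain": 0.375},
}
# Assumed initial (pi) vector for day 1 -- not given on the slides
PI = {"Sun": 0.6, "Cloud": 0.3, "Rain": 0.1}

def sequence_probability(states):
    # P(s1..sn) = pi(s1) * product of A[prev][cur] along the sequence
    prob = PI[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob
```

For example, `sequence_probability(["Sun", "Cloud", "Rain"])` multiplies 0.6 × 0.25 × 0.375.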
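The HMM algorithms named at the end (the Forward algorithm for evaluation, Viterbi for decoding) can be sketched with the slides' A and B matrices; the π vector is again an assumption:

```python
STATES = ["Sun", "Cloud", "Rain"]

# State transition matrix A from the slides: A[yesterday][today]
A = {
    "Sun":   {"Sun": 0.5,   "Cloud": 0.25,  "Rain": 0.25},
    "Cloud": {"Sun": 0.375, "Cloud": 0.125, "Rain": 0.375},
    "Rain":  {"Sun": 0.125, "Cloud": 0.625, "Rain": 0.375},
}
# Confusion matrix B from the slides: P(observed pressure | hidden weather)
B = {
    "Sun":   {'28"': 0.60, '29"': 0.20, '30"': 0.15, '31"': 0.05},
    "Cloud": {'28"': 0.25, '29"': 0.25, '30"': 0.25, '31"': 0.25},
    "Rain":  {'28"': 0.05, '29"': 0.10, '30"': 0.35, '31"': 0.50},
}
# Assumed initial (pi) vector -- the slides do not give one
PI = {"Sun": 0.6, "Cloud": 0.3, "Rain": 0.1}

def forward(obs):
    # Evaluation: total P(observations | model), summed over all hidden paths
    alpha = {s: PI[s] * B[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: B[s][o] * sum(alpha[r] * A[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(obs):
    # Decoding: the single most probable hidden state sequence
    delta = {s: PI[s] * B[s][obs[0]] for s in STATES}
    path = {s: [s] for s in STATES}
    for o in obs[1:]:
        new_delta, new_path = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: delta[r] * A[r][s])
            new_delta[s] = delta[prev] * A[prev][s] * B[s][o]
            new_path[s] = path[prev] + [s]
        delta, path = new_delta, new_path
    best = max(STATES, key=lambda s: delta[s])
    return path[best], delta[best]
```

The Viterbi probability of the single best path is always at most the Forward total, which sums over every possible hidden path. Learning (Forward-Backward) is omitted here.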