# PSSM and HMM

```
Scoring Matrices
Different types of matrices used:
- PSSM = Position-Specific Scoring Matrix
- PAM matrices
- BLOSUM = BLOcks SUbstitution Matrix
Position-Specific Scoring Matrix
- A PSSM is a motif descriptor
- The descriptor includes a weight (score, probability) for each symbol
  occurring at each position along the motif
- Examples of motifs:
  - Protein active sites, structural elements, zinc fingers, intron/exon
    boundaries, transcription-factor binding sites, etc.
Position-Specific Scoring Matrix
Construction of a PSSM is a multi-stage process:
1. Define the architecture of the matrix
2. Create the multiple alignment from which the matrix is derived
3. Calculate the frequencies for each position
4. Apply BLAST to the PSSM
Position-Specific Scoring Matrix
- 10 vertebrate donor site sequences aligned at the exon/intron boundary:
seq 1    GAGGTAAAC
seq 2    TCCGTAAGT
seq 3    CAGGTTGGA
seq 4    ACAGTCAGT
seq 5    TAGGTCATT
seq 6    TAGGTACTG
seq 7    ATGGTAACT
seq 8    CAGGTATAC
seq 9    TGTGTGAGT
seq 10   AAGGTAAGT
Position-Specific Scoring Matrix
- Calculate the absolute frequency of each nucleotide at each position:

      1    2    3    4    5    6    7    8    9
A     3    6    1    0    0    6    7    2    1
C     2    2    1    0    0    2    1    1    2
G     1    1    7   10    0    1    1    5    1
T     4    1    1    0   10    1    1    2    6
Position-Specific Scoring Matrix
- Calculate the relative frequency of each nucleotide at each position
  (absolute frequency divided by the 10 aligned sequences):

      1     2     3     4     5     6     7     8     9
A     0.3   0.6   0.1   0     0     0.6   0.7   0.2   0.1
C     0.2   0.2   0.1   0     0     0.2   0.1   0.1   0.2
G     0.1   0.1   0.7   1     0     0.1   0.1   0.5   0.1
T     0.4   0.1   0.1   0     1     0.1   0.1   0.2   0.6
Position-Specific Scoring Matrix
- What is the probability of finding CAGGTTGGA?
- It is the product of the relative frequency of each nucleotide at its
  position:
  - C is 0.2 at position 1, A is 0.6 at position 2, etc.
  - 0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042

      1     2     3     4     5     6     7     8     9
A     0.3   0.6   0.1   0     0     0.6   0.7   0.2   0.1
C     0.2   0.2   0.1   0     0     0.2   0.1   0.1   0.2
G     0.1   0.1   0.7   1     0     0.1   0.1   0.5   0.1
T     0.4   0.1   0.1   0     1     0.1   0.1   0.2   0.6
HMM (Hidden Markov Model)
HMMs and Their Usage
- HMMs are very common in computational linguistics:
  - Speech recognition (observed: acoustic signal, hidden: words)
  - Handwriting recognition (observed: image, hidden: words)
  - Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
  - Machine translation (observed: foreign words, hidden: words in the
    target language)
Hidden Markov Model (HMM)
- HMMs allow you to estimate probabilities of unobserved events
- Given plain text, which underlying parameters generated the surface?
- E.g., in speech recognition, the observed data is the acoustic signal
  and the words are the hidden parameters
Markov Chains
- Given a finite discrete set S of possible states, a Markov chain
  process occupies one of these states at each unit of time.
- The process either stays in the same state or moves to some other
  state in S.
- This occurs in a stochastic way, rather than in a deterministic one.
A simple example
- Consider a 3-state Markov model of the weather. We assume that once a
  day the weather is observed as being one of the following: rainy or
  snowy, cloudy, sunny.
- We postulate that on day t the weather is characterized by exactly one
  of the three states above, and give ourselves a transition probability
  matrix A (rows: today's state, columns: tomorrow's state, both in the
  order listed above):

      | 0.4  0.3  0.3 |
  A = | 0.2  0.6  0.2 |
      | 0.1  0.1  0.8 |

- Given that the weather on day 1 is sunny, what is the probability that
  the weather for the next 7 days will be
  “sun-sun-rain-rain-sun-cloudy-sun”?
Hidden?
- What if each state does not correspond to an observable (physical)
  event?
The Structure of a Profile HMM
[Figure: profile HMM architecture]
- Squares: main states
- Diamonds: insert states
- Circles: delete states, silent states
A Hidden Markov Model
[Figure: HMM with nodes 1 to 6 and an insertion node]
First Three and Last Three Columns
- Column 1: 4 A’s and 1 T
  - probability for A is 0.8
  - probability for T is 0.2

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Insertions

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

- Columns 4, 5, 6 are the insertions
- At the fourth column, 3 out of 5 sequences have insertions
  - the probability of transition from the third node to the insertion
    node is 0.6
- In the insertion node: 1 A, 2 C’s, 1 G, 1 T
  - the probabilities of A, C, G, T are 0.2, 0.4, 0.2, 0.2
Two Uses of a Markov Model
- Generate sequences according to the probabilities
- Compute the probability of a sequence
A Markov Model Generating Random DNA Sequences
[Figure: Markov model with states A, C, G, T between a begin state and an
end state]
A Good Introduction to HMM
- The examples in the following slides are taken from:
  - An introduction to hidden Markov models for biological sequences
  - Anders Krogh
  - In Computational Methods in Molecular Biology, edited by
    S. L. Salzberg, D. B. Searls, and S. Kasif, pages 45-63, Elsevier, 1998
  - http://www.binf.ku.dk/users/krogh/publications/ps/Krogh98a.pdf

```
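The PSSM construction steps in the slides (absolute frequencies, relative frequencies, then the probability of a candidate sequence) can be sketched in a few lines of Python. This is a minimal illustration using the 10 donor-site sequences from the slides; the function names `count_matrix`, `frequency_matrix`, and `sequence_probability` are hypothetical helpers introduced here, not part of any library.

```python
from collections import Counter
from math import prod

# The 10 vertebrate donor-site sequences from the slides,
# aligned at the exon/intron boundary.
SEQS = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
ALPHABET = "ACGT"


def count_matrix(seqs):
    """Absolute frequency of each nucleotide in each alignment column."""
    # zip(*seqs) transposes the alignment: one tuple per position.
    return [Counter(column) for column in zip(*seqs)]


def frequency_matrix(seqs):
    """Relative frequency: column counts divided by the number of sequences."""
    n = len(seqs)
    return [{nt: col[nt] / n for nt in ALPHABET} for col in count_matrix(seqs)]


def sequence_probability(seq, freqs):
    """Product of the relative frequencies of each nucleotide at its position."""
    return prod(freqs[i][nt] for i, nt in enumerate(seq))


freqs = frequency_matrix(SEQS)
p = sequence_probability("CAGGTTGGA", freqs)
print(f"{p:.6f}")  # prints 0.000042
```

With these raw frequencies, any sequence that lacks G at position 4 or T at position 5 gets probability exactly 0, because those nucleotides are the only ones observed in those columns.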
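The question posed in the 3-state weather example (the probability of a given 7-day sequence, knowing that day 1 is sunny) is the "compute the probability of a sequence" use of a Markov model. The sketch below works under two assumptions that the slides only imply: the rows and columns of the transition matrix follow the order rainy/snowy, cloudy, sunny, and entry `A[i][j]` is the probability of moving from state `i` today to state `j` tomorrow. The helper `path_probability` is a hypothetical name used for illustration.

```python
# Transition matrix A from the weather slide.
# Assumption: rows/columns are ordered rain (rainy or snowy), cloudy, sun,
# and A[i][j] = P(state j tomorrow | state i today).
STATES = ["rain", "cloudy", "sun"]
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]


def path_probability(start, path):
    """Probability of observing `path` on the days following `start`
    in a first-order Markov chain with transition matrix A."""
    index = {name: i for i, name in enumerate(STATES)}
    prob = 1.0
    current = index[start]
    for name in path:
        nxt = index[name]
        prob *= A[current][nxt]  # one transition per day
        current = nxt
    return prob


next_week = ["sun", "sun", "rain", "rain", "sun", "cloudy", "sun"]
p = path_probability("sun", next_week)
print(f"{p:.3e}")  # prints 1.536e-04
```

Under these assumptions the product is 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2, since each day depends only on the state of the day before.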