PSSM and HMM

Scoring Matrices

• Different types of matrices are used:
  ◦ PSSM = Position-Specific Scoring Matrix
  ◦ PAM matrices
  ◦ BLOSUM = BLOck SUbstitution Matrix
Position-Specific Scoring Matrix

• A PSSM is a motif descriptor
• The descriptor includes a weight (score, probability) for each symbol occurring at each position along the motif
• Examples of motifs: protein active sites, structural elements, zinc fingers, intron/exon boundaries, transcription-factor binding sites, etc.
Position-Specific Scoring Matrix

Construction of a PSSM is a multi-stage process:
  1. Choose the architecture of the matrix
  2. Create the multiple alignment from which the matrix is derived
  3. Calculate the frequencies for each position
  4. Apply BLAST to the PSSM
Position-Specific Scoring Matrix
• 10 vertebrate donor site sequences aligned at the exon/intron boundary
 seq 1    GAGGTAAAC
 seq 2    TCCGTAAGT
 seq 3    CAGGTTGGA
 seq 4    ACAGTCAGT
 seq 5    TAGGTCATT
 seq 6    TAGGTACTG
 seq 7    ATGGTAACT
 seq 8    CAGGTATAC
 seq 9    TGTGTGAGT
 seq 10   AAGGTAAGT
Position-Specific Scoring Matrix

• Calculate the absolute frequency of each nucleotide at each position:

          1   2   3   4    5    6   7   8   9
      A   3   6   1   0    0    6   7   2   1
      C   2   2   1   0    0    2   1   1   2
      G   1   1   7   10   0    1   1   5   1
      T   4   1   1   0    10   1   1   2   6
Position-Specific Scoring Matrix

• Calculate the relative frequency of each nucleotide at each position (each absolute count divided by 10, the number of sequences):

          1     2     3     4   5   6     7     8     9
      A   0.3   0.6   0.1   0   0   0.6   0.7   0.2   0.1
      C   0.2   0.2   0.1   0   0   0.2   0.1   0.1   0.2
      G   0.1   0.1   0.7   1   0   0.1   0.1   0.5   0.1
      T   0.4   0.1   0.1   0   1   0.1   0.1   0.2   0.6
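
The two tables above can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original slides; it hard-codes the ten donor-site sequences shown earlier.

from collections import Counter

seqs = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]

# Absolute frequencies: one Counter per alignment position.
counts = [Counter(seq[i] for seq in seqs) for i in range(len(seqs[0]))]
print(counts[3]["G"])   # 10 (every sequence has G at position 4)

# Relative frequencies: divide each count by the number of sequences.
pssm = {nt: [c[nt] / len(seqs) for c in counts] for nt in "ACGT"}
print(pssm["A"])        # [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1]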
Position-Specific Scoring Matrix

• What is the probability of finding CAGGTTGGA?
  ◦ It is the product of the frequency of each nucleotide at each position:
  ◦ C is 0.2 at position 1, A is 0.6 at position 2, etc.
    -> 0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042
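
The same product can be computed programmatically. A minimal sketch, assuming the `pssm` dictionary built in the previous snippet; note that practical tools usually work with log-odds scores rather than raw probabilities, to avoid underflow on long motifs.

def pssm_probability(seq, pssm):
    """Multiply the per-position relative frequencies."""
    p = 1.0
    for i, nt in enumerate(seq):
        p *= pssm[nt][i]
    return p

print(pssm_probability("CAGGTTGGA", pssm))   # ~4.2e-05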
HMM (hidden Markov model)
HMMs and their Usage

• HMMs are very common in computational linguistics:
  ◦ Speech recognition (observed: acoustic signal, hidden: words)
  ◦ Handwriting recognition (observed: image, hidden: words)
  ◦ Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
  ◦ Machine translation (observed: foreign words, hidden: words in target language)
Hidden Markov Model (HMM)

• HMMs allow you to estimate the probabilities of unobserved events
• Given plain text, which underlying parameters generated the surface form?
• E.g., in speech recognition the observed data is the acoustic signal, and the words are the hidden parameters
Markov Chains

• Given a finite discrete set S of possible states, a Markov chain process occupies one of these states at each unit of time.
• The process either stays in the same state or moves to some other state in S.
• It does so in a stochastic way, rather than in a deterministic one.
A simple example

• Consider a 3-state Markov model of the weather. We assume that once a day the weather is observed as being one of the following: rainy or snowy, cloudy, sunny.
• We postulate that on day t the weather is characterized by a single one of the three states above, and give ourselves a transition probability matrix A given by:

          | 0.4  0.3  0.3 |
      A = | 0.2  0.6  0.2 |
          | 0.1  0.1  0.8 |
• Given that the weather on day 1 is sunny, what is the probability that the weather for the next 7 days will be "sun-sun-rain-rain-sun-cloudy-sun"?
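
A worked answer as a minimal Python sketch (not on the slide), assuming the rows and columns of A follow the order in which the states are listed above: rain/snow, cloudy, sunny.

A = [
    [0.4, 0.3, 0.3],   # rain/snow -> rain/snow, cloudy, sunny
    [0.2, 0.6, 0.2],   # cloudy    -> ...
    [0.1, 0.1, 0.8],   # sunny     -> ...
]

RAIN, CLOUDY, SUN = 0, 1, 2
days = [SUN, SUN, SUN, RAIN, RAIN, SUN, CLOUDY, SUN]  # day 1 + the next 7 days

p = 1.0   # day 1 is given as sunny, so it contributes probability 1
for prev, cur in zip(days, days[1:]):
    p *= A[prev][cur]
print(p)   # ~1.536e-04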
Hidden?
• What if each state does not correspond to an observable (physical) event?
The Structure of a Profile HMM

[Figure: the structure of a profile HMM]
  Squares: main states
  Diamonds: insert states
  Circles: delete states (silent states)
A Hidden Markov Model

[Figure: an HMM with six main nodes (node 1 through node 6) and an insertion node between node 3 and node 4; see the Insertions slide below]
First Three and Last Three Columns

                 A C A - - - A T G
                 T C A A C T A T C
                 A C A C - - A G C
                 A G A - - - A T C
                 A C C G - - A T C

• Column 1: 4 A's and 1 T
  ◦ the probability of A is 0.8
  ◦ the probability of T is 0.2

Insertions

• Columns 4, 5, and 6 are the insertions
• At the fourth column, 3 out of 5 sequences have insertions
  ◦ the probability of transition from the third node to the insertion node is 3/5 = 0.6
• In the insertion node there are 1 A, 2 C's, 1 G, and 1 T
  ◦ the probabilities of A, C, G, T are 0.2, 0.4, 0.2, 0.2
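
These numbers can be recovered from the alignment programmatically. A minimal sketch, not from the slides; the alignment is hard-coded, and the insert columns are 4-6 in the text's 1-based numbering (indices 3-5 here).

from collections import Counter

alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
insert_cols = range(3, 6)   # columns 4, 5, 6 of the alignment

# Transition from main node 3 to the insert node: fraction of
# sequences with at least one inserted (non-gap) symbol.
has_insert = sum(any(s[i] != "-" for i in insert_cols) for s in alignment)
print(has_insert / len(alignment))   # 0.6

# Insert-node emission probabilities: pool all inserted symbols.
inserted = Counter(s[i] for s in alignment for i in insert_cols if s[i] != "-")
total = sum(inserted.values())
print({nt: inserted[nt] / total for nt in "ACGT"})   # A,C,G,T: 0.2, 0.4, 0.2, 0.2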
Two Uses of a Markov Model

• Generate sequences according to the probabilities
• Compute the probability of a sequence
A Markov Model Generating Random DNA Sequences

[Figure: a Markov model with a begin state, states A, C, G, and T, and an end state, connected by transition arrows]
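
Both uses can be sketched for the DNA model above. This is a minimal sketch, not from the slides: the figure's actual transition probabilities are not given in the text, so uniform placeholder values (0.25 everywhere) are assumed, and a fixed output length stands in for the end state.

import random

STATES = "ACGT"
START = {a: 0.25 for a in STATES}                        # assumed values
TRANS = {a: {b: 0.25 for b in STATES} for a in STATES}   # assumed values

def generate(length):
    """Use 1: generate a sequence according to the probabilities."""
    seq = random.choices(STATES, weights=[START[a] for a in STATES])[0]
    while len(seq) < length:
        prev = seq[-1]
        seq += random.choices(STATES, weights=[TRANS[prev][b] for b in STATES])[0]
    return seq

def probability(seq):
    """Use 2: compute the probability of a sequence under the model."""
    p = START[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= TRANS[prev][cur]
    return p

print(generate(9))
print(probability("GAGGTAAAC"))   # 0.25 ** 9 under the uniform placeholders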
A Good Introduction to HMMs

• The examples in these slides are taken from:

  Anders Krogh, "An introduction to hidden Markov models for biological sequences". In Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls, and S. Kasif, pages 45-63. Elsevier, 1998.
  http://www.binf.ku.dk/users/krogh/publications/ps/Krogh98a.pdf

				