Sequence Alignment – Scoring
Functions, N-W and S-W Affine Gap
Penalties
Saurabh Sinha
02/05/2008
Department of Computer Science
University of Illinois Urbana-Champaign
Scribed By: Chandrasekar Ramachandran
Contents
Introduction
Interpretations
Types of Alignments
Techniques for Solving
Dynamic Programming
Probabilistic Methods
Scoring Functions
N-W and S-W Affine Gap Penalties
Introduction
Sequence Alignment:
Ways of Arranging one sequence (DNA, RNA, Protein)
on another to determine whether a region has been
conserved in evolution or has a common evolutionary
origin
Strings of Letters
Matrix Representation:
- G G C C A G G A T T
G G G C C - G G - T T
Interpretations
Mismatches?
Point Mutations: Replacement of a Single Base
Nucleotide
Categorized as Transitions and Transversions
Gaps?
Indels or Insertion/Deletion Mutations
Can produce Frameshift Mutations Unless the Indel Length is a
Multiple of 3
Introduced in one or both lineages
Interpretations (Contd.)
What about Amino Acids?
Degree of Similarity
Estimates Conservation
If Conservation is High:
Indicates Region of High Importance
Estimating Similar Functional Roles:
By Assessing Similarity of Base Pairing
Solving Sequence Alignment
Problems
Dynamic Programming
Initialization
Matrix Fill or Scoring
Traceback
Probabilistic Methods
Bayesian Methods for HMM
Likelihood Derivatives and Fisher Scores
Training and Model Comparison
Needleman-Wunsch
Algorithm (Global Alignment)
Scores for Aligned Characters Specified by a
Similarity Matrix
Example:
Sequence 1: -CCGCTTACCTA
Sequence 2: TTCCGCTTATTA
Possible Alignments:
Sequence 1: -CCGCTTACCTA
Sequence 2: -CCGCTTA----
Score Matches, Mismatches and Gaps (Indels) Separately
Global Alignment (Contd.)
The Scoring Matrix is Called the F-Matrix
Each (i,j) Entry is Denoted by Fij
Running Time:
For Sequences of size a and b, O(ab)
Summary:
Initialization: Fill in Base Cases in Topmost Row and
Leftmost Column
Filling Partial Alignments: Fill in the Remaining Entries Fij
Traceback: Trace Back Through the Pointer Matrix to the
Initial Cell to Get the Best Solution
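The three steps summarized above (initialization, matrix fill, traceback) can be sketched in Python. This is a minimal illustration rather than the lecture's own code: the function name `needleman_wunsch` and the scores (match +1, mismatch -1, linear gap penalty 2) are assumptions chosen only for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Global alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # ptr[i][j] records which neighbor produced F[i][j]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialization: base cases in the leftmost column and topmost row
    for i in range(1, n + 1):
        F[i][0] = -gap * i
        ptr[i][0] = "up"
    for j in range(1, m + 1):
        F[0][j] = -gap * j
        ptr[0][j] = "left"

    # Matrix fill: each entry depends on three already-computed neighbors
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            choices = [
                (F[i - 1][j - 1] + s, "diag"),  # align x[i-1] with y[j-1]
                (F[i - 1][j] - gap, "up"),      # x[i-1] against a gap
                (F[i][j - 1] - gap, "left"),    # y[j-1] against a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Traceback: follow pointers from (n, m) back to (0, 0)
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif move == "up":
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Both loops of the fill step visit every cell once, which is where the O(ab) running time comes from. Ties are broken in favor of the diagonal move here; other tie-breaking rules give equally optimal alignments.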
Smith-Waterman Algorithm (Local
Alignment)
Aligns Stretches Shorter than the Entire
Sequence Length
Generally Used for Sequences which are
Significantly Dissimilar Overall
Negative Scoring Matrix Cells are Set to Zero
Backtracking starts at highest scoring cell and
continues to a cell with zero score
Prerequisite: The Expected Score of a Random Alignment Must be Negative
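The local alignment rules above (clamp negative cells to zero, start the traceback at the highest-scoring cell, stop at a zero cell) can be sketched as follows. The function name `smith_waterman` and the scores (match +2, mismatch -1, linear gap penalty 2) are assumptions made for this illustration.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=2):
    """Local alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 sets would-be negative cells to zero
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

With these scores, `smith_waterman("TTTACGTAAA", "GGACGTCC")` returns `(8, "ACGT", "ACGT")`: only the shared stretch is reported, and the dissimilar flanks are ignored.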
Scoring Functions - Overview
Given sequences, a number is associated with
each alignment
E.g. Matches: +x, Mismatches: -y, Gaps: -z
Scoring Function: (x × #Matches) - (y × #Mismatches) -
(z × #Gaps)
Alignment Scores:
Sum of Substitution Scores and Gap Penalties
Residue-Based
Substitution Matrices:
Protein
Evolutionary
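As a small worked example of the match/mismatch/gap scoring function from this overview, the sketch below counts the columns of an alignment and applies (x × #Matches) - (y × #Mismatches) - (z × #Gaps). The helper names `count_columns` and `alignment_score` and the default weights x=1, y=1, z=2 are assumptions for illustration; the test alignment is the one from the Introduction.

```python
def count_columns(row1, row2):
    """Return (#matches, #mismatches, #gaps) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap character
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(row1, row2, x=1, y=1, z=2):
    """Score = x * #matches - y * #mismatches - z * #gaps."""
    matches, mismatches, gaps = count_columns(row1, row2)
    return x * matches - y * mismatches - z * gaps


# Alignment from the Introduction: 8 matches, 0 mismatches, 3 gaps
print(count_columns("-GGCCAGGATT", "GGGCC-GG-TT"))    # (8, 0, 3)
print(alignment_score("-GGCCAGGATT", "GGGCC-GG-TT"))  # 8*1 - 0*1 - 3*2 = 2
```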
Simple Substitution Matrices
Expresses How Likely One Character in a
Sequence is to Change into Each Other Character
State
N × N Matrix where N = 4 for DNA and N = 20
for Amino Acids
Another Way Would be to Consider A, G as
Purines and T, C as Pyrimidines
Substitutions Between a Purine and a Pyrimidine (Transversions) are
Less Likely than Substitutions Within a Class (Transitions)
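A 4 × 4 DNA substitution matrix that uses the purine/pyrimidine grouping might be built as in the sketch below. The numeric values (match +1, transition -1, transversion -2) and the names `substitution_score` and `MATRIX` are assumptions for illustration only; the one property relied on is that a change within a class (a transition) is penalized less than a change across classes (a transversion).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def substitution_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of DNA bases (illustrative values)."""
    if a == b:
        return match
    pair = {a, b}
    # Transition: both bases are purines, or both are pyrimidines
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    # Transversion: one purine and one pyrimidine
    return transversion


# The full N x N matrix (N = 4), stored as a dictionary keyed by base pairs
MATRIX = {(a, b): substitution_score(a, b) for a in BASES for b in BASES}

# Print the matrix row by row
print("   " + "  ".join(f"{b:>2}" for b in BASES))
for a in BASES:
    print(f"{a}  " + "  ".join(f"{MATRIX[(a, b)]:>2}" for b in BASES))
```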
Minimum Entropy Scoring
Function
Minimum Entropy Score:
Sum of Entropy Scores Computed For Each Column
$S(m_i) = -\sum_a c_{ia} \log_2 p_{ia}$
Here,
$i$ is a column
$c_{ia}$ is the count of letter $a$ in column $i$
$p_{ia}$ is the inferred probability of letter $a$ in column $i$
Gap Characters: Treated as Residue Symbols
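The minimum entropy score can be computed column by column, as in the sketch below. It assumes that the inferred probability of a letter is simply its observed frequency in the column (a maximum-likelihood estimate, which the slides do not specify) and treats the gap character as one more residue symbol. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """-sum_a c_a * log2(p_a), with p_a estimated as c_a / (column size)."""
    counts = Counter(column)
    total = len(column)
    return -sum(c * math.log2(c / total) for c in counts.values())


def minimum_entropy_score(rows):
    """Sum of the column scores over a multiple alignment given as rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    # zip(*rows) turns the list of rows into a sequence of columns
    return sum(column_entropy_score(col) for col in zip(*rows))


# A fully conserved column scores 0; a 2-vs-2 split scores 4 bits
print(column_entropy_score("AAAA"))
print(column_entropy_score("AACC"))
print(minimum_entropy_score(["AA", "AC", "AC", "AA"]))
```

Lower totals correspond to more conserved columns, so a better alignment has a smaller score under this function.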
Gap Functions
Gaps More Likely to Occur in Groups
Examples:
Convex Gap Scoring Functions
Affine Gap Functions
Convex Gap Scoring Functions:
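The Penalty for Each Additional Gap Position Decreases (or Stays the Same) as Gaps Get Longer
$\gamma(n)$: for all $n$, $\gamma(n+1) - \gamma(n) \le \gamma(n) - \gamma(n-1)$
Now
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k=0,\dots,i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0,\dots,j-1} F(i,k) - \gamma(j-k) \end{cases}$$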
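The recurrence above can be turned into code directly. The sketch below assumes a hypothetical logarithmic gap function, gamma(n) = 2 + ln(n), as one example of a function whose increments shrink with n; the names `gamma`, `is_convex` and `general_gap_score` and the match/mismatch scores are likewise assumptions. The two inner loops over k are what make the fill step cubic.

```python
import math


def gamma(n, a=2.0, b=1.0):
    """Hypothetical gap penalty a + b*ln(n) for a gap of length n >= 1."""
    return a + b * math.log(n)


def is_convex(g, max_n=50):
    """Check g(n+1) - g(n) <= g(n) - g(n-1) for n = 2 .. max_n - 1."""
    return all(g(n + 1) - g(n) <= g(n) - g(n - 1) + 1e-12
               for n in range(2, max_n))


def general_gap_score(x, y, g=gamma, match=1, mismatch=-1):
    """Global alignment score with an arbitrary gap function g."""
    n, m = len(x), len(y)
    F = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0.0
    for i in range(1, n + 1):
        F[i][0] = -g(i)          # one leading gap of length i
    for j in range(1, m + 1):
        F[0][j] = -g(j)          # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            for k in range(i):   # gap of length i - k ending at row i
                best = max(best, F[k][j] - g(i - k))
            for k in range(j):   # gap of length j - k ending at column j
                best = max(best, F[i][k] - g(j - k))
            F[i][j] = best
    return F[n][m]


print(is_convex(gamma))                  # True for the logarithmic example
print(general_gap_score("AAAA", "A"))    # one match plus one gap of length 3
```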
Affine Gap Functions
Shortcomings of a general gap penalty function:
Different Penalties for Additional Gaps
Cubic Time for Updating Entries
Example:
First Gap Position Penalized Differently, Subsequent Gap Positions Penalized
Linearly
$\gamma(x) = e + (x - 1)d;\ x \ge 1$
3 Matrices Computed Simultaneously
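A sketch of the three-matrix computation follows. It assumes the affine penalty has the form gamma(x) = e + (x - 1)d, with e charged for the first position of a gap and d for each additional position; the matrix names `M`, `Ix`, `Iy`, the function name and the numeric defaults are assumptions for illustration. Each cell is filled in constant time, so the cubic cost of a general gap function drops back to O(ab).

```python
def affine_gap_score(x, y, e=3, d=1, match=1, mismatch=-1):
    """Global alignment score with gap penalty e + (length - 1) * d."""
    n, m = len(x), len(y)
    neg = float("-inf")
    # M:  best score with x[i-1] aligned to y[j-1]
    # Ix: best score with x[i-1] aligned to a gap
    # Iy: best score with y[j-1] aligned to a gap
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    Ix = [[neg] * (m + 1) for _ in range(n + 1)]
    Iy = [[neg] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -(e + (i - 1) * d)
    for j in range(1, m + 1):
        Iy[0][j] = -(e + (j - 1) * d)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1],
                          Ix[i - 1][j - 1],
                          Iy[i - 1][j - 1]) + s
            # Start a new gap (cost e) or lengthen an existing one (cost d)
            Ix[i][j] = max(M[i - 1][j] - e,
                           Iy[i - 1][j] - e,
                           Ix[i - 1][j] - d)
            Iy[i][j] = max(M[i][j - 1] - e,
                           Ix[i][j - 1] - e,
                           Iy[i][j - 1] - d)

    return max(M[n][m], Ix[n][m], Iy[n][m])


# One match plus a single gap of length 3: 1 - (3 + 2*1) = -4
print(affine_gap_score("AAAA", "A"))
```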
References
1. http://webcourse.cs.technion.ac.il/236522/Winter2005-2006/ho/WCFiles/tutorial03.ppt
2. http://engr.smu.edu/~saad/courses/cse8354/lectures/lecture6.pdf
3. http://www.bioinfo.org.cn/lectures/index-13.html
4. Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.
5. Smith, T.F. and Waterman, M.S. (1981) Comparison of Biosequences. Adv. Appl. Math., 2,
482-489.
6. Dayhoff, M.O., Barker, W.C. and Hunt, L.T. (1983) Establishing Homologies in Protein Sequences.
Methods Enzymol., 91, 524-545.
7. Gotoh, O. (1982) An Improved Algorithm for Matching Biological Sequences. J. Mol. Biol., 162,
705-708.