# PairSeqAlgorithms.ppt - Marquette University


```
Algorithms for Pairwise Sequence Alignment

Craig A. Struble, Ph.D.
Marquette University
Overview
   Pairwise Sequence Alignment
   Dynamic Programming Solution
   Global Alignment
   Local Alignment

BIIN 200: Bioinformatics I - Pairwise Sequence Alignment   2
Goals
   Define the pairwise sequence alignment problem
   Understand the difference between global and local alignment
   Understand dot matrix analysis
   Introduce and understand dynamic programming and its application to pairwise sequence alignment

Pairwise Sequence Alignment
   Problem
   Given two sequences (DNA or AA), “line them up” in a biologically meaningful way.

   HEAGAWGHEE          HEAGAWGHE-E
   PAWHEAE             P-A--W-HEAE

Origins Of Similar Sequences
   [Figure: origins of similar sequences. Labels: duplication (a into a1 and a2), speciation (Species 1 and Species 2, each carrying a1 and a2), transfer, convergence]

Why is comparing sequences important?
   One of the fundamental phenomena explored by bioinformatics, around which many tools are built
   Databases, data selection, etc.
   Researchers compare sequences in order to:
   infer the function of genes
   infer the structure of genes and gene products
   infer the evolutionary history of genes and organisms
   identify variation responsible for disease and other complex phenotypes

Why is this a challenging problem?
   Similar sequences contain variation
   Sequences mutate over time
   Mutations are spontaneous changes in sequence caused by replication (or other) errors. Mutation rates vary, and can be influenced by many factors.
   Sequence data contains errors
   Sequencing techniques are imperfect

Four Basic Types of Mutations
A. Substitution
   Thr Tyr Leu Leu
   ACC TAT TTG CTG

   ACC TCT TTG CTG
   Thr Ser Leu Leu

B. Deletion
   Thr Tyr Leu Leu
   ACC TAT TTG CTG

   ACC TAT TGC TG-
   Thr Tyr Cys

C. Insertion
   Thr Tyr Leu Leu
   ACC TAT TTG CTG

   ACC TAC TTT GCT G--
   Thr Tyr Phe Ala

D. Inversion
   Thr Tyr Leu Leu
   ACC TAT TTG CTG

   ACC TTT ATG CTG
   Thr Phe Met Leu

Influences on Variation
   Rates of mutations are influenced by:
   Substitution class (transition/transversion)
   Coding site (synonymous/nonsynonymous)
   Length of insertion/deletion
   Codon usage bias
   Nucleotide composition (GC content)
   Stability & fate of variation depend upon:
   Drift
   Selection (positive Darwinian/purifying, sexual, artificial)
   Other mutations (reversions are not uncommon)

Homology vs. Similarity
   Homology is a discrete state pertaining to relatedness: two genes are homologues if and only if they share a common ancestral gene
   Orthologues: in different organisms, a result of speciation
   Paralogues: in the same organism, a result of gene duplication
   Homologues may have the same, similar, or different functions
   Similarity is a continuous state describing the degree to which two homologues share characteristics
   Generally a percentage
   Distance estimates are also estimates of similarity

Kinds of Alignments
   The local alignment includes only regions of identity (or strong similarity). This favors finding conserved regions.
   The global alignment is stretched over the entire sequence length, including as many matches as possible.

When do you choose local vs. global?
   Choose local alignment when
   DNA sequences encode genes with introns
   Amino acid sequences encoding proteins
   Choose a global alignment when
   Sequences can be seen to be very similar
   Similar regions are in the same order and orientation

Methods Of Sequence Alignment
   Dot matrix analysis
   Dynamic programming algorithms
   Word or k-tuple methods
   BLAST, FASTA
   Discussed later in the semester

Dot Matrix Analysis
   Visualization of sequence similarity
   First technique to use on pairs of sequences
   Insertions/deletions
   Inverted repeats
   Does not show actual alignment
   Optimal alignment not obvious

Simple Dot Matrix Example:
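For sequences:
   a) ATGCGTCGTT
   b) ATCCGCGAT
Steps
   1. Arrange sequences on a matrix
   2. Place a dot anywhere nucleotides match
   3. Diagonal stretches (indicated by a line on the slide) are areas of alignment
   4. More than one area of alignment can appear

     A T G C G T C G T T
   A * . . . . . . . . .
   T . * . . . * . . * *
   C . . . * . . * . . .
   C . . . * . . * . . .
   G . . * . * . . * . .
   C . . . * . . * . . .
   G . . * . * . . * . .
   A * . . . . . . . . .
   T . * . . . * . . * *
   (* = dot where nucleotides match, . = no match)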

DNA sequence matrix: Noisy
   Sequence alignment of 2 long DNA sequences
   Many random matches make it difficult or impossible to find areas of alignment
   Using a window & stringency setting, we can eliminate some of the noise

DNA sequence matrix: Less noisy
   To decrease noise of random matches, a window of 11 nucleotides was defined, and a dot placed when at least 7 matches occur
   Window = 11, Stringency = 7
   Some diagonal lines begin to appear

DNA sequence matrix: Less noisy
   To decrease noise of random matches, a window of 23 nucleotides was defined, and a dot placed when at least 15 matches occur
   Window = 23, Stringency = 15
   A clear diagonal line appears, indicating an area of alignment
   A few other areas are still apparent, probably long random matches

Protein sequence matrix: Noisy
   Sequence comparison of amino acid sequence (same gene as previous example)
   Window = 1, stringency = 1
   To decrease noise due to random matches, conditions can be tightened

Protein sequence matrix: Less noisy
   Same sequence comparison, tighter analysis conditions
   Window = 3, stringency = 2
   A single aligned region is visible, with a number of areas of random matches

Evidence of repeats in a DNA sequence
   [Figure: two dot matrix panels. Left: Window 1, stringency 1. Right: Window 23, stringency 7]

Programs for Dot Matrix Analysis
   DNA Strider (Macintosh)
   Dotter (Unix/Linux, X-Windows)
   In the lab
   DOT plots in EMBOSS
   In the lab
   PLALIGN (FASTA)
   Plots alignments found by DP method
   Dotlet
   http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Optimal Sequence Alignments
   Example
HEAGAWGHEE
PAWHEAE

HEAGAWGHE-E                                   HEAGAWGHE-E
P-A--W-HEAE                                   --P-AW-HEAE

   Which one is better?

Scoring
   To compare two sequence alignments, calculate a score
   Scoring matrix
   Provide a score for each match/mismatch
   Sometimes a mismatch is acceptable
   PAM, BLOSUM are two classes of scoring matrices
   Gap penalty
   Initiating a gap
   Gap extension penalty
   Extending a gap

Scoring Matrix Example
        A    E    G    H    W
   A    5   -1    0   -2   -3
   E   -1    6   -3    0   -3
   H   -2    0   -2   10   -3
   P   -1   -1   -2   -2   -4
   W   -3   -3   -3   -3   15

   Gap penalty: -8
   Gap extension: -4

   HEAGAWGHE-E
   --P-AW-HEAE
   (-8) + (-4) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 5
   (one term per column; the first gap spans two columns, so it costs -8 to open and -4 to extend)

   Exercise: Calculate for
   HEAGAWGHE-E
   P-A--W-HEAE

Formal Description
   Problem: PairSeqAlign
   Input: Two sequences x, y
          Scoring matrix s
          Gap penalty d
          Gap extension penalty e

   Output: The optimal sequence alignment

How Difficult Is This?
   Consider two sequences of length n
   There are
   \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}
   possible global alignments, and we need to find an optimal one from amongst those!

So what?
   So at n = 20, we have over 120 billion possible alignments
   We want to be able to align much, much longer sequences
   Some proteins have 1000 amino acids
   Genes can have several thousand base pairs

Dynamic Programming
   General algorithmic development technique
   Reuses the results of previous computations
   Store intermediate results in a table for reuse
   Look up in table for earlier result to build from

Global Alignment
   Needleman-Wunsch 1970
   Idea: Build up optimal alignment from optimal alignments of subsequences
   Three ways to align x1..i with y1..j

   Extend both strings      Align xi with a gap      xi already aligned,
   at the same time                                  align yj with a gap
   IGAxi                    AIGAxi                   GAxi--
   LGVyj                    GVyj--                   SLGVyj

Global Alignment
   Notation
   xi – ith letter of string x
   yj – jth letter of string y
   x1..i – Prefix of x from letters 1 through i
   F – matrix of optimal scores
   F(i,j) represents optimal score lining up x1..i with y1..j
   d – gap penalty
   s – scoring matrix

Global Alignment
   The work is to build up F
   Initialize: F(0,0) = 0, F(i,0) = id, F(0,j) = jd
   Fill from top left to bottom right using the recursive relation
   F(i,j) = \max \{ F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + d,\; F(i,j-1) + d \}
   Here d is a negative number (d = -8 in the example), so each gap lowers the score


Global Alignment
   F(i-1,j-1) to F(i,j): add s(xi,yj) (xi aligned to yj)
   F(i,j-1) to F(i,j): add d (yj aligned to gap)
   F(i-1,j) to F(i,j): add d (xi aligned to gap)

   While building the table, keep track of where each optimal score came from (the reverse arrows)

Example
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56

Completed Table
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Traceback
   Trace arrows back from the lower right to top left
   Diagonal – both
   Up – upper gap
   Left – lower gap

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

   HEAGAWGHE-E
   --P-AW-HEAE

Summary
   Uses recursion to fill in intermediate results table
   Uses O(nm) space and time
   O(n^2) algorithm when both sequences have length n
   Feasible for moderate sized sequences, but not for aligning whole genomes.

Local Alignment
   Smith-Waterman (1981)
   Another dynamic programming solution
   F(i,j) = \max \{ 0,\; F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + d,\; F(i,j-1) + d \}

Example
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26

Traceback
   Start at highest score and trace back to first 0

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26

   AWGHE
   AW-HE

Summary
   Similar to global alignment algorithm
   For this to work, the expected score of a match with a random sequence must be negative.
   Behavior is like global alignment otherwise
   Similar extensions for repeated and overlap matching
   Care must be given to gap penalties to maintain O(nm) time complexity

Scoring Matrices
   Substitutions
   Models of substitutions
   PAM
   BLOSUM
   Gap penalties

DNA

Transitional and Transversional Nucleotide Substitutions
   [Figure: pyrimidines C and T, purines A and G; transitions (rate α) occur within a class, transversions (rate β) between classes]
   α & β are rates of transitional and transversional substitutions, respectively
   Generally, α > β
   Possible substitutions (total = 16):
   Identical (freq = O): 4
   Transitions (P): 4
   Transversions (Q): 8
   Giving us:
   p = P + Q
   R = P/Q
   R is usually between 0.5 and 2 for nuclear genes, higher for mitochondrial genes (up to 15)

Synonymous and Non-synonymous substitutions
   Synonymous             Non-synonymous
   Thr Tyr Leu Leu        Thr Tyr Leu Leu
   ACC TAT TTG CTG        ACC TAT TTG CTG

   ACC TAC TTG CTG        ACC TCT TTG CTG
   Thr Tyr Leu Leu        Thr Ser Leu Leu

   Synonymous substitutions are more likely to occur
   Preserve AA

Categories of Amino Acids
   Basic     Acidic    Polar     Nonpolar
   Lys       Asp       Ser       Gly   Ile
   Arg       Glu       Thr       Ala   Pro
   His                 Tyr       Val   Cys
                       Asn       Leu   Met
                       Gln       Phe   Trp

   Grouped according to properties of side chain

Amino Acid Substitutions
   Tend to preserve chemical similarity
   Tend to preserve structure
   Tend to preserve function
   More frequent in non-functional domains

Models of Substitution
   Percent Accepted Mutation (PAM)
   Dayhoff 1978
   “Accepted Mutation”: changes accepted by natural selection
   PAM1 represents evolutionary divergence where 1% of amino acids change
   Blocks Amino Acid Substitution Matrices (BLOSUM)
   Henikoff and Henikoff 1992
   Observed AA substitutions in conserved AA blocks
   The number gives the maximum level of identity: BLOSUM62 represents 62% identity

PAM
   Markov model
   [Figure: one state per amino acid (S, T, C, P, …); arrows carry the probability of transitioning from one state to another, with pst = pts]

PAM
   Assumes substitutions are independent
   pxy is calculated from observations
   1572 changes in 71 groups of proteins
   Organized into phylogenetic trees
   Changes counted
   Divided by normalizing factor
   The probabilities are stored in a matrix
   Probability form
   PAM1 represents 10 my (million years) of evolutionary distance
   PAMN is derived as (PAM1)^N because a Markov model is used

PAM1 for DNA
   Uniform Model
        A        G        T        C
   A    0.99
   G    0.00333  0.99
   T    0.00333  0.00333  0.99
   C    0.00333  0.00333  0.00333  0.99

   Higher Transitions
        A        G        T        C
   A    0.99
   G    0.006    0.99
   T    0.002    0.002    0.99
   C    0.002    0.002    0.006    0.99

   [Figure: nucleotides A, G, C, T as states; self-transition 0.99, each change 0.00333 in the uniform model]

BLOSUM
   ~2000 conserved amino acid patterns
   Blocks: ungapped patterns
   3-60 AA long
   >500 families of related proteins
   Software
   MOTIF (H. Smith et al. 1990)
   PROTOMAT (Henikoff and Henikoff)

Computing BLOSUM Scores
   Example column from a block: A, L, A, S, A, A
   Consider all pairs (don’t know ancestor)
   fAA = 3+2+1 = 6; fAL = 4; fAS = 4; fLS = 1
   Calculate frequency of occurrence
   qAA = fAA/(fAA+fAL+fAS+fLS) = 6/15 = 0.4
   Calculate expected frequency of being in a pair
   pA = qAA + qAS/2 + qAL/2 = 0.67
   Calculate expected frequency of a pair
   eAA = pA*pA = 0.44
   Matrix entry for pair
   mAA = qAA/eAA = 0.9

Log Odds Scoring
   Each of the previous matrices is converted to a log odds matrix
   DP algorithm based on addition
   log(xy) = log(x) + log(y)
   Compares real occurrence with random occurrence.
   BLOSUM
   sAA = 2 * log2(qAA/eAA) = -0.304 (will be rounded)
   PAM1 DNA (uniform)
   sCT = log2(pC MCT / (pC pT))
       = log2(0.25 * 0.00333 / 0.25^2)
       = -6.23

The PAM250 Matrix
   [Figure: the PAM250 matrix, with amino acids grouped as: sulfhydryl; small hydrophilic; acid, acid amide and hydrophilic; basic; small hydrophobic; aromatic]
   Note:
   High values on diagonal
   High values for similar groups
   Each matrix value is calculated by first dividing the frequency of change, for each amino acid pair, in related proteins separated by one step in an evolutionary tree by the probability of a chance alignment based on the frequency of the amino acids. The ratios are expressed as logarithms to the base 10 (approx. 1/3 bit values).

The Blosum62 Matrix
   [Figure: the BLOSUM62 matrix, with amino acids grouped as: sulfhydryl; small hydrophilic; acid, acid amide and hydrophilic; basic; small hydrophobic; aromatic]
   Each entry is the actual frequency of occurrence of the amino acid pair in the blocks database, clustered at the 62% level, divided by the expected probability of occurrence. The expected value is calculated from the frequency of occurrence of each of the two individual amino acids in the blocks database, and provides a measure of a chance alignment of the two amino acids. The actual/expected ratio is expressed as a log odds. A zero score means that the frequency of the amino acid pair in the database was as expected by chance, a positive score that the pair was found more often than by chance, and a negative score that the pair was found less often than by chance.

Selecting Matrices
   PAM
   Mutational model of evolution
   Tracks evolutionary origins of proteins/sequences
   Use lower numbers for evolutionarily close sequences, higher numbers for distant sequences
   BLOSUM
   No model of evolution, conserved AA motifs
   Designed to find conserved domains
   Similar sequences, use higher numbers,
   Divergent sequences, use lower numbers.

GAP Penalties
   Recall d is gap opening penalty, e is gap extension penalty
   Total gap penalty for a gap of length x: wx = d + e(x-1)
   In order to make things work properly, need affine gap function (Smith et al. 1981)
   wx ≤ dx
   Any affine function works
   For the linear function above, e ≤ d
   Typical gap penalties (Mount p.142)
   BLOSUM50 d=15, e=8-15
   PAM250 d=15, e=5-15

Summary and Conclusions
   Similar sequences arise naturally
   Pairwise sequence alignment used to compare similarity of two sequences
   Dot matrix analysis is a visual technique for sequence alignment
   Dynamic programming is used for global and local alignments
   Scoring matrices based on biological assumptions


```
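
The dot matrix slides describe placing a dot when enough positions match inside a window. The sketch below is one possible reading of that rule, not the method used by any particular program: the function names `dot_matrix` and `render` are made up for illustration, and it assumes the window starts at each position (some tools center the window instead).

```python
def dot_matrix(a, b, window=1, stringency=1):
    """Return grid[i][j] == True when the windows starting at b[i] and a[j]
    agree in at least `stringency` of their `window` positions.
    Assumption: windows are anchored at their first position."""
    rows = len(b) - window + 1
    cols = len(a) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window) if b[i + k] == a[j + k])
            row.append(matches >= stringency)
        grid.append(row)
    return grid


def render(a, b, grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    lines = ["  " + " ".join(a[:len(grid[0])])]
    for ch, row in zip(b, grid):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Sequences from the "Simple Dot Matrix Example" slide.
    a, b = "ATGCGTCGTT", "ATCCGCGAT"
    print(render(a, b, dot_matrix(a, b)))
    print()
    # A stricter setting removes most isolated dots.
    print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))
```

With window 1 and stringency 1 every single matching nucleotide produces a dot; raising both values keeps only longer runs, which is the noise reduction the slides describe.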
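
The "Scoring Matrix Example" slide adds one term per alignment column. A minimal sketch of that bookkeeping follows, assuming the matrix values from the slide, a gap opening penalty `d` and a gap extension penalty `e`; the helper names `s` and `score_alignment` are hypothetical.

```python
# Values copied from the "Scoring Matrix Example" slide
# (rows A, E, H, P, W; columns A, E, G, H, W).
ROWS, COLS = "AEHPW", "AEGHW"
VALUES = [
    [5, -1, 0, -2, -3],
    [-1, 6, -3, 0, -3],
    [-2, 0, -2, 10, -3],
    [-1, -1, -2, -2, -4],
    [-3, -3, -3, -3, 15],
]
S = {(r, c): VALUES[i][k] for i, r in enumerate(ROWS) for k, c in enumerate(COLS)}


def s(a, b):
    """Substitution score; the matrix is symmetric, so try both orders."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def score_alignment(top, bottom, d=-8, e=-4):
    """Sum substitution scores and gap penalties column by column.
    The first column of a gap costs d, each following column costs e."""
    total = 0
    open_gap = None  # which row currently has an open gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            total += e if open_gap == row else d
            open_gap = row
        else:
            total += s(a, b)
            open_gap = None
    return total


if __name__ == "__main__":
    print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))          # 5
    # With e equal to d the gap cost is linear, as in the DP tables.
    print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", e=-8))    # 1
```

Setting `e` equal to `d` reproduces the linear gap cost used in the global alignment table, where the same alignment scores 1.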
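
The "How Difficult Is This?" and "So what?" slides quote a count of possible global alignments. The short check below evaluates the binomial coefficient exactly and compares it with the approximation from the slide; the function names are illustrative only.

```python
from math import comb, pi, sqrt


def alignment_count(n):
    """Exact value of C(2n, n), the count quoted on the slide."""
    return comb(2 * n, n)


def alignment_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


if __name__ == "__main__":
    print(alignment_count(20))  # 137846528820, i.e. over 120 billion
    print(alignment_estimate(20) / alignment_count(20))
```

Trying larger values of `n` shows how quickly the count grows, which is why enumerating alignments is not an option for sequences of realistic length.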
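
The global and local alignment slides fill a table F and then trace back. The following is a sketch of both recurrences under stated assumptions: a linear gap cost `d = -8` (as in the example tables), the scoring values from the "Scoring Matrix Example" slide, and a tie-breaking order (diagonal first) chosen here for illustration. The function names are hypothetical, and other tie-breaking rules can return different alignments with the same score.

```python
# Values copied from the "Scoring Matrix Example" slide
# (rows A, E, H, P, W; columns A, E, G, H, W).
ROWS, COLS = "AEHPW", "AEGHW"
VALUES = [
    [5, -1, 0, -2, -3],
    [-1, 6, -3, 0, -3],
    [-2, 0, -2, 10, -3],
    [-1, -1, -2, -2, -4],
    [-3, -3, -3, -3, 15],
]
S = {(r, c): VALUES[i][k] for i, r in enumerate(ROWS) for k, c in enumerate(COLS)}


def s(a, b):
    """Substitution score; the matrix is symmetric, so try both orders."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def fill(x, y, d, local):
    """Build F where F[i][j] scores x[:i] against y[:j]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:  # global alignment pays for leading gaps
        for i in range(1, n + 1):
            F[i][0] = i * d
        for j in range(1, m + 1):
            F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                       F[i - 1][j] + d,
                       F[i][j - 1] + d)
            F[i][j] = max(0, best) if local else best
    return F


def traceback(x, y, F, d, i, j, local):
    """Walk back from cell (i, j), preferring the diagonal move on ties."""
    top, bottom = [], []
    while (i > 0 and j > 0 and F[i][j] > 0) if local else (i > 0 or j > 0):
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + d:
            top.append("-"); bottom.append(y[j - 1]); j -= 1   # y_j aligned to gap
        else:
            top.append(x[i - 1]); bottom.append("-"); i -= 1   # x_i aligned to gap
    return "".join(reversed(top)), "".join(reversed(bottom))


def needleman_wunsch(x, y, d=-8):
    F = fill(x, y, d, local=False)
    return (F[len(x)][len(y)],) + traceback(x, y, F, d, len(x), len(y), False)


def smith_waterman(x, y, d=-8):
    F = fill(x, y, d, local=True)
    score, i, j = max((F[i][j], i, j) for i in range(len(x) + 1)
                      for j in range(len(y) + 1))
    return (score,) + traceback(x, y, F, d, i, j, True)


if __name__ == "__main__":
    print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

Sharing `fill` and `traceback` between the two functions highlights how small the difference is: the local version clamps each cell at 0, starts the traceback at the highest cell, and stops at the first 0.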
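
The "Computing BLOSUM Scores" and "Log Odds Scoring" slides work through one column of a block by hand. The sketch below repeats that arithmetic for a single column; the function names are hypothetical, and the factor 2 for unlike residues in the expected frequency is the usual BLOSUM convention rather than something shown on the slides, which only cover the A/A case.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(column):
    """Observed frequency q of each unordered residue pair in one column."""
    counts = Counter(tuple(sorted(pair)) for pair in combinations(column, 2))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def residue_frequency(q, residue):
    """Expected frequency p of a residue being in a pair:
    q for the identical pair plus half of every mixed pair containing it."""
    return sum(value if pair == (residue, residue) else value / 2
               for pair, value in q.items() if residue in pair)


def log_odds(column, a, b):
    """Score 2 * log2(q_ab / e_ab), i.e. in half-bit units."""
    q = pair_frequencies(column)
    pa, pb = residue_frequency(q, a), residue_frequency(q, b)
    expected = pa * pb if a == b else 2 * pa * pb  # assumption for a != b
    return 2 * log2(q[tuple(sorted((a, b)))] / expected)


if __name__ == "__main__":
    column = "ALASAA"  # the example column from the slide
    q = pair_frequencies(column)
    print(q[("A", "A")])                  # 0.4
    print(residue_frequency(q, "A"))      # about 0.667
    print(log_odds(column, "A", "A"))     # about -0.304
```

A real BLOSUM matrix sums pair counts over every column of every block before taking the log odds; a single column is used here only to match the slide.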