Global Multiple Sequence Alignment
Multiple Sequence Alignment
Workshop on Developing Bioinformatics Programs
July, 2008
Hugh B. Nicholas Jr.
nicholas@psc.edu
Pittsburgh Supercomputing Center
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Sequence Analysis - Overview
Get New Sequence
Literature Search
Get Sequences from Database
Single Sequence Analysis
Database Search
Multiple Sequence Alignment
Database Search for Distantly Related Sequences
Examine Conserved Patterns
Find Distinctive Subfamily Patterns
Integrate Patterns with Structure
Focus: What to Remember
What kinds of information are contained in the alignment.
What is revealed by patterns of conservation within the
alignment.
How Multiple Sequence Alignments are scored.
Major types of Global Multiple Sequence Alignment
Algorithms.
Limitations of each major type of alignment algorithm.
Features that may indicate uncertainty in the alignment.
Global Multiple Sequence Alignment
Shows the relationships among a family of
sequences.
Detailed information about a family of genes and
their protein products.
Works on the entire length of the sequences.
Assumes that the sequences are homologous.
All available programs have both theoretical and
practical limitations.
Additional knowledge from structure, experiments,
or other computations can be used to guide editing
of the initial alignment.
Multiple Sequence Alignment Problem
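[Figure: a tree with an unobserved ancestral sequence at the root, two unobserved descendant sequences at the internal nodes, and four observed sequences at the tips: AGCATGATGCGC, AGCCTGCGC, ACTACATTG, and AGCCTCATCTCA.]
Aligned sequences:
AGCATGATGCGC
AGCCTCATCTCA
AGCCTG...CGC
ACT...ACATTG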
Multiple Sequence Alignment
140 * 160 * 180 * 200
bovbpi : KWK---ARKNFIKLGGNFDLSVEGISILAGLNLGYDPASGHSTVTCSSCSSHINSVHVHISKSK-VG : 178
humbpi : KWK---AQKRFLKMSGNFDLSIEGMSISADLKLGSNPTSGKPTITCSSCSSDIADVEVDMSGD--SG : 182
humlbpa : RWK---VRKSFFKLQGSFDVSVKGISISVNLLLGSES-SGRPTGYCLSCSSDIQNVELDIEGD--LE : 175
rablpb : SWK---VRKAFLRLKNSFDLYVKGLTISVHLVLGSES-SGRPTVTTSSCSSRIRDLELHVSGN--VG : 176
ratlbp : KWK---VRRSFVKLHGSFDLDVKSVTISVDLLLGVDP-SERPTVTASGCSNSFHKLLLHLQGEREPG : 177
humcetp : TLKYGYTTAWWLGIDQSIDFEID-SAIDLQINTQLTCDSGRVRTDAPDCYLSFHKLLLHLQGEREPG : 178
maccetp : TLKYGYTTAWGLGIDQSVDFEID-SAIDLQINTQLTCDSGRVRTDAPDCYLAFHKLLLHLQGEREPG : 178
rabcetp : TLNYSYTSAWGLGINQSVDFEID-SAIDLQINTELTCDAGSVRTNAPDCYLAFHKLLLHLQGEREPG : 162
hupltp : RRQ---LLYWFFYDGGYINASAEGVSIRTGLELSRDP-AGRMKVSNVSCQASVSRMHAAFGGT--FK : 162
mupltp : RRQ---LLYWFLYDGGYINASAEGVSIRTGLQLSQDS-SGRIKVSNVSCEASVSKMNMAFGGT--FR : 162
rrrya3 : SGP----------LVGLLQLAAE-VNVSSKVALGMSP-RGTPILILKRCNT----LLGHISLT--SG : 173
rry2g5 : KS-----------LIGFLDIAVE-VNITAKVRLTMDR-TGYPRLVIERCDT----LLGGIKVKLLRG : 164
g2599572 : AP------------LHTVPMPVR-ISIRADLHVDMGP-DGNLQLLTSACRP-----TVQAQST---- : 139
i g C
* 220 * 240 * 260
bovbpi : WLIQLFHKKIESALRNKMNSQVCEKVTNSVSSKLQPYFQTLPVMTKLDKVAGVDYSLVAPPRATANN : 245
humbpi : WLLNLFHNQIESKFQKVLESRICEMIQKSVSSDLQPYLQTLPVTTKIDSVAGINYGLVAPPATTAET : 249
humlbpa : ELLNLLQSQIDARLREVLESKICRQIEEAVTAHLQPYLQTLPVTTEIDSFADIDYSLVEAPRATAQM : 242
rablpb : WLLNLFHNQIESKLQKVLESKICEMIQKSVTSDLQPYLQTLPVTTQIDSFAGIDYSLMEAPRATAGM : 243
ratlbp : WIKQLFTNFISFTLKLVLKGQICKEI-NVISNIMADFVQTRAASADIDTILGIDYSLVAAPQAKAQT : 243
humcetp : WIKQLFTNFISFTLKLVLKGQICKEI-NIISNIMADFVQTRAASILSDGDIGVDISLTGDPVITASY : 244
maccetp : WLKQLFTNFISFTLKLILKRQVCNEI-NTISNIMADFVQTRAASILSDGDIGVDISLTGDPIITASY : 244
rabcetp : WLKQLFTNFISFTLKLILKRQVCNEI-NTISNIMADFVQTRAASILSDGDIGVDISVTGAPVITATY : 228
hupltp : KVYDFLSTFITSGMRFLLNQQICPVLYHAGTVLLNSLLDTVPVRSSVDELVGIDYSLMKDPVASTSN : 229
mupltp : RMYNFFSTFITSGMRFLLNQQICPVLYHAGTVLLNSLLDTVPVRSSVDDLVGIDYSLLKDPVVSNGN : 229
rrrya3 : LLPTPIFGLVEQTLCKVLPGLLCPVV-DSVLSVVNELLGATLSLVPLGPLGSVEFTLATLPLISNQY : 239
rry2g5 : LLPNLVDNLVNRVLANVLPDLLCPIV-DVVLGLVNDQLGLVDSLVPLGILGSVQYTFSSLPLVTGEF : 230
g2599572 : -REAESKSSRSILDKVVDVDKLCLDV-SKLLLFPNEQLMSLTALFPVTPNCQLQYLALAAPVFSKQG : 204
i l C t d l P
Multiple Sequence Alignment
Our best guess at the detailed evolutionary
history of a family of related genes.
Speciation events
Gene duplications
Mutations of nucleotides and selection of amino
acids
Insertion and deletion events
Given a method of computation, it implies a
specific phylogeny of the genes and species.
Multiple Sequence Alignment
The pattern of variation and conservation at each
position within the alignment provides information about
the structural and functional role of that position in the
gene or protein.
Highly conserved positions are likely to be critical to the
structure or function of the molecule.
Positions that vary systematically (limited variation) between
defined subgroups may reflect the evolution of new functions
or the structural variation required to support a new function.
Highly variable positions provide scaffolding or filler.
Conserved hydrophobic residues are probably important to
structure.
Conserved polar residues may be important for catalytic or
functional interactions.
Three Dimensional Alignment Space
[Figure: three axes labeled sequence i, sequence j, and sequence k, with the 3-diagonal (i,j,k) and the 2-diagonals (i,j), (i,k), and (j,k).]
A three dimensional alignment space (for aligning three
sequences) showing the two and three dimensional
path graphs associated with the alignments.
DISTANCE and Analyzing Sequences
Distance is a dissimilarity measure that has four
defining properties. If d(a, b) is the distance
between sequence a and sequence b, then:
d(a, b) >= 0.
d(a, b) = 0 only if sequence a and sequence b are the
same sequence.
d(a, b) = d(b, a).
d(a, b) + d(a, c) >= d(b, c).
This fourth property, called the triangle inequality, is
particularly important if we wish to evaluate the
relationship among more than two sequences at a
time.
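The four properties can be checked mechanically for any candidate distance. The following is a minimal sketch, not part of the original slides: the helper names `hamming` and `check_distance_properties` are hypothetical, and the Hamming distance on equal-length strings is assumed only as a simple example of a distance.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def check_distance_properties(seqs, dist):
    """Return the names of the distance properties that dist violates on seqs."""
    violations = set()
    for a, b in product(seqs, repeat=2):
        d = dist(a, b)
        if d < 0:
            violations.add("non-negativity")
        if (d == 0) != (a == b):
            violations.add("identity")
        if d != dist(b, a):
            violations.add("symmetry")
    for a, b, c in product(seqs, repeat=3):
        # Same form as the slide: d(a, b) + d(a, c) >= d(b, c).
        if dist(a, b) + dist(a, c) < dist(b, c):
            violations.add("triangle inequality")
    return sorted(violations)


if __name__ == "__main__":
    print(check_distance_properties(["AGCA", "AGCC", "ACTA"], hamming))
```

Squaring the Hamming distance is a quick way to see a measure that keeps the first three properties but breaks the fourth.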
The Triangle Inequality
d(a, b) + d(a, c) >= d(b, c)
Is the algebraic equivalent of the Euclidean
postulate that three sides form a triangle.
It allows us to construct a map.
A map is the simultaneous representation of the
relationships among three or more objects.
d(a, b) + d(a, c) >= d(b, c)
[Figure: two cases. If the inequality is true, A, B, and C can be drawn as a triangle with sides d(a,b), d(a,c), and d(b,c). If it is false, A cannot be placed on the map.]
Alignment Scoring Methods
Sequences (one alignment column):
Seq_1   A
Seq_2   A
Seq_3   C
Seq_4   C
Distance between sequences:
        Seq_1  Seq_2  Seq_3  Seq_4
Seq_1   0      0      1      1
Seq_2   0      0      1      1
Seq_3   1      1      0      0
Seq_4   1      1      0      0
Evolutionary Score = 1
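[Figure: a tree joining Seq_1 (A) and Seq_2 (A) at an internal node A, and Seq_3 (C) and Seq_4 (C) at an internal node C; the single A-to-C change between the two internal nodes gives the score of 1.]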
Star Score = 2
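[Figure: a star with a single central node A joined directly to Seq_1 (A), Seq_2 (A), Seq_3 (C), and Seq_4 (C); the two A-to-C branches give the score of 2.]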
Sum of Pairs Score = 4
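[Figure: Seq_1 (A), Seq_2 (A), Seq_3 (C), and Seq_4 (C) with every pair of sequences joined directly; the four A-C pairs give the score of 4.]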
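The three scores for the single column A, A, C, C can be reproduced with a few lines of code. This is a sketch under stated assumptions: the cost is 0 for identical residues and 1 otherwise, as in the distance table; the evolutionary score is computed with Fitch's parsimony count on the tree ((Seq_1, Seq_2), (Seq_3, Seq_4)); and all function names are hypothetical.

```python
from itertools import combinations


def cost(x, y):
    """Unit cost: 0 for identical residues, 1 otherwise."""
    return 0 if x == y else 1


def sum_of_pairs(column):
    """Sum the cost over every pair of residues in the column."""
    return sum(cost(x, y) for x, y in combinations(column, 2))


def star_score(column):
    """Cost of joining every residue to the best single central residue."""
    return min(sum(cost(center, x) for x in column) for center in set(column))


def evolutionary_score(tree, residues):
    """Minimum number of changes on a binary tree (Fitch parsimony).

    tree is a nested tuple of leaf names; residues maps leaf name to residue.
    """
    def visit(node):
        if isinstance(node, str):
            return {residues[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


residues = {"Seq_1": "A", "Seq_2": "A", "Seq_3": "C", "Seq_4": "C"}
column = list(residues.values())
tree = (("Seq_1", "Seq_2"), ("Seq_3", "Seq_4"))

if __name__ == "__main__":
    print(evolutionary_score(tree, residues), star_score(column), sum_of_pairs(column))
```

The same column therefore costs 1, 2, or 4 depending on the scoring method, which is why the choice of method affects where gaps and mismatches end up.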
Multiple Sequence Alignment Algorithms
Pairwise Progressive method
Aligns a pair of sequences and successively adds more
sequences to the extant alignment.
Deals with many sequences very rapidly.
Local minima problems.
Multiple Dimensional Dynamic Programming
Needleman-Wunsch with more than two sequences.
Exact, best solution for specific scoring scheme.
Requires a lot of memory, thus can align only a few sequences.
Consistency
Custom scores based on how often particular residues are
aligned in a wide ranging set of pairwise alignments.
Incorporates both global and local alignment information
Can incorporate other kinds of information from other kinds of
alignments such as structural superpositions.
Progressive Pairwise Alignment
ClustalW, PileUp, Multalign
Five peptides and a tree showing their relative overall similarity
itcg itck ltscg ktcsg itctd
Begin with the alignment of the most similar
pair of sequences
itcg
itck } itc(k,g)
itcg itck ltscg ktcsg itctd
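Choosing the starting pair can be sketched as follows. This is an illustration only: it assumes plain edit distance as the similarity measure, whereas real programs use scored pairwise alignments or faster approximations, and the function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # gap in b
                            curr[j - 1] + 1,          # gap in a
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


def closest_pair(seqs):
    """Return the pair of sequences with the smallest edit distance."""
    return min(combinations(seqs, 2), key=lambda pair: edit_distance(*pair))


peptides = ["itcg", "itck", "ltscg", "ktcsg", "itctd"]

if __name__ == "__main__":
    print(closest_pair(peptides))
```

For the five peptides on the slide, this picks itcg and itck, which differ by a single substitution.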
Progressively align the most similar sequences
(i,l,k)t(-,s,c)(t,s,c)(g,k,d)
(i,l)t(-,s)c(g,k)
(k,i)tc(t,s)(g,d)
itc(g,k)
itcg itck ltscg ktcsg itctd
Resulting Alignment
Actual Desired
ktcsg kt-csg
itctd it-ctd
lt-ck lt-c-k
it-cg it-c-g
ltscg ltsc-g
Progressive Pairwise Alignments
Major Limitations and Strengths
Local Minima problems that result from:
Looking only at a subset of the data at any one time
Cannot change the alignment of sequences that were
aligned early on the basis of sequences introduced later.
Most common error is failure to identify highly
conserved residues because they are put into several
columns of the alignment rather than into a single
column.
May avoid spurious alignment of gaps into too few
locations that is a side effect of sums of pairs scoring.
Very fast, can run on personal computers and does
not require much memory.
3 Dimensional Path Graph
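[Figure: three-dimensional path graph for the three-sequence alignment below, with the axes labeled by the residues of the three sequences.]
D - - Q - L F
G D N V Q - - -
L - - - Q G L -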
Three Dimensional Alignment Space
[Figure: three axes labeled i, j, and k, with the 3-diagonal (i,j,k) and the 2-diagonals (i,j), (i,k), and (j,k).]
A 3-diagonal and three 2-diagonals it contains.
Path Graph of a pairwise alignment
[Figure: path graph of a pairwise alignment, with sequence GCTGGAAGGCAT along one axis and sequence GCAGAGCACT along the other.]
Path Graph projected from a multiple
sequence alignment
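[Figure: path graph projected from a multiple sequence alignment, with sequence GCTGGAAGGCAT along one axis and sequence GCAGAGCACT along the other.]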
Area scoring less than the projected alignment
[Figure: path graph for GCTGGAAGGCAT (one axis) and GCAGAGCACT (other axis), showing the area that scores less than the projected alignment.]
The projected areas form a three dimensional
volume in alignment space
[Figure: three axes labeled i, j, and k, with the 3-diagonal (i,j,k) and the 2-diagonals (i,j), (i,k), and (j,k).]
A 3-diagonal and three 2-diagonals it contains.
Multidimensional Dynamic Programming
Major Limitations and Strengths
Spuriously aligns gaps that are the result of different
insertion or deletion events because of rigorous sum of
pairs scoring.
Finds a well defined, rigorous, sum of pairs optimal
alignment for the region of alignment space that is
examined.
This has a good chance of being the absolute sum of pairs
optimal alignment for the sequences given the set of
scores.
MSA Alignment Strategy
Seryl tRNA Synthetase Alignments
Name        Length (aa)   Memory (* 1,000,000 bytes)   CPU time (seconds)
Sys_Bacsu   425
Sys_Ecoli   430
Sys_Human   514           0.58                         1.2
Sys_Yeast   462           0.70                         5.3
Sys_Mycge   417           17.4                         120.4
Sys_Theth   421           45.7                         550.5
Sys_Halma   460           10,926.                      27,818.*
* Alignment less than 20% complete on DEC 8400
All runs used MSA release 2.1
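One way to see why each added sequence is so costly is to count the cells in the full dynamic programming lattice, which is the product of (length + 1) over all sequences. This is only a rough sketch: the MSA program examines a restricted region of alignment space, so the memory figures in the table are far below this upper bound, and the function name `lattice_cells` is hypothetical.

```python
from math import prod


def lattice_cells(lengths):
    """Number of cells in the full alignment lattice for sequences of these lengths."""
    return prod(n + 1 for n in lengths)


# Sequence lengths in the order they appear in the table.
lengths = [425, 430, 514, 462, 417, 421, 460]

if __name__ == "__main__":
    for k in range(2, len(lengths) + 1):
        print(k, "sequences:", lattice_cells(lengths[:k]), "cells")
```

Each added sequence multiplies the lattice by several hundred, which is consistent with the steep growth in memory and CPU time down the table even after pruning.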
T-COFFEE (consistency) Algorithm
Generate global pairwise alignments using the
Needleman-Wunsch algorithm from ClustalW
with its heuristic rules.
Generate the 10 best nonintersecting local
alignments using the Waterman-Eggert extension
to the Smith-Waterman algorithm.
Generate initial scores from these alignments.
Extend the scores by examining alignments from
triplets of sequences for consistency.
Align all of the sequences using a progressive
pairwise strategy.
Computation Grid for Smith-Waterman
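[Figure: computation grid with residues a1 to a6 along one axis and b1 to b5 along the other. The cell for (a4, b4) is computed from the diagonal term SW3,3 + s(a4,b4) and from the gap terms SW4-l,4 + g, with l = 1, 2, or 3.]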
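The cell update in the grid can be sketched in code. This is a simplified illustration, not the implementation used by T-Coffee: it assumes a linear gap penalty (each gap position costs the same, so only the neighboring cells need to be examined instead of every gap length l), hypothetical scores of +2 for a match and -1 for a mismatch or gap, and it returns only the best local score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b under a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)  # align x with y
            up = prev[j] + gap                                    # gap in b
            left = curr[j - 1] + gap                              # gap in a
            score = max(0, diag, up, left)                        # 0 restarts the alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Clamping each cell at zero is what makes the alignment local: a poorly scoring prefix is discarded rather than carried forward.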
T-Coffee: Pairwise Alignments
[Figure: three path-graph plots, each with axes running from 0 to 200, showing the global alignment and the local alignments for GT26_SCHMA versus GTM1_HUMAN, GT26_SCHMA versus GTP_HUMAN, and GTM1_HUMAN versus GTP_HUMAN.]
Consistency Among Three Sequences
Observed alignments between a sequence and two others
GTM1_HUMAN => (187) FEGLEKISAYMKSSRFLPRP (206)
GTP_HUMAN => (170) LDAFPLLSAYV..GRLSARP (187)
and
GTM1_HUMAN => (187) FEGLEKISAYMKSSRFLPR (205)
GT26_SCHMA => (170) LNEFPKLVSFKKCIEDLPQ (188)
Implies an alignment between the two other sequences:
GTP_HUMAN => (170) LDAFPLLSAYV..GRLSAR (186)
GT26_SCHMA => (170) LNEFPKLVSFKKCIEDLPQ (188)
Only part of the implied alignment is actually observed:
GTP_HUMAN => (170) LDAFPLLSAYV (180)
GT26_SCHMA => (170) LNEFPKLVSFK (180)
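The reasoning in this example can be sketched with residue-position maps. This is a toy illustration with made-up positions, not the coordinates from the slide: each pairwise alignment is assumed to be stored as a dictionary from a position in one sequence to the aligned position in the other, and the function names are hypothetical.

```python
def implied_alignment(a_to_b, a_to_c):
    """Compose two alignments that share sequence A into an implied B-to-C alignment."""
    return {b: a_to_c[a] for a, b in a_to_b.items() if a in a_to_c}


def supported_pairs(implied, observed):
    """Keep only the implied residue pairs that the observed B-to-C alignment also contains."""
    return {b: c for b, c in implied.items() if observed.get(b) == c}


# Hypothetical positions: A aligned with B, and A aligned with C.
a_to_b = {1: 1, 2: 2, 3: 4}
a_to_c = {1: 1, 2: 3, 4: 5}
# Hypothetical directly observed alignment of B with C.
b_to_c = {1: 1, 2: 2}

if __name__ == "__main__":
    implied = implied_alignment(a_to_b, a_to_c)
    print(implied, supported_pairs(implied, b_to_c))
```

Residue pairs that are both implied and observed are the consistent ones, and a consistency-based method gives them more weight when the final alignment is built.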
T-Coffee Strengths and Weaknesses
Empirically observed to yield highly accurate alignments
on problems where even MSA gives flawed results.
Can solve very large problems -- 153 GSTs.
T-Coffee = 36 hours; ClustalW = 6 minutes; MSA = not feasible
Can be customized to include a wide variety of
information.
Can require a lot of computer resources.
ProbCons Consistency in a Probabilistic,
HMM Framework
Convert Needleman-Wunsch to use the probability
that residue xi should be aligned with residue yj when
aligning sequences X and Y, rather than log-odds scores.
This is known as the Viterbi algorithm.
This is a dynamic programming algorithm, as is the
Needleman-Wunsch algorithm.
Because it uses probabilities rather than log-odds scores, it is
easily represented with a Hidden Markov Model of pairwise
alignment.
Each cell of the two-way alignment table contains the
probability that xi will be juxtaposed with yj in the best
alignment of sequences X and Y.
ProbCons
The consistency information can be computed straightforwardly by
multiplying together appropriate sets of these two-dimensional
pairwise alignment tables, or matrices.
P′xy ← (1/|S|) ∑z∈S Pxz Pzy, where S is the set of sequences being aligned.
Use these probabilities as scores for computing a progressive
pairwise alignment.
Use iterative refinement: separate the alignment into two arbitrary
subalignments and then realign them.
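The consistency update P′xy ← (1/|S|) ∑z Pxz Pzy can be written as a few matrix products. This is a minimal sketch in plain Python, not the ProbCons implementation: it assumes each pairwise table is stored as a list of rows under an ordered pair of sequence names, that Pxx is taken to be the identity matrix when z equals x or y, and all names are hypothetical.

```python
def identity(n):
    """n-by-n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(P, lengths):
    """Apply P'xy = (1/|S|) * sum over z of Pxz * Pzy to every ordered pair (x, y)."""
    names = list(lengths)

    def table(x, y):
        # Assumption: a sequence aligned with itself is the identity matrix.
        return identity(lengths[x]) if x == y else P[(x, y)]

    updated = {}
    for x in names:
        for y in names:
            if x == y:
                continue
            total = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in names:
                term = matmul(table(x, z), table(z, y))
                total = [[t + u for t, u in zip(t_row, u_row)]
                         for t_row, u_row in zip(total, term)]
            updated[(x, y)] = [[v / len(names) for v in row] for row in total]
    return updated


# Toy example: three sequences of length 1, so every table is 1-by-1.
lengths = {"X": 1, "Y": 1, "Z": 1}
P = {("X", "Y"): [[0.5]], ("Y", "X"): [[0.5]],
     ("X", "Z"): [[1.0]], ("Z", "X"): [[1.0]],
     ("Y", "Z"): [[1.0]], ("Z", "Y"): [[1.0]]}

if __name__ == "__main__":
    print(consistency_transform(P, lengths)[("X", "Y")])
```

In the toy example the X-Y probability rises from 0.5 to 2/3 because both X and Y align confidently with Z, which is the kind of indirect support that consistency is meant to capture.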
Accuracy of Alignments
Program     Homstrad   Prefab   BaliBase
ClustalW    61.15      61.68    42.83
T-Coffee    65.37      69.97    56.10
ProbCons    66.41      70.54    58.24
Dialign-T   57.92      62.05    44.59
M-Coffee8   67.75      72.91    62.02
Wallace, IM, O’Sullivan, O, Higgins, DG, Notredame, C. 2006. M-Coffee: combining multiple
sequence alignment methods with T-Coffee. Nucleic Acids Research. 34:1692-1699
Alignment Strategies
Align all or a large, representative subset with ProbCons and T-Coffee.
Align groups of sequences with ProbCons and T-Coffee and join the
groups with ClustalW.
Make several alignments in ClustalW with different scoring matrices
and gap penalties.
Edit the alignment making use of information from:
Pattern finding programs
Known structures in the family
Known biochemistry
Knowledge of the weaknesses of the alignment programs
Be cautious – the human eye finds false patterns