WHO MSAppt - Multiple Sequence Alignment
Multiple Sequence Alignment
Definition
• Homology: related by descent
• Homologous sequence positions
ATTGCGC ATTGCGC ATTGCGC
ATTGCGC AT-ACGC
A ATACGC
Reasons for aligning sets of sequences
• Organise data to reflect sequence homology
• Estimate evolutionary distance
• Infer phylogenetic trees from homologous sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Look for evidence of selection
• Summarise information
Alignments help to organise, visualise and analyse sequence data
The process of aligning sequences
is a game involving playing off gaps
and mismatches
Ways of aligning multiple sequences
• By hand
• Automated
• Combination
Definition
Optimality criteria: some kind of rule or
scoring scheme to help you decide which
alignment you consider to be the best
Pairwise vs Multiple Sequences
• Pairs of sequences typically aligned using
exhaustive algorithms (dynamic
programming)
– complexity of exhaustive methods is O(2^n m^n)
(n = number of sequences, m = sequence length)
• Multiple sequence alignment usually
performed using heuristic methods
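As a rough illustration of the exhaustive (dynamic programming) approach for a pair of sequences, the sketch below computes a global alignment score in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for the example, not values taken from these slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


print(global_alignment_score("ATTGCGC", "ATACGC"))  # 3
```

Under this illustrative scheme the alignments ATTGCGC/AT-ACGC and ATTGCGC/ATA-CGC both score 3 (five matches, one mismatch, one gap), so the scoring scheme alone cannot choose between them. The table costs O(m^2) time for two sequences of length m, which is why the same idea becomes impractical as more sequences are added.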
The Correct Alignment
ATTGCGC ATTGCGC ATTGCGC
ATTGCGC AT-ACGC
A ATACGC
ATTGCGC
ATA-CGC
The Correct Alignment
                     Correct according to   Correct according to
                     optimality criteria    homology
Exhaustive methods   Always                 Not always
Heuristic methods    Not always             Not always
• Sequence alignment is easy with
sufficiently closely related sequences
• Below a certain level of identity sequence
alignment may become meaningless
– twilight zone for amino acid sequences: ~30% identity
• In the twilight zone it is good to make use
of additional information if possible (e.g.
structure)
Consensus Sequences
• Simplest Form:
A single sequence which represents the most
common amino acid/base in that position
Y D D G A V - E A L
Y D G G - - - E A L
F E G G I L V E A L
F D - G I L V Q A V
Y E G G A V V Q A L
Y D G G A/I V/L V E A L
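The consensus row above can be reproduced with a short sketch. It assumes that gaps are ignored when counting and that ties are reported with a "/" in top-to-bottom order of first appearance; both choices are assumptions made to match the example, and the function name is hypothetical.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Most common non-gap residue in each column; ties joined with '/'."""
    result = []
    for column in zip(*rows):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(gap)  # column made only of gaps
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so tied residues are listed
        # in the order they appear from the top row down.
        result.append("/".join(r for r, c in counts.items() if c == top))
    return result


alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))  # Y D G G A/I V/L V E A L
```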
Multiple Alignment Formats
e.g. Clustal, Phylip, MSF, MEGA etc.
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment
CAS1_BOVIN MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
*:***: **.*.*:* : . :
Phylip Format (Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL
AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
Phylip Format (Sequential)
3 100
Rat
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
Mega Format
#mega
TITLE: No title
#Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog ---ATGGGTTTGACAGCACATGATCGT---CAGCT
Progressive Multiple Alignment
• Heuristic
• Perform pairwise alignments
• Align sequences to alignments or
alignments to existing alignments (profile
alignments)
• Do the alignments in some sensible order
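One common way to choose a "sensible order" is to compute a distance for every pair and start with the closest pair. The sketch below assumes the pairwise alignments already exist (rows padded with gaps to equal length) and uses a simple p-distance; the function names and toy sequences are hypothetical, and real programs build a full guide tree rather than a sorted list of pairs.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def join_order(aligned):
    """All pairs of sequence names, most similar pair first."""
    return sorted(
        combinations(aligned, 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


seqs = {"seq1": "ATTGCGC", "seq2": "AT-ACGC", "seq3": "ATTGCGA"}
for name_a, name_b in join_order(seqs):
    print(name_a, name_b, round(p_distance(seqs[name_a], seqs[name_b]), 3))
```

Here seq1 and seq3 differ at 1 of 7 comparable positions and would be aligned first; seq2 would then be added to that pair as a sequence-to-alignment step.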
Progressive versus Simultaneous
• speed versus accuracy
• simultaneous methods are capable of
working out an 'exact' solution to the
problem of multiple sequence alignment
(e.g. NCBI's MSA – user interface QAlign)
Iterative methods
• Several progressive alignment methods can
be iterated
– e.g. Barton-Sternberg, ClustalX
ClustalX Algorithm
• Perform pairwise alignments and calculate
distances for all pairs of sequences
• Construct guide tree (dendrogram) joining the
most similar sequences using Neighbour Joining
• Align sequences, starting at the leaves of the guide
tree. This involves the pair-wise comparisons as
well as comparison of single sequence with a
group of seqs (Profile)
• ClustalX is not optimal
• There are known areas in which ClustalX
performs badly e.g.
– errors introduced early cannot be corrected by
subsequent information
– alignments of sequences of differing lengths
cause strange guide trees and unpredictable
effects
– edges: ClustalX does not penalise gaps at edges
• There are alternatives to ClustalX available
T-Coffee
• JMB 2000
• Also a progressive alignment method
• Designed to solve some of the problems
with Clustal (in particular the problem of
Clustal's inability to correct errors that
appear early in the process of alignment)
• Can consider global and local pair-wise
alignments
Using ClustalX
• Start with sequences in FASTA format (or
an existing alignment in Clustal format)
• [Do Alignment] on the alignment menu
ClustalX Parameters
• Scoring Matrix
• Gap opening penalty
• Gap extension penalty
• Protein gap parameters
• Additional algorithm parameters
• Secondary structure penalties
Score Matrices
• Pairwise matrices and multiple alignment
matrix series
• PAM (Dayhoff), BLOSUM (Henikoff),
GONNET (default), user defined
• Transition (A<->G, C<->T) / transversion (purine<->pyrimidine, e.g. A<->C)
ratio – low for distantly related sequences
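To make the transition/transversion distinction concrete, the sketch below classifies the differences between two aligned nucleotide sequences. A transition stays within the purines (A, G) or within the pyrimidines (C, T); a transversion crosses between the two classes. The function names are hypothetical, and columns containing gaps or ambiguity codes are simply skipped, which is an assumption of this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def substitution_type(x, y):
    """Return 'transition', 'transversion', or None if x and y are identical."""
    if x == y:
        return None
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(a, b):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(a.upper(), b.upper()):
        if x in NUCLEOTIDES and y in NUCLEOTIDES:  # skip gaps / ambiguity codes
            kind = substitution_type(x, y)
            if kind is not None:
                counts[kind] += 1
    return counts


# First ten columns of the Rat and Rabbit rows from the Mega example.
print(count_substitutions("ATGGTGCACC", "ATGGTGCATC"))
```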
Gap Penalties
• Linear gap penalties vs affine gap penalties
p = o + l·e (o = gap opening penalty, l = gap length, e = gap extension penalty)
• Gap opening
• Gap extension
• Protein specific penalties (on by default)
– Increase the probability of gaps associated with certain
residues
– Increase the chances of gaps in loop regions (> 5
hydrophilic residues)
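The affine penalty formula above can be worked through numerically. The values used here (opening 10, extension 0.5) are illustrative assumptions, not program defaults, and some programs use a slightly different convention in which the first gap position is covered by the opening penalty alone.

```python
def linear_gap_penalty(length, per_position):
    """Linear scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, opening, extension):
    """Affine scheme from the slide: p = o + l * e."""
    return opening + length * extension


# One gap of length 4 versus two separate gaps of length 2.
one_long = affine_gap_penalty(4, opening=10, extension=0.5)
two_short = 2 * affine_gap_penalty(2, opening=10, extension=0.5)
print(one_long, two_short)  # 12.0 22.0
```

With an affine scheme a single long gap is cheaper than several short ones covering the same number of positions, whereas a linear scheme would charge both cases equally.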
Algorithm parameters
• Slow-accurate pair-wise alignment
• Do alignment from guide tree
• Reset gaps before aligning (iteration)
• Delay Divergent sequences (%)
Additional displays
• Column Scores
• Low quality regions
• Exceptional residues
Multiple Alignment Tips
• Align pairs of sequences using an optimal method
• Progressive alignment programs such as ClustalX
for multiple alignment
• Choose representative sequences to align carefully
• Choose sequences of comparable lengths
• Progressive alignment programs may be combined
• Review alignment by eye and edit
• If you have a choice, align amino acid sequences
rather than nucleotides
Alignment of coding regions
• Nucleotide sequences are much harder to align
accurately than proteins
• Protein coding sequences can be aligned using the
protein sequences
– e.g. BioEdit: toggle translation to amino acid, call
ClustalW to align, edit alignment by hand, toggle back
to nucleotide
• In-frame nucleotide alignments can be used, e.g.
to determine non-synonymous and synonymous
distances separately
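The "toggle back to nucleotide" step amounts to threading the unaligned coding sequence through the protein alignment, three nucleotides per residue and three gap characters per protein gap. The sketch below is a minimal version of that idea; the function name is hypothetical, and it assumes the DNA is in frame and contains exactly one codon per aligned residue (no stop codon, no partial codons).

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Turn an aligned protein row into an in-frame nucleotide row."""
    residues = sum(ch != gap for ch in aligned_protein)
    if len(coding_dna) != 3 * residues:
        raise ValueError("coding sequence length must be 3 x number of residues")
    codons = (coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3))
    # A protein gap becomes a whole-codon gap; a residue consumes the next codon.
    return "".join(gap * 3 if ch == gap else next(codons) for ch in aligned_protein)


# Start of the Rabbit sequence from the Mega example, translated as MVHLSSEE.
print(thread_codons("MVHLSS-EE", "ATGGTGCATCTGTCCAGTGAGGAG"))
# ATGGTGCATCTGTCCAGT---GAGGAG
```

Because gaps are always inserted in multiples of three, the reading frame is preserved, which is what makes separate synonymous and non-synonymous distance estimates possible.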
Multiple Alignments and Phylogenetic Trees
– You can make a more accurate multiple
sequence alignment if you know the tree
already
– A phylogenetic tree is only as good as the
alignment from which it was produced
– The process of constructing a multiple
alignment (unlike pair-wise) needs to take
account of phylogenetic relationships
Editing a multiple sequence alignment
• It is NOT fraud to edit a multiple sequence
alignment
• Incorporate additional knowledge if
possible
• Alignment editors help to keep the data
organised and help to prevent unwanted
mistakes
Alignment Editors
• e.g. GDE, BioEdit, Seaview, Jalview etc.
• Some alignment editors have begun to
function as sequence analysis platforms
(e.g. tools on BioEdit, GDE)
• Construct sub-sequences (GDE, Seaview)
• Annotate sequences (Seaview)
Aligning weakly similar sequences
Sequence contains conserved regions
• e.g. DIALIGN (Morgenstern, Dress, Werner)
– re-aligns regions between conserved blocks
http://bibiserv.techfak.uni-bielefeld.de/
useful if sequences contain consistent conserved blocks
• Block Maker – searches for conserved words that
may be inconsistent http://blocks.fhcrc.org/
Profile Alignment
Gribskov et al. 1987
• Position specific scores
• Allows addition of extra sequence(s) to an
alignment
• Allows alignment of alignments
• Gaps introduced as whole columns in the separate
alignments
• Optimal alignment in time O(a^2 l^2)
(a = alphabet size, l = sequence length)
• Information about the degree of conservation of
sequence positions is included
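A very reduced picture of position-specific scoring is sketched below: each column of an alignment becomes a table of residue frequencies, and a sequence is scored by summing the frequency of its residue at each position. This is a simplification under stated assumptions: profiles in the sense of Gribskov et al. weight the column composition with a substitution matrix and allow gaps, neither of which is done here, and the function names are hypothetical.

```python
from collections import Counter


def build_profile(rows):
    """One dict per column mapping each symbol (gaps included) to its frequency."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def score_against_profile(seq, profile):
    """Ungapped score: sum of the column frequency of each residue in seq."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(col.get(res, 0.0) for res, col in zip(seq, profile))


alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
profile = build_profile(alignment)
print(round(score_against_profile("YDGGAVVEAL", profile), 2))  # 6.6
```

Fully conserved columns (such as the fourth, all G) contribute a full point for a matching residue, while variable columns contribute less, which is how the degree of conservation enters the score.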
Good reasons to use profile alignments
– Adding a new sequence to an existing multiple
alignment that you want to keep fixed
(align sequence to profile)
– Searching a database for new members of your protein
family
(pfsearch)
– Searching a database of profiles to find out which one
your sequence belongs to
(pfscan)
– Combining two multiple sequence alignments
(profile to profile)
Profile Alignment Using ClustalX
• Profile Alignment Mode
• Align sequence to profile
• Align profile 1 to profile 2
• Secondary structure parameters
Profile searching using PSI-BLAST
• Position Specific Iterative
• Perform search – construct profile –
perform search
• Convergence (hopefully…)
• Increased sensitivity for distantly related
sequences
• Available on-line (NCBI)
Databases of Aligned Sequences
• Hovergen http://pbil.univ-lyon1.fr/databases/hovergen.html
(vertebrate alignments)
• Pfam http://www.sanger.ac.uk/Software/Pfam/
(protein domain alignments and profile HMMs)
• BLOCKS http://blocks.fhcrc.org/
• Ribosomal Database Project
http://rdp.cme.msu.edu/html/ alignments and trees
derived from rRNA sequences
• Interpro – combines information from other
sources
• Many more…
Probabilistic Models of Sequence Alignment
• Hidden Markov Models
– sequence of states and associated symbol probabilities
• Produces a probabilistic model of a sequence
alignment
• Align a sequence to a Profile Hidden Markov
Model
– Algorithms exist to find the most probable pathway
through the model
Markov Chain: A chain of things. The
probability of the next thing depends only
on the current thing
Hidden Markov Model: A sequence of states
which form a Markov Chain. The states are
not observable. The observable characters
have “emission” probabilities which depend
on the current state.
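The definitions above can be illustrated with a toy two-state model and the Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. Everything in the example is an assumption made for illustration: the state names ("c" for a conserved, A-rich state and "v" for a variable state), all probabilities, and the function name. It is not a profile HMM, only the smallest model that shows hidden states, transitions and emissions.

```python
import math


def viterbi(observed, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a non-empty observed string."""
    # Log-probability of the best path ending in each state at position 0.
    best = {s: math.log(start_p[s] * emit_p[s][observed[0]]) for s in states}
    backpointers = []
    for symbol in observed[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Markov property: only the previous state matters.
            prev = max(states, key=lambda p: best[p] + math.log(trans_p[p][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(trans_p[prev][s] * emit_p[s][symbol])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


states = ("c", "v")
start_p = {"c": 0.5, "v": 0.5}
trans_p = {"c": {"c": 0.9, "v": 0.1}, "v": {"c": 0.1, "v": 0.9}}
emit_p = {
    "c": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "v": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
print(viterbi("AAAAAAGTCGTC", states, start_p, trans_p, emit_p))  # ccccccvvvvvv
```

The observer sees only the nucleotides; the path of "c" and "v" states is inferred from the emission and transition probabilities.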
Some more recent developments
• The need to align genomes
– alignment tools required that can align very
large regions of genomes
– poses a computational challenge
– programs such as DIALIGN can be run in
parallel on multiprocessor machines
Some more recent developments
• MUSCLE
– Faster (uses k-mer frequencies to calculate the first
pair-wise distances)
– Progressive (repeats the MSA using the more accurate
Kimura distance between aligned amino acid sequences)
– Has a third optimisation stage that involves making
profile alignments of sub-trees and accepting the new
alignment if it improves the SP score.
• MuSiC - multiple sequence alignment with
constraints
– web server that allows a user to enter a set of