WHO MSAppt - Multiple Sequence Alignment by keara

									Multiple Sequence Alignment
               Definition
• Homology: related by descent

• Homologous sequence positions

ATTGCGC      ATTGCGC   ATTGCGC
ATTGCGC                AT-ACGC
   A         ATACGC
      Reasons for aligning sets of
              sequences
•   Organise data to reflect sequence homology
•   Estimate evolutionary distance
•   Infer phylogenetic trees from homologous sites
•   Highlight conserved sites/regions
•   Highlight variable sites/regions
•   Uncover changes in gene structure
•   Look for evidence of selection
•   Summarise information
Alignments help to
• Organise
• Visualise
• Analyse
sequence data
 The process of aligning sequences
is a game involving playing off gaps
and mismatches
    Ways of aligning multiple
           sequences
• By hand
• Automated
• Combination
Definition

 Optimality criteria: some kind of rule or
 scoring scheme that helps you decide which
 alignment you consider to be the best
 Pairwise vs Multiple Sequences
• Pairs of sequences typically aligned using
  exhaustive algorithms (dynamic
  programming)
  – complexity of exhaustive methods is O(2^n m^n)
    n = number of sequences, m = sequence length
• Multiple sequence alignment usually
  performed using heuristic methods
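As a rough illustration of the exhaustive (dynamic programming) approach for a single pair of sequences, the sketch below fills a Needleman-Wunsch score matrix and traces back one optimal alignment. The function name and the match, mismatch and gap values are assumptions chosen for the example, and a simple linear gap penalty is used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended settings.
    Returns (score, aligned_a, aligned_b).
    """
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: prefixes aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ATTGCGC", "ATACGC")
print(score)
print(top)
print(bottom)
```

For two sequences the matrix has about m × m cells. Extending the same idea to n sequences needs an n-dimensional lattice of about m^n cells, which is why multiple alignment usually falls back on heuristics.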
     The Correct Alignment

ATTGCGC    ATTGCGC   ATTGCGC
ATTGCGC              AT-ACGC
   A       ATACGC

                      ATTGCGC
                     ATA-CGC
  The Correct Alignment
                      Correct according to   Correct according to
                      optimality criteria    homology
Exhaustive methods    Always                 Not always
Heuristic methods     Not always             Not always
• Sequence alignment is easy with
  sufficiently closely related sequences
• Below a certain level of identity sequence
  alignment may become meaningless
  – twilight zone for aa sequences ~ 30%
• In the twilight zone it is good to make use
  of additional information if possible (e.g.
  structure)
            Consensus Sequences
• Simplest Form:
  A single sequence which represents the most
  common amino acid/base in that position

Y   D   D    G    A     V     -     E     A     L
Y   D   G    G    -     -     -     E     A     L
F   E   G    G    I     L     V     E     A     L
F   D   -    G    I     L     V     Q     A     V
Y   E   G    G    A     V     V     Q     A     L
Y   D   G    G    A/I   V/L   V     E     A     L
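The last row of the example above is the consensus of the five rows before it. A minimal sketch of how such a row could be derived is given below. It assumes that gaps are ignored when counting and that ties are reported with a slash; the function name is made up for the example.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Most common residue per column; ties are joined with '/'.

    Assumption: gap characters do not take part in the vote.
    """
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            # Column contains only gaps.
            result.append(gap)
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so ties appear in row order.
        winners = [residue for residue in counts if counts[residue] == top]
        result.append("/".join(winners))
    return result


rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(rows)))
```

Run on the five sequences shown, this reproduces the consensus row, including the two tied positions A/I and V/L.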
   Multiple Alignment Formats
e.g. Clustal, Phylip, MSF, MEGA etc. etc.
                 Clustal Format
CLUSTAL X (1.81) multiple sequence alignment



CAS1_BOVIN     MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP     MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG       MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN     MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT    MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE     MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT       MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
               *:***: **.*.*:* :    .     :
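A Clustal file like the one above can be read with very little code. The sketch below is a simplified reader, not a full implementation of the format: it assumes the header line starts with "CLUSTAL", that conservation lines start with whitespace, and that sequence names contain no spaces. The function name and the shortened sample are made up for the example.

```python
def parse_clustal(text):
    """Collect sequences from Clustal-formatted text into a dict.

    Simplifying assumptions: header starts with 'CLUSTAL', conservation
    lines start with whitespace, names contain no spaces.
    """
    seqs = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        name, chunk = line.split()[:2]
        # Blocks repeat the names, so append each chunk to what we have.
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


sample = """CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN     MKLLILTCLV
CAS1_PIG       MKLLIFICLA
               *****: **.

CAS1_BOVIN     AVALARPKHP
CAS1_PIG       AVALARPKPP
               ******** *
"""

for name, seq in parse_clustal(sample).items():
    print(name, seq)
```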
             Phylip Format (Interleaved)
 7    100
SOMA_BOVIN   MMAAGPRTSL   LLAFALLCLP   WTQVVGAFPA   MSLSGLFANA   VLRAQHLHQL
SOMA_SHEEP   MMAAGPRTSL   LLAFTLLCLP   WTQVVGAFPA   MSLSGLFANA   VLRAQHLHQL
SOMA_RAT_P   -MAADSQTPW   LLTFSLLCLL   WPQEAGAFPA   MPLSSLFANA   VLRAQHLHQL
SOMA_MOUSE   -MATDSRTSW   LLTVSLLCLL   WPQEASAFPA   MPLSSLFSNA   VLRAQHLHQL
SOMA_RABIT   -MAAGSWTAG   LLAFALLCLP   WPQEASAFPA   MPLSSLFANA   VLRAQHLHQL
SOMA_PIG_P   -MAAGPRTSA   LLAFALLCLP   WTREVGAFPA   MPLSSLFANA   VLRAQHLHQL
SOMA_HUMAN   -MATGSRTSL   LLAFGLLCLP   WLQEGSAFPT   IPLSRLFDNA   MLRAHRLHQL

             AADTFKEFER   TYIPEGQRYS   -IQNTQVAFC   FSETIPAPTG   KNEAQQKSDL
             AADTFKEFER   TYIPEGQRYS   -IQNTQVAFC   FSETIPAPTG   KNEAQQKSDL
             AADTYKEFER   AYIPEGQRYS   -IQNAQAAFC   FSETIPAPTG   KEEAQQRTDM
             AADTYKEFER   AYIPEGQRYS   -IQNAQAAFC   FSETIPAPTG   KEEAQQRTDM
             AADTYKEFER   AYIPEGQRYS   -IQNAQAAFC   FSETIPAPTG   KDEAQQRSDM
             AADTYKEFER   AYIPEGQRYS   -IQNAQAAFC   FSETIPAPTG   KDEAQQRSDV
             AFDTYQEFEE   AYIPKEQKYS   FLQNPQTSLC   FSESIPTPSN   REETQQKSNL
    Phylip Format (Sequential)
3 100
Rat
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCG
TGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCT
TGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGC
TGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
               Mega Format
#mega
TITLE: No title

#Rat        ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit     ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human      ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum    ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken    ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog       ---ATGGGTTTGACAGCACATGATCGT---CAGCT
Progressive Multiple Alignment
• Heuristic
• Perform pairwise alignments
• Align sequences to alignments or
  alignments to existing alignments (profile
  alignments)
• Do the alignments in some sensible order
Progressive versus Simultaneous
• speed versus accuracy
• simultaneous methods are capable of
  working out an 'exact' solution to the
  problem of multiple sequence alignment
  (e.g. NCBI's MSA – user interface QAlign)
           Iterative methods
• Several progressive alignment methods can
  be iterated
  – e.g. Barton-Sternberg, ClustalX
          ClustalX Algorithm
• Perform pairwise alignments and calculate
  distances for all pairs of sequences
• Construct guide tree (dendrogram) joining the
  most similar sequences using Neighbour Joining
• Align sequences, starting at the leaves of the guide
  tree. This involves the pair-wise comparisons as
  well as comparison of single sequence with a
  group of seqs (Profile)
• ClustalX is not optimal
• There are known areas in which ClustalX
  performs badly e.g.
  – errors introduced early cannot be corrected by
    subsequent information
  – alignments of sequences of differing lengths
    cause strange guide trees and unpredictable
    effects
  – edges: ClustalX does not penalise gaps at edges
• There are alternatives to ClustalX available
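The first step of the ClustalX algorithm described above (a distance for every pair of sequences) can be sketched as follows. This is a deliberately reduced illustration: it assumes the pairs are already aligned, uses the plain fraction of differing sites as the distance, and only reports the closest pair. It does not implement Neighbour Joining, and the function names are made up for the example.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(seqs):
    """Distance for every pair of named sequences."""
    return {
        (m, n): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


seqs = {
    "Rat":    "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Mouse":  "ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT",
    "Rabbit": "ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC",
}
table = distance_table(seqs)
closest = min(table, key=table.get)
print(closest, table[closest])
```

A guide tree builder would join the closest pair first and continue from there. Because early joins are never revisited, an error made at this stage stays in the final alignment.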
                 T-Coffee
• JMB 2000
• Also a progressive alignment method
• Designed to solve some of the problems
  with Clustal (in particular Clustal's
  inability to correct errors that appear
  early in the alignment process)
• Can consider global and local pair-wise
  alignments
           Using ClustalX
• Start with sequences in FASTA format (or
  an existing alignment in Clustal format)
• [Do Alignment] on the alignment menu
          ClustalX Parameters
•   Scoring Matrix
•   Gap opening penalty
•   Gap extension penalty
•   Protein gap parameters
•   Additional algorithm parameters
•   Secondary structure penalties
             Score Matrices
• Pairwise matrices and multiple alignment
  matrix series
• PAM (Dayhoff), BLOSUM (Henikoff),
  GONNET (default), user defined
• Transition (A<->G, C<->T)/Transversion
  (purine<->pyrimidine, e.g. A<->C) ratio –
  low for distantly related sequences
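To make the transition/transversion distinction concrete, the sketch below counts both kinds of difference between two aligned nucleotide sequences. Transitions are changes within the purines (A, G) or within the pyrimidines (C, T); transversions swap a purine for a pyrimidine or the reverse. The function name is made up, and positions with gaps or ambiguity codes are simply skipped as an assumption of the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_ts_tv(a, b):
    """Return (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        # Skip identical sites, gaps and ambiguity codes.
        if x == y or x not in BASES or y not in BASES:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv


ts, tv = count_ts_tv("AGCT", "GACA")
print(ts, tv)
```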
                   Gap Penalties
• Linear gap penalties vs. affine gap penalties
   affine: p = o + l·e  (o = gap opening penalty,
   e = gap extension penalty, l = gap length)
• Gap opening
• Gap extension
• Protein specific penalties (on by default)
   – Increase the probability of gaps associated with certain
     residues
   – Increase the chances of gaps in loop regions (> 5
     hydrophilic residues)
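The affine penalty p = o + l·e is easy to check numerically. The sketch below uses made-up penalty values purely for illustration; they are not claimed to be ClustalX defaults, and the function names are hypothetical.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Affine penalty p = o + l*e for a gap of the given length."""
    if length <= 0:
        return 0.0
    return gap_open + length * gap_extend


def linear_gap_penalty(length, per_position):
    """Linear penalty: every gap position costs the same."""
    return length * per_position


# Illustrative values only (assumptions for the example).
one_long_gap = affine_gap_penalty(3, gap_open=10.0, gap_extend=0.5)
three_short_gaps = 3 * affine_gap_penalty(1, gap_open=10.0, gap_extend=0.5)
print(one_long_gap, three_short_gaps)
```

With an affine scheme, one gap of length 3 is much cheaper than three separate gaps of length 1, so the aligner prefers a few longer gaps over many scattered ones.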
          Algorithm parameters
•   Slow-accurate pair-wise alignment
•   Do alignment from guide tree
•   Reset gaps before aligning (iteration)
•   Delay Divergent sequences (%)
         Additional displays
• Column Scores
• Low quality regions
• Exceptional residues
      Multiple Alignment Tips
• Align pairs of sequences using an optimal method
• Progressive alignment programs such as ClustalX
  for multiple alignment
• Choose representative sequences to align carefully
• Choose sequences of comparable lengths
• Progressive alignment programs may be combined
• Review alignment by eye and edit
• If you have a choice, align amino acid sequences
  rather than nucleotides
   Alignment of coding regions
• Nucleotide sequences are much harder to align
  accurately than proteins
• Protein coding sequences can be aligned using the
  protein sequences
   – e.g. BioEdit: toggle translation to amino acids, call
     ClustalW to align, edit the alignment by hand, toggle
     back to nucleotides
• In-frame nucleotide alignments can be used, e.g.
  to determine non-synonymous and synonymous
  distances separately
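The idea of aligning the proteins and then returning to nucleotides can be sketched as threading the codons of the unaligned coding sequence onto the aligned protein. This is a minimal illustration with a hypothetical function name; it assumes the coding sequence is in frame, has no stop codon at the end, and matches the protein residue for residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build an in-frame nucleotide alignment row from an aligned protein.

    Each residue takes the next codon from the coding sequence;
    each protein gap becomes three nucleotide gaps.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * residues:
        raise ValueError("coding sequence length does not match the protein")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out = []
    k = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)


print(thread_codons("M-VH", "ATGGTGCAC"))
```

Because gaps are always inserted three nucleotides at a time, the reading frame is preserved, which is what makes separate synonymous and non-synonymous distance estimates possible.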
Multiple Alignments and Phylogenetic Trees
 – You can make a more accurate multiple
   sequence alignment if you know the tree
   already
 – A phylogenetic tree is only as good as the
   alignment from which it was produced
 – The process of constructing a multiple
   alignment (unlike pair-wise) needs to take
   account of phylogenetic relationships
    Editing a multiple sequence
             alignment
• It is NOT fraud to edit a multiple sequence
  alignment
• Incorporate additional knowledge if
  possible
• Alignment editors help to keep the data
  organised and help to prevent unwanted
  mistakes
          Alignment Editors
• e.g. GDE, BioEdit, Seaview, Jalview etc.
• Some alignment editors have begun to
  function as sequence analysis platforms
  (e.g. tools on BioEdit, GDE)
• Construct sub-sequences (GDE, Seaview)
• Annotate sequences (Seaview)
Aligning weakly similar
       sequences
   Sequence contains conserved
             regions
• e.g. DIALIGN (Morgenstern, Dress, Werner)
   – re-aligns regions between conserved blocks
   http://bibiserv.techfak.uni-bielefeld.de/
   useful if the sequences contain consistent conserved blocks
• Block Maker – searches for conserved words that
  may be inconsistent http://blocks.fhcrc.org/
           Profile Alignment
                               Gribskov et al. 1987
• Position specific scores
• Allows addition of extra sequence(s) to an
  alignment
• Allows alignment of alignments
• Gaps introduced as whole columns in the separate
  alignments
• Optimal alignment in time O(a^2 l^2)
  a = alphabet size, l = sequence length
• Information about the degree of conservation of
  sequence positions is included
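A very reduced picture of position-specific scoring is sketched below: each alignment column becomes a table of residue frequencies, and a candidate sequence is scored by adding up the frequency of its residue at each position. This is an assumption-laden simplification and not the Gribskov et al. scheme, which combines the column composition with a substitution matrix and position-specific gap penalties. The function names are made up for the example.

```python
from collections import Counter


def build_profile(rows):
    """One dict of symbol frequencies per alignment column (gaps included)."""
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


def score_against_profile(seq, profile):
    """Sum of column frequencies for an ungapped sequence of profile length."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile lengths differ")
    return sum(col.get(sym, 0.0) for sym, col in zip(seq, profile))


rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
profile = build_profile(rows)
print(score_against_profile("YDGGAVVEAL", profile))
print(score_against_profile("FEDGILVQAV", profile))
```

A sequence built from the most common residues scores higher than one built from the minority residues, and fully conserved columns contribute the most. That is the sense in which the degree of conservation is built into a profile.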
  Good reasons to use profile
         alignments
– Adding a new sequence to an existing multiple
  alignment that you want to keep fixed
  (align sequence to profile)
– Searching a database for new members of your protein
  family
  (pfsearch)
– Searching a database of profiles to find out which one
  your sequence belongs to
  (pfscan)
– Combining two multiple sequence alignments
  (profile to profile)
        Profile Alignment Using
                ClustalX
•   Profile Alignment Mode
•   Align sequence to profile
•   Align profile 1 to profile 2
•   Secondary structure parameters
    Profile searching using PSI-
               BLAST
• Position Specific Iterative
• Perform search – construct profile –
  perform search
• Convergence (hopefully…)
• Increased sensitivity for distantly related
  sequences
• Available on-line (NCBI)
 Databases of Aligned Sequences
• Hovergen
  http://pbil.univ-lyon1.fr/databases/hovergen.html
  (vertebrate alignments)
• Pfam http://www.sanger.ac.uk/Software/Pfam/
  (protein domain alignments and profile HMMs)
• BLOCKS http://blocks.fhcrc.org/
• Ribosomal Database Project
  http://rdp.cme.msu.edu/html/ alignments and trees
  derived from rRNA sequences
• Interpro – combines information from other
  sources
• Many more…
Probabilistic Models of Sequence
            Alignment
• Hidden Markov Models
   – sequence of states and associated symbol probabilities
• Produces a probabilistic model of a sequence
  alignment
• Align a sequence to a Profile Hidden Markov
  Model
   – Algorithms exist to find the most probable path
     through the model (e.g. the Viterbi algorithm)
Markov Chain: A chain of things. The
 probability of the next thing depends only
 on the current thing

Hidden Markov Model: A sequence of states
 which form a Markov Chain. The states are
 not observable. The observable characters
 have “emission” probabilities which depend
 on the current state.
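The definitions above can be made concrete with a toy model. The sketch below assumes a made-up two-state HMM (an AT-rich state and a GC-rich state, with invented probabilities) and uses the Viterbi algorithm to recover the most probable sequence of hidden states for an observed string. It is not a profile HMM, only an illustration of states, transitions and emission probabilities.

```python
import math

# Toy model: all names and numbers are assumptions for the example.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state to arrive from.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


print(viterbi("AATTGGCC", STATES, START, TRANS, EMIT))
```

The observed characters are visible, but the state that emitted each one is not; the algorithm infers it from the emission and transition probabilities together.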
Some more recent developments
• The need to align genomes
  – alignment tools required that can align very
    large regions of genomes
  – poses a computational challenge
  – programmes such as DIALIGN can be run in
    parallel on multiprocessor machines
Some more recent developments
• MUSCLE
  – Faster (uses k-mer frequencies to estimate the first
    pair-wise distances)
  – Progressive (repeats the MSA using the more accurate
    Kimura distance between aligned amino acid sequences)
  – Has a third optimisation stage that involves making
    profile alignments of sub-trees and accepting the new
    alignment if it improves the SP score.
• MuSiC - multiple sequence alignment with
  constraints
  – web server that allows a user to enter a set of
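The SP (sum-of-pairs) score mentioned for MUSCLE's refinement stage can be sketched in a few lines. The version below is a simplified illustration with a hypothetical function name: it assumes fixed match, mismatch and gap values, whereas real programs typically use a substitution matrix and affine gap penalties.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: score every pair of rows in every column.

    The scoring values are illustrative assumptions.
    """
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps facing each other are not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


alignment = ["ATTGCGC", "AT-ACGC", "ATA-CGC"]
print(sp_score(alignment))
```

A refinement step of the kind described for MUSCLE would keep a re-aligned version only if a score like this one goes up.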

								