  T-Coffee: A Novel Method for Fast and Accurate
         Multiple Sequence Alignment
 Developed by Cédric Notredame et al., CNRS
    Information Génomique et Structurale
       What is T-Coffee ???
• Tree-based Consistency Objective
  Function for alignment Evaluation.
• A multiple sequence alignment software
  using a progressive approach.
• It was developed as an attempt to improve
  upon ClustalW
       How does it work ???
• Generates a library of pair-wise alignments.
• Combines multiple/pair-wise, global/local
  alignments into a single one.
• The alignments can come from different sources.
• Estimates the level of consistency between them.
• Uses an optimization method to find the MSA
  that best fits the pair-wise alignments in the library.
• The consistency level is an indicator of alignment accuracy.
• Global alignments try to align the full length
  of the sequences.
• Local alignment algorithms provide sets of
  non-overlapping alignments from a comparison
  – These perform well when there are clear blocks of
    common ungapped alignment
• T-Coffee combines the best properties of each
• A simple, flexible and accurate solution using
  a combination of all the information
           Primary Library

• Primary library: pair-wise alignments between
  all sequences
• Two sources: ClustalW (global), Lalign (local)
• Weight Assignment to pairs
• Pooling of libraries
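
The weighting and pooling steps above can be sketched in a few lines of Python. This is a minimal illustration, not T-Coffee's own code: it assumes each aligned residue pair receives the percent identity of the pair-wise alignment it came from, and that pooling sums the weights of pairs found in more than one source. The function names and the `start_a`/`start_b` offsets (for local alignments that begin inside a sequence) are hypothetical.

```python
from collections import defaultdict


def percent_identity(row_a, row_b):
    """Percent identity over columns where both rows hold a residue."""
    paired = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not paired:
        return 0.0
    same = sum(1 for a, b in paired if a == b)
    return 100.0 * same / len(paired)


def alignment_to_library(name_a, row_a, name_b, row_b, start_a=0, start_b=0):
    """Turn one gapped pair-wise alignment into weighted residue pairs.

    Keys are ((name_a, index_a), (name_b, index_b)); every pair gets the
    identity of the whole alignment as its weight (an assumption here).
    """
    weight = percent_identity(row_a, row_b)
    library = {}
    i, j = start_a, start_b
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            library[((name_a, i), (name_b, j))] = weight
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return library


def pool(*libraries):
    """Merge libraries, summing the weights of duplicated residue pairs."""
    pooled = defaultdict(float)
    for library in libraries:
        for pair, weight in library.items():
            pooled[pair] += weight
    return dict(pooled)


# Toy example: one "global" and one "local" alignment of the same two sequences.
global_lib = alignment_to_library("A", "ACGT", "B", "AC-T")
local_lib = alignment_to_library("A", "AC", "B", "AC")
primary = pool(global_lib, local_lib)
```

In this toy case the pair A0/B0 appears in both sources, so its pooled weight is 200, while A3/B2 appears only in the global alignment and keeps a weight of 100.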
          Extended Library

• For higher level of consistency
• Pair-wise alignments are combined
  through intermediate sequence
• Removal of mismatches
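
A minimal sketch of how combining pair-wise alignments through an intermediate sequence might look. It assumes that the extended weight of a residue pair is its direct weight plus, for every third residue aligned to both, the smaller of the two connecting weights. The library layout (a dict keyed by pairs of `(sequence_name, residue_index)` tuples) and the name `extend_library` are illustrative assumptions.

```python
from collections import defaultdict


def extend_library(primary):
    """Add consistency support from intermediate residues to each pair.

    primary maps ((seq, idx), (seq, idx)) to a weight. The result is keyed
    by frozenset({residue_1, residue_2}) so that pair order does not matter.
    """
    # neighbors[z][r] = weight linking residue z to residue r (symmetric)
    neighbors = defaultdict(dict)
    for (r, s), weight in primary.items():
        neighbors[r][s] = weight
        neighbors[s][r] = weight

    extended = defaultdict(float)
    # Start from the direct weights.
    for (r, s), weight in primary.items():
        extended[frozenset((r, s))] += weight

    # Each residue z acts as an intermediate for every pair of its neighbors.
    for z, linked in neighbors.items():
        residues = list(linked)
        for i, r in enumerate(residues):
            for s in residues[i + 1:]:
                if r[0] == s[0]:
                    continue  # both residues are in the same sequence
                extended[frozenset((r, s))] += min(linked[r], linked[s])
    return dict(extended)


# Three sequences A, B, C with one residue each, all mutually aligned.
toy = {
    (("A", 0), ("B", 0)): 88.0,
    (("A", 0), ("C", 0)): 77.0,
    (("C", 0), ("B", 0)): 100.0,
}
extended = extend_library(toy)
```

Here the A/B pair starts at 88 and gains min(77, 100) = 77 through C, giving 165. Pairs supported by many intermediates end up with high weights, which is what drives the later alignment step.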
        Progressive Alignment
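•   Distance matrix computed between all sequence pairs
•   Guide tree built using neighbor-joining
•   Closest two sequences aligned
•   Alignment done by dynamic programming
•   Gaps in the aligned pair are then fixed
•   Next closest pair or sequence added
•   Repeated until all sequences are aligned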
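
The dynamic programming step can be illustrated for a single pair of sequences. The sketch below assumes that the score of matching two residues is their library weight and, for simplicity, that gaps cost nothing; the function name and the `weights` layout are hypothetical, and the guide-tree logic is left out.

```python
def align_with_library(len_a, len_b, weights):
    """Align two sequences by maximising the summed library weights.

    weights maps (i, j) residue-index pairs to a score; missing pairs
    score 0. Gaps carry no penalty in this sketch.
    Returns (best_score, matched_pairs).
    """
    def w(i, j):
        return weights.get((i, j), 0.0)

    # score[i][j] = best score for the first i residues of A and j of B
    score = [[0.0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + w(i - 1, j - 1),  # match residues
                score[i - 1][j],                        # gap in B
                score[i][j - 1],                        # gap in A
            )

    # Trace back from the corner to recover the matched residue pairs.
    pairs = []
    i, j = len_a, len_b
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + w(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[len_a][len_b], pairs


# Two conflicting library entries: (1, 2) and (2, 1) cannot both be kept.
best, matched = align_with_library(
    3, 3, {(0, 0): 100.0, (1, 2): 50.0, (2, 1): 80.0}
)
```

For this input the better-supported pair wins: `best` is 180.0 and `matched` is `[(0, 0), (2, 1)]`.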
                   Related Work
• Related Topics
  – COFFEE: A New Objective Function For Multiple Sequence Alignment.
  – T-Coffee: A novel method for multiple sequence alignments.
  – Using Multiple Alignment Methods to Assess the Quality of Genomic
    Data Analysis, in Bioinformatics and Genomes
  – 3DCoffee: Combining Protein Sequences and Structures within Multiple
    Sequence Alignments.

• Related Tools
  –   M-Coffee
  –   Expresso
  –   Mocca
  –   3D-Coffee
    Download, Installation, Running…
•   Latest Version 5.03
•   Download from www.tcoffee.org
•   Or use Online Server at www.tcoffee.org
•   Runs on Unix / Linux / Mac OS X
•   Runs on Windows through Cygwin
     – Cygwin is a Linux-like environment for Windows.
     – A DLL (cygwin1.dll) acts as a Linux API
       emulation layer providing substantial Linux API
       functionality.
     – A collection of tools provides a Linux look and feel.
 Download, Installation and Run…
• Unix, Mac OS X, Linux
  –   gunzip t_coffee.tar.gz
  –   tar -xvf t_coffee.tar
  –   cd t_coffee
  –   ./install
  –   go into <distrib> folder in which you have input files
  –   ./t_coffee xxx.yyy
• Windows
  – Install Cygwin
  – Follow Linux procedure
         What Can it do ???
• Align nucleic and protein sequences.
• Use structural information for protein
  sequences with a known structure.
• Compare alignments
• Reformat files
• Evaluate alignments using structural information
• Simplest Usage
  – t_coffee xxx.yyy
• Combining Alignments
  – t_coffee -aln=a_cw.msf, a_mus.msf, a_tc.msf –
• Evaluating Alignments
  – t_coffee -infile=xxx.yyy -special_mode=evaluate
  – Creates xxx.score_ascii and xxx.html.
                Color Code
• The color scheme of T-Coffee is an
  indicator of the reliability of the alignment.
• Red bits are the most consistent and
  therefore the most likely to be correctly aligned.
• Blue bits are the least trustworthy.
• Combining Sequences and Structures
  – t_coffee 3d.fasta -special_mode=3dcoffee
  – T-Coffee automatically identifies the target corresponding to
    your sequence, as indicated by an NCBI BLAST search.
  – PDB sequences come from RCSB (Research Collaboratory for
    Structural Bioinformatics).

• Identifying occurrences of a motif
  – Uses special mode Mocca
  – t_coffee -other_pg mocca sample_seq1.fasta
                 Reformatting Utility
• t_coffee -other_pg seq_reformat
• Removing the gaps from an alignment
   – t_coffee -other_pg seq_reformat -in abc.aln -output fasta_seq >
• Changing file formats
   – t_coffee -other_pg seq_reformat -in abc.aln -output msf > abc.msf
• Colouring residues in an Alignment
   – seq_reformat -in=abc.aln -struc_in=cache.seq -struc_in_f number_fasta
     -output=color_html -out=x.html
• Selectively modifying residues
   – seq_reformat -in sample_aln7.aln -struc_in sample_aln7.cache_aln
     -struc_in_f number_aln -action +lower '[1-2]'
   – List of actions: upper, lower, keep, switchcase, remove, convert
            Reformatting Utility
• Extracting Sequences
   – t_coffee -other_pg seq_reformat -in sproteases_small.aln -action +grep
     NAME REMOVE HUMAN -output clustalw
   – t_coffee -other_pg seq_reformat -in sproteases_small.aln -action
     +extract_block cons 100 120 > block1.aln
   – Extracting most informative sequences
   – Identifying and removing outliers.
• Concatenating Alignments
   – t_coffee -other_pg seq_reformat -in block1.aln -in2 block2.aln -action
• Manipulating DNA sequences
   – t_coffee -other_pg seq_reformat -in sproteases_small_dna.fasta -action
     +translate -output fasta_seq
   – T-Coffee works better with proteins
                More Features
• Fetching Structures
  – t_coffee -other_pg extract_from_pdb -infile 1PPGE
• Dealing with Phylogenetic Trees
  – Comparing two phylogenetic trees
  – seq_reformat -in sample_tree2.dnd -in2 sample_tree3.dnd
    -action +tree_cmp -output newick
  – Pruning Phylogenetic Trees
  – seq_reformat -in sample_tree2.dnd -in2 sample_seq8.seq
    -action +tree_prune -output newick
• Aligning Large datasets
  – t_coffee sproteases_large.fasta -special_mode quickaln
  – Faster Aligning mode with reduced accuracy
           More Features
• Changing the Substitution Matrix
• Comparing Two Alternative Alignments
• Changing gap Penalty
       Associated Packages
• M-Coffee
  – Meta Coffee computes multiple sequence
    alignment using various specified and
    installed MSA Packages.
• Expresso
  – Latest mode
  – Finds similar structure to use as template
        Evaluating using iRMSD
• intra-catener Root Mean Square Deviation
   1- Make sure you include two structures whose sequences are so distantly
      related that most of the other sequences are intermediates.
   2- Align your sequences without using the structural information (i.e.
      t_coffee, muscle...)
   3- Evaluate your alignment with irmsd (see later in this section). The score
      will be S1
   4- Realign your sequences, but this time using structural information
   5- Measure the score of that alignment (Score=S2)
• If S1 and S2 are nearly equal, your distantly
  related structures were well aligned
• Expresso claims to have the best results in this test
    Aligning Reference Sequences

•   Download Balibase Reference Sequence
•   Unzip Files
•   Run t_coffee on the fasta files
•   C program for aligning files
•   System Calls to t_coffee
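
The batch run described above used a C program with system calls. As an illustration only, the same idea can be sketched in Python. It assumes the unzipped Balibase inputs end in `.tfa` and that `t_coffee` is on the PATH; `build_commands` and `run_all` are hypothetical helper names. The only invocation used is the simplest form shown earlier, `t_coffee <file>`.

```python
import subprocess
from pathlib import Path


def build_commands(fasta_dir, pattern="*.tfa", program="t_coffee"):
    """Return one 't_coffee <file>' command per matching input file."""
    files = sorted(Path(fasta_dir).glob(pattern))
    return [[program, str(path)] for path in files]


def run_all(commands):
    """Run each command in turn; stop at the first one that fails."""
    for command in commands:
        subprocess.run(command, check=True)
```

A run over a directory of reference inputs would then be `run_all(build_commands("path/to/inputs"))`, where the directory name is a placeholder. Building the command list separately makes it easy to inspect what will be executed before starting a long run.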
              Scoring Alignment
• Use the bali_score C program that comes with the
  Balibase reference package
• Install and set the path for the expat XML parser
• Build bali_score using its makefile
• Compare
   – bali_score ref_aln test_aln
• Use C program with system calls to compare
   – t_coffee aligned files
   – balibase reference aligned files
   – muscle aligned files (for comparison with t_coffee)
             Scoring Alignment
• Aligned BB50001.tfa (FASTA input file)
  – Using T-Coffee: tcBB50001.msf (run time very long, in minutes)
  – Using Muscle: musBB50001.msf (run time very short, in seconds)
• Compared the alignments with the Balibase
  reference alignment
  For T-Coffee
  SP (Sum of Pairs): 0.736
  TC (Total Column): 0.240

  For Muscle
  SP (Sum of Pairs): 0.757
  TC (Total Column): 0.400
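
To make the two scores concrete, here is a simplified sketch of how SP and TC can be computed. It is not the bali_score program: it assumes SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns whose residues all land in a single test column. It scores every column, and it expects both alignments to contain the same sequences under the same names. All function names are hypothetical.

```python
from itertools import combinations


def columns(aln):
    """For each column, the set of (name, residue_index) it contains."""
    names = sorted(aln)
    counters = {name: 0 for name in names}
    cols = []
    for c in range(len(aln[names[0]])):
        col = set()
        for name in names:
            if aln[name][c] != "-":
                col.add((name, counters[name]))
                counters[name] += 1
        cols.append(frozenset(col))
    return cols


def aligned_pairs(cols):
    """All unordered residue pairs that share a column."""
    return {frozenset(p) for col in cols for p in combinations(col, 2)}


def sp_score(ref, test):
    """Fraction of reference residue pairs reproduced in the test alignment."""
    ref_pairs = aligned_pairs(columns(ref))
    test_pairs = aligned_pairs(columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def tc_score(ref, test):
    """Fraction of reference columns kept together in the test alignment."""
    ref_cols = [col for col in columns(ref) if len(col) > 1]
    where = {res: idx for idx, col in enumerate(columns(test)) for res in col}
    hits = sum(1 for col in ref_cols if len({where[r] for r in col}) == 1)
    return hits / len(ref_cols) if ref_cols else 0.0


reference = {"s1": "AC-GT", "s2": "ACAGT", "s3": "A-AGT"}
candidate = {"s1": "ACG-T", "s2": "ACAGT", "s3": "A-AGT"}
sp = sp_score(reference, candidate)
tc = tc_score(reference, candidate)
```

In this toy case one misplaced residue breaks two of the eleven reference pairs and one of the five reference columns, so `sp` is 9/11 and `tc` is 0.8. TC is the stricter measure, which is consistent with the TC values above being lower than the SP values.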
       Conclusions (so far)
• T-Coffee is slow
• No considerable increase in accuracy over Muscle in this test
• Newer modes of T-Coffee however seem
  more promising
• Faster MSA packages like Muscle should
  be considered.
          Web Resources

• Website http://www.tcoffee.org/
• Link for the Journal and source code
• General References www.wikipedia.org
Questions ???