BioPerl MUMmer by rogerholland

                    BIOPERL: MUMMER
                    Jason Switzer
                    Joshua Wu
                    Aimee Seufer
Agenda

 Background
   BioPerl
   Mummer
 Algorithm/Data Structure
   Suffix Trees (implicit vs explicit)
   Examples
   Limitations for BioPerl
 Bio::AlignIO::mummer
What is BioPerl

• A project (developed by volunteer engineers)
  – Goal: collect computational methods routinely
    used in bioinformatics
• Tools for computational molecular biology:
  – bioinformatics toolkit for format conversion,
    report processing, data manipulation, sequence
    analysis, batch processing and more
• Open source
  – http://bioperl.org/
• Collection of modules (1450)
Example
Example
What is MUMmer

 MUMmer: maximal unique [exact] match


 Definition: it is a suffix tree algorithm
  designed to find maximal exact matches of
  some minimum length between two input
  sequences.
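
A minimal sketch of what "maximal unique match" means, written as a brute-force search rather than with a suffix tree. This is not MUMmer's implementation: the function names, the 0-based positions, and the quadratic scan are assumptions made only to illustrate the definition (a match that cannot be extended in either direction and occurs exactly once in each sequence).

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_unique_matches(ref, qry, min_len=3):
    """Return (ref_start, qry_start, length) for every maximal unique match.

    Brute force for illustration only; positions are 0-based.
    """
    mums = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(ref) and j + k < len(qry) and ref[i + k] == qry[j + k]:
                k += 1
            if k < min_len:
                continue
            match = ref[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(ref, match) == 1 and count_occurrences(qry, match) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(maximal_unique_matches("ccgattacagg", "ttgattacatt"))  # [(2, 2, 7)]
```

In the usage line, "gattaca" is reported, while the shorter match "att" is dropped because it occurs twice in the second sequence.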
  Achievements
• Suffix tree: a very efficient data structure
  – constructed and searched in linear time
  – ideal for large scale pattern matching
  – Memory usage depends only on the size of the reference sequence
• Finds maximal unique matches (MUMs) between the input sequences


• How efficient? It finds all maximal exact matches of 20 base
  pairs or longer between two ~5 million base pair bacterial
  genomes in 20 seconds, using 90 MB of RAM, on a
  typical 1.7 GHz Linux desktop computer
         Why Suffix Trees?

 "Suffix trees are widely used in the computer
  field... Recent improvements in the method
  have cut the memory requirement to 17 bytes
  per letter, which brings the method to the
  verge of practicality [for bioinformatics
  applications]" -- Nat Goodman (Genome
  Technology).
              Introduction

 Any string of length m can be decomposed
  into m suffixes, and these suffixes can be
  stored in a suffix tree.

 Setup (construction) time: O(m), where m is the length of the string

 Search time: O(n), where n is the length of the pattern
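
A small sketch of the idea, assuming an uncompressed suffix trie built from nested dictionaries (the helper names are hypothetical). Searching walks one node per pattern character, so it takes O(n) steps as stated above. Note that this naive construction takes O(m^2) time and space; the O(m) setup time requires a compressed suffix tree built with a linear-time algorithm such as Ukkonen's, which is not shown here.

```python
def suffixes(text):
    """List the m suffixes of a string of length m, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Insert every suffix into a trie of nested dicts (naive, O(m^2))."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if pattern occurs in the indexed text; O(len(pattern))."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("cagct")
    print(suffixes("cagct"))
    print(contains(trie, "agc"), contains(trie, "ga"))
```

Every substring of the text is a prefix of some suffix, which is why a single walk from the root answers the query.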
 Sample input:            Homo sapiens

 cagctcctgagactgctggcatgaaggggagccgtgccctcct
  gctggtggccctcaccctgttctgcatctgccggatggccacag
  gggaggacaacgatgagtttttcatggacttcctgcaaacacta
  ctggtggggaccccagaggagctctatgaggggaccttgggc
  aagtacaatgtcaacgaagatgccaaggcagcaatgactgaa
  ctcaagtcctgcagagatggcctgcagccaatgcacaaggcgg
  agctggtcaagctgctggtgcaagtgctgggcagtcaggacg
  gtgcctaagtggacctcagacatggctcagccataggacctgcc
  acacaagcagccgtggacacaacgcccactaccacctcccacat
  ggaaatgtatcctcaaaccgtttaatcaataa
Sample result
   Sample input 2: plants

 EARPIVVGPPPPLSGGLPGTENSDQARDGTL
 PYTKDRFYLQPLPPTEAAQRAKVSASEILNVK
 QFIDRKAWPSLQNDLRLRASYLRYDLKTVISA
 KPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPT
 EAEKYYGQTVSNINEVLAKLG
Sample output:
Comparisons: Homo sapiens
Chicken
    Sample Input: Chicken

 RVKRVWPLVIRTVIAGYNLYRAIKKK
Chicken
               Investigation

 Explicit suffix trees require more space than
  implicit suffix trees in real data.

 Implicit trees should be used when storage
  space is limited
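
A rough illustration of the space difference, under an assumption the slides do not spell out: "explicit" is taken to mean the tree built after appending a terminator character ("$") so that every suffix ends at its own leaf, and "implicit" the tree built without it. The sketch uses an uncompressed trie and hypothetical helper names, so the node counts are only indicative of the trend, not of a real suffix tree's memory use.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count all nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())


def count_leaves(node):
    """Count nodes that have no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


if __name__ == "__main__":
    implicit = build_suffix_trie("abab")    # no terminator
    explicit = build_suffix_trie("abab$")   # terminator appended
    print(count_leaves(implicit), count_nodes(implicit))  # 2 8
    print(count_leaves(explicit), count_nodes(explicit))  # 5 13
```

Without the terminator, the suffixes "ab" and "b" end inside the paths of longer suffixes and get no leaf of their own; with it, each of the five suffixes of "abab$" has its own leaf, at the cost of extra nodes.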
Limitations on development

 No unified output format for all tools
   Tools available: mummer, repeat-match,
    exact-tandems, gaps, mgaps, nucmer, promer,
    run-mummer1, run-mummer3, show-aligns,
    show-coords, show-snps, show-tiling
 Variable command line options
 Poor documentation
 Not user-friendly
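
To make the output-format problem concrete, here is a hedged sketch of a parser for one tool only. It assumes the plain match list printed by mummer looks like a ">" header line naming the query (optionally followed by the word "Reverse") and then rows whose last three columns are reference start, query start, and match length. That layout, the Match type, and the function name are assumptions for illustration; other tools in the package print different formats and would need their own parsers.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    query_id: str
    reverse: bool
    ref_start: int
    query_start: int
    length: int


def parse_mummer_matches(text: str) -> List[Match]:
    """Parse an assumed mummer-style match list into Match records."""
    matches = []
    query_id, reverse = "", False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: "> query_id" or "> query_id Reverse" (assumed layout).
            fields = line[1:].split()
            query_id = fields[0] if fields else ""
            reverse = "Reverse" in fields[1:]
            continue
        # Data row: the last three columns are taken as integers, so an
        # optional leading reference-name column is ignored.
        ref_start, query_start, length = (int(c) for c in line.split()[-3:])
        matches.append(Match(query_id, reverse, ref_start, query_start, length))
    return matches


if __name__ == "__main__":
    sample = "> query1\n  12  3  25\n 100 40 31\n> query1 Reverse\n 7 9 20\n"
    for match in parse_mummer_matches(sample):
        print(match)
```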
            Difficulties With BioPerl

Extensive Framework
 • everything from IO utilities to BLAST
Aging Codebase
 • lots of copy-and-paste
 • old coding techniques data
Few Developers
 • few core developers maintaining the toolkit
 • few people understand the
Uncommon Perl Practices
 • much derision over practices such as tied handles and
   AUTOLOAD
           Common BioPerl Objects

Bio::SeqIO
 • reads/writes sequence files (e.g. GenBank)
 • fully symmetric converter (between various formats)
 • lots of documentation
Bio::Seq
 • stores various kinds of sequences (e.g. Bio::Seq::RichSeq)
 • used primarily with Bio::SeqIO
Bio::AlignIO
 • reads/writes alignment files (e.g. FASTA)
 • not fully symmetric (between formats)
 • significantly less documentation
Bio::LocatableSeq
 • stores locatable sequence data (within another sequence)
 • used primarily with Bio::AlignIO
Example - Input Data
Example - Code
THANK YOU FOR LISTENING

								