									   More “What Perl can do”

With an introduction to BioPerl

            Ian Donaldson
     Biotechnology Centre of Oslo
              MBV 3070
Much of the material in this lecture is from the
“Perl” lecture and lab developed for
the Canadian Bioinformatics Workshops by

Will Hsiao
Sohrab Shah
Sanja Rogic

And released under the Creative Commons license
http://creativecommons.org/licenses/by-sa/2.5/
     More “What Perl can do”
• So far, we’ve had a very brief introduction
  to Perl
• Next, we want to go a little deeper into

•   Use of “strict”
•   Perl regular expressions
•   Modules
•   An introduction to object-oriented Perl and
•   BioPerl
strict
       Effects of “use strict”
• Requires you to declare variables
     Correct             Incorrect
     my $DNA;            $DNA = "ATCG";
     $DNA = "ATCG";
     or
     my $DNA = "ATCG";

• Catches possible typos in variable names
  (reported as a compile-time error)
     Accepted                 Rejected
     my $DNA = "ATCG";        my $DNA = "ATCG";
     $DNA =~ tr/ATCG/TAGC/;   $DAN =~ tr/ATCG/TAGC/;
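A minimal sketch of what this looks like in practice. The misspelled variable is placed inside a string `eval` only so that the script can show the error message and keep running; in a normal program `use strict` would simply stop compilation. The helper name `complement` is made up for this illustration.

```perl
use strict;
use warnings;

# complement() is a hypothetical helper, not part of the lecture code
sub complement {
    my ($dna) = @_;
    (my $copy = $dna) =~ tr/ATCG/TAGC/;   # swap each base for its partner
    return $copy;
}

print complement("ATCG"), "\n";

# The same idea with a typo ($DAN instead of $DNA), compiled at run time
# so that the failure can be caught and printed instead of ending the script
my $ok = eval q{
    my $DNA = "ATCG";
    $DAN =~ tr/ATCG/TAGC/;
    1;
};
my $strict_error = $ok ? '' : $@;
print "strict refused to compile it:\n$strict_error" if $strict_error;
```

Without `use strict`, Perl would quietly create a new, empty `$DAN` and the mistake would go unnoticed.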
       Why bother with “use strict”?

• Enforces some good programming rules
• Helps to prevent silly errors
• Makes troubleshooting your program
  easier
• Becomes essential as your code becomes
  longer
• We will use strict in all the code you see
  today and in your assignment
• Bottom line: ALWAYS use strict
                                     Exercise 12
Write a program that has one function.

Use a variable named “$some_variable” in this
function and in the main body of the program.

Prove that you can alter the value of
$some_variable in the function without
changing the value of $some_variable in the
main body of the program.

Try it yourself (15 minutes) then check the
answer at the end of this lecture.
regular expressions
          What is a Regular Expression?
    • REGEX provides pattern matching ability
    • Tells you whether a string contains a pattern or
      not (Note: it’s a yes or no question!)

             Regular Expression looking for “dog”

      “My dog ate my homework”             “Yes” or “True”
      “Yesterday I saw a big black dog”    “Yes” or “True”
      “I have a golden retriever”          “No” or “False”
      “Dog! Human’s best friend”           “No” since REGEX is case sensitive
Regular expressions are “regular”
 Look at these yeast open reading frame (ORF) names.

 YDR0001W
 YDR4567C
 YAL0045W
 YBL0008C

 While they are all different, they all follow a pattern
 (or regular expression).
 1. a ‘Y’ means yeast
 2. a letter between A and L represents a chromosome
 3. an ‘R’ or ‘L’ refers to an arm of the chromosome
 4. a four-digit number refers to an open reading frame
 5. a ‘W’ or a ‘C’ refers to either the Watson or Crick strand

 You can write a regular expression to recognize ALL yeast
   open reading frame names.
             Perl REGEX example
my $text = "The dog ate my homework";
if ($text =~ m/dog/){
  print "The text contains a dog\n";
}
• =~ is the binding operator and m// is the match operator. Together
  they ask: “does the string on the left contain the pattern on the right?”
• /dog/ is my pattern or regular expression
• The matching operation results in a true or false
  answer
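As a small illustration, the same test can be wrapped in a function and applied to several sentences. This is only a sketch; `contains_dog` is a hypothetical name chosen for this example.

```perl
use strict;
use warnings;

# contains_dog() is a hypothetical helper: returns 1 if the text
# contains the pattern "dog" (case sensitive), otherwise 0
sub contains_dog {
    my ($text) = @_;
    return $text =~ m/dog/ ? 1 : 0;
}

my @sentences = (
    "My dog ate my homework",
    "Yesterday I saw a big black dog",
    "I have a golden retriever",
    "Dog! Human's best friend",
);

for my $sentence (@sentences) {
    my $answer = contains_dog($sentence) ? "yes" : "no";
    print "$answer: $sentence\n";
}
```

The last sentence gives "no" because the pattern is written with a lower-case "d".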
       Regular Expressions in Perl
• A pattern that matches only one string is not very
  useful!
• We need symbols to represent classes of characters
• For example, say you wanted to recognize ‘Dog’ or
  ‘dog’ as being instances of the same thing

• REGEX is its own little language inside Perl
   – Has different syntax and symbols!
   – Symbols which you have used in Perl, such as $ . { }
     [ ], have totally different meanings in REGEX
         REGEX Metacharacters


• Metacharacters allow a pattern to match
  different strings
   – Wildcards are examples of metacharacters
   – /.og/ will match “dog”, “log”, “tog”, “ og”, etc.

   – So . means “any character”
   – Perl REGEX has much more powerful
     metacharacters used to represent classes of
     characters
             Types of Metacharacters
.             matches any one character (including a space)
              except "\n"

[]            denotes a selection of characters and matches
              ONE of the characters in the selection.
              What does [ATCG] match?

\t, \s, \n    match a tab, any whitespace character (space,
              tab, newline, ...) and a newline respectively

\w            matches any character in [a-zA-Z0-9_]

\d            matches [0-9]

\D            matches anything except [0-9]
     Using metacharacters to build a
           regular expression
               YBL3456W

  /Y[A-L][RL]\d\d\d\d[WC]/

Is this a good pattern for a yeast ORF name?
What else does it match?
What if the name only has 3 digits?
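One way to explore these questions is to run the pattern against a handful of names. Try to predict each result before running the sketch below. The function name `looks_like_orf` and the test names other than YBL3456W and YDR0001W are made up for this illustration.

```perl
use strict;
use warnings;

# looks_like_orf() is a hypothetical helper; the pattern is the one
# shown on the slide, used exactly as written (no anchors)
sub looks_like_orf {
    my ($name) = @_;
    return $name =~ /Y[A-L][RL]\d\d\d\d[WC]/ ? 1 : 0;
}

for my $name (qw(YDR0001W YBL3456W YDR001W ydr0001w xxYAL0045Wxx)) {
    my $verdict = looks_like_orf($name) ? "match" : "no match";
    printf "%-14s %s\n", $name, $verdict;
}
```

The three-digit name and the lower-case name are rejected, while the name with extra characters around it is accepted, because nothing in the pattern says where the match has to start or stop.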
        REGEX Quantifiers
• What if you want to match a character
  more than once?

• What if you want to match an mRNA with
  a poly-A tail of 5 to 12 A’s?


“ATG……AAAAAAAAAAA”
           REGEX Quantifiers
            “ATG……AAAAAAAAAAA”

      /ATG[ATCG]+A{5,12}/

• + matches one or more copies of the preceding
  character or character class
• * matches zero or more copies of the preceding
  character or character class
• ? matches zero or one copy of the preceding
  character or character class
• {min,max} matches a number of copies within the
  specified range
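The four quantifiers can be compared side by side on a few short test strings. This is a sketch only: the sample strings (a C, some number of A's, then a C) and the helper `count_matching` are invented for this example.

```perl
use strict;
use warnings;

# count_matching() is a hypothetical helper for this sketch:
# it returns how many of the given strings match the pattern
sub count_matching {
    my ($pattern, @strings) = @_;
    return scalar(grep { $_ =~ $pattern } @strings);
}

# C, then 0, 1, 4, 5, 12 or 13 A's, then C
my @samples;
for my $n (0, 1, 4, 5, 12, 13) {
    push @samples, "C" . ("A" x $n) . "C";
}

my @patterns = (
    [ 'CA+C',      qr/CA+C/      ],   # one or more A's
    [ 'CA*C',      qr/CA*C/      ],   # zero or more A's
    [ 'CA?C',      qr/CA?C/      ],   # zero or one A
    [ 'CA{5,12}C', qr/CA{5,12}C/ ],   # between 5 and 12 A's
);

for my $entry (@patterns) {
    my ($label, $regex) = @$entry;
    printf "%-10s matches %d of %d samples\n",
        $label, count_matching($regex, @samples), scalar @samples;
}
```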
              REGEX Anchors

• The previous pattern is not strictly correct
  because:
   – It’ll match a string that doesn’t start with ATG
   – It’ll match a string that doesn’t end with poly
     A’s

• Anchors tell REGEX that a pattern must occur at
  the beginning or at the end of a string
         REGEX Anchors
• ^ anchors the pattern to the
  start of a string
• $ anchors the pattern to the end
  of a string


 /^ATG[ATCG]+A{5,12}$/
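The effect of the anchors can be checked by running the pattern with and without them on the same strings. A minimal sketch, with made-up sequences and a hypothetical helper `is_match`:

```perl
use strict;
use warnings;

# is_match() is a hypothetical helper: 1 if the string matches, else 0
sub is_match {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $unanchored = qr/ATG[ATCG]+A{5,12}/;
my $anchored   = qr/^ATG[ATCG]+A{5,12}$/;

my @sequences = (
    "ATGGCC"   . ("A" x 6),           # starts with ATG, ends with poly-A
    "CCATGGCC" . ("A" x 6),           # extra bases before ATG
    "ATGGCC"   . ("A" x 6) . "GG",    # extra bases after the poly-A tail
);

for my $sequence (@sequences) {
    printf "%-18s unanchored: %d  anchored: %d\n",
        $sequence,
        is_match($unanchored, $sequence),
        is_match($anchored,   $sequence);
}
```

Only the first sequence passes the anchored pattern; all three pass the unanchored one.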
                  REGEX is greedy!


• The revised pattern is still incorrect because
   – It’ll match a string that has more than 12 A’s at the end
• quantifiers will try to match as many copies of a sub-pattern
  as possible!
   /^ATG[ATCG]+A{5,12}$/

   “ATGGCCCGGCCTTTCCCAAAAAAAAAAAA”
                Curb that Greed!
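• ? after a quantifier prevents REGEX from being
  greedy: [ATCG]+? matches as few characters as
  possible, leaving as many A’s as it can (up to 12)
  for A{5,12}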


         /^ATG[ATCG]+?A{5,12}$/

   “ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”

• note this is the second use of the question mark -
  what is the other use of ? in REGEX?
            REGEX Capture


• What if you want to keep the part of a
  string that matches to your pattern?
• Use ( ) “memory parentheses”

  “ATGGCCCGGCCTTTCCGAAAAAAAAAAAA”


  /^ATG([ATCG]+?)A{5,12}$/
            REGEX Capture

 /^ATG([ATCG]+?)(A{5,12})$/
              $1            $2

• What’s inside the first ( ) is assigned to $1
• What’s inside the second ( ) is assigned to $2 and so
  on
• So $2 eq "AAAAAAAAAAAA"
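Capturing makes it easy to see what greedy and non-greedy matching actually do. The sketch below captures the body and the tail with both versions of the pattern. Note that the non-greedy version changes how the string is divided between $1 and $2, but a string with more than 12 trailing A's still matches, because the extra A's can be absorbed by [ATCG]+?. The third pattern is an assumption of this example, not something from the lecture: it is one possible way to reject longer tails, by requiring a non-A base directly before the tail. The helper name `split_mrna` is also made up.

```perl
use strict;
use warnings;

# split_mrna() is a hypothetical helper: returns (body, tail) on a match
# and an empty list otherwise
sub split_mrna {
    my ($pattern, $mrna) = @_;
    return $mrna =~ $pattern ? ($1, $2) : ();
}

my $greedy      = qr/^ATG([ATCG]+)(A{5,12})$/;
my $lazy        = qr/^ATG([ATCG]+?)(A{5,12})$/;
# Assumption for illustration: force a non-A base just before the tail
my $capped_tail = qr/^ATG([ATCG]*[TCG])(A{5,12})$/;

my $mrna     = "ATGGCCCGGCCTTTCCG" . ("A" x 12);
my $too_long = "ATGGCCCGGCCTTTCCG" . ("A" x 15);

for my $case ([ greedy => $greedy ], [ lazy => $lazy ], [ capped => $capped_tail ]) {
    my ($label, $pattern) = @$case;
    for my $sequence ($mrna, $too_long) {
        my ($body, $tail) = split_mrna($pattern, $sequence);
        if (defined $tail) {
            printf "%-7s body=%-20s tail length=%d\n", $label, $body, length $tail;
        }
        else {
            printf "%-7s no match for a tail of %d A's\n", $label, length($sequence) - 17;
        }
    }
}
```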
               REGEX Modifiers
• Modifiers come after a pattern and affect
  the entire pattern
• You have seen //g already, which does global
  matching (/T/g) and global
  replacement (s/T/U/g)
• Other useful modifiers:

//i      make pattern case insensitive
//s      let . match newline
//m      let ^ and $ (anchors) match next to embedded
         newlines
s///e    evaluate the replacement as a Perl
         expression
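A short sketch showing each modifier at work. The sequences and variable names are invented for this example.

```perl
use strict;
use warnings;

my $dna = "ATGTTTtag";

# //i : ignore case, so "tag" counts as TAG
my $has_stop = $dna =~ /TAG/i ? 1 : 0;

# //g in list context: collect every (upper-case) T
my @t_bases = $dna =~ /T/g;
my $t_count = scalar @t_bases;

# s///g : replace every T with U
(my $rna = $dna) =~ s/T/U/g;

# s///e : the replacement is evaluated as Perl code;
# here each run of T's is replaced by its length
(my $summary = $dna) =~ s/(T+)/length($1)/ge;

# //m : ^ and $ also match next to embedded newlines
my $fasta   = ">seq1\nATGC\n>seq2\nGGCC\n";
my @headers = $fasta =~ /^>(\w+)$/mg;

# //s : . is allowed to match a newline
my $spans_lines = "ATG\nTAA" =~ /ATG.TAA/s ? 1 : 0;

print "stop codon found: $has_stop\n";
print "number of T's: $t_count\n";
print "RNA: $rna\n";
print "summary: $summary\n";
print "headers: @headers\n";
print "match across a line break: $spans_lines\n";
```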
             REGEX Summary
• REGEX is its own little language!!!
• REGEX is one of the main strengths
  of Perl


•   To learn more:
•   Learning Perl (3rd ed.) Chapters 7, 8, 9
•   Programming Perl (3rd ed.) Chapter 5
•   Mastering Regular Expressions (2nd ed.)
•   http://www.perl.com/doc/manual/html/pod/perlre.html
•   A good cheat sheet is:
    http://www.biotek.uio.no/EMBNET/guides/guideRegExp.pdf
                                            Exercise 13
In a text file, write out three strings that match
the following regular expression

/^ATG?C*[ATCG]+?A{3,10}$/

Write a program that reads each string from the text
file and checks your answers.


Try it yourself (30 min) then look at the answer at
the end of this lecture.
modules
             What are Modules?


• A “logical” collection of functions
• Using modules has the same advantage as using
  functions; i.e., it simplifies code (makes it modular)
  and facilitates code reuse
• Each collection (or module) has its own “name
  space”

  Name space:
  a table containing the names of
  variables and functions used in your
  code
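To make the idea of a name space concrete, here is a sketch with two made-up packages, `DnaTools` and `RnaTools`, that each define a function called `complement`. Both are kept in one file for brevity; normally each package would live in its own .pm file.

```perl
use strict;
use warnings;

# DnaTools and RnaTools are hypothetical modules for this illustration
package DnaTools;

sub complement {
    my ($seq) = @_;
    (my $copy = $seq) =~ tr/ATCG/TAGC/;
    return $copy;
}

package RnaTools;

sub complement {
    my ($seq) = @_;
    (my $copy = $seq) =~ tr/AUCG/UAGC/;
    return $copy;
}

package main;

# The same function name, looked up in two different name spaces
print DnaTools::complement("ATCG"), "\n";
print RnaTools::complement("AUCG"), "\n";
```

Because each package has its own name space, the two functions do not overwrite each other.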
          Why Use Modules?

• Modules allow you to use others’ code to
  extend the functionality of your program.

• There are a lot of Perl modules.
 Finding out what modules you
         already have

In Perl, each module is a file stored in some
directory in your system.

The system that this class is using stores Perl
modules (like CGI.pm) in one of two directories:

C:\bin\Perl\lib
C:\bin\Perl\site\lib
    Finding out what modules you
            already have
• To find out where modules are installed, type

      perl -V

  at the command prompt


• To find out what is installed, type

      perldoc perllocal

  at the command prompt.
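The same information is available from inside a program: Perl keeps the list of directories it searches for modules in the built-in array @INC. The sketch below prints that list and checks whether a module file can be found in it. The helper name `module_path` is made up for this example.

```perl
use strict;
use warnings;
use File::Spec;

# module_path() is a hypothetical helper: returns the file that
# "use Some::Module" would load, or nothing if it is not found
sub module_path {
    my ($module) = @_;
    (my $relative = "$module.pm") =~ s{::}{/}g;    # Bio::Seq -> Bio/Seq.pm
    for my $dir (@INC) {
        next if ref $dir;                           # skip code hooks in @INC
        my $candidate = File::Spec->catfile($dir, $relative);
        return $candidate if -e $candidate;
    }
    return;
}

print "Perl looks for modules in:\n";
print "  $_\n" for grep { !ref($_) } @INC;

for my $module (qw(strict Bio::Seq)) {
    my $path = module_path($module);
    print "$module: ", (defined $path ? $path : "not installed"), "\n";
}
```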
             Using Modules

• To use a module, you need to include the
  line:
           use modulename;

  at the beginning of your program.

• But you already knew that…
     use strict;
     use warnings;
        Where to find modules

• You can search for modules (and
  documentation) that may be useful to your
  particular problem using
  http://search.cpan.org/

• CPAN: Comprehensive Perl Archive
  Network
• Central repository for Perl modules and
  more
• “If it’s written in Perl, and it’s helpful and
  free, it’s probably on CPAN”
• http://www.perl.com/CPAN/
Exercise 14
Open a web browser
Go to http://search.cpan.org/
Type in “bioperl Tools BLAST”
Follow the link to Bio::Tools::Blast
Browse through this page and the example code

Make a .plx file like this:

#bioperl example code
use strict;
use warnings;

#make the bioperl module (class) accessible to your program
use Bio::Seq;

print "ok - ready to use Bio::Seq\n";


Does this programme run or return an error?
Bioperl Overview
• The Bioperl project – www.bioperl.org
• Comprehensive, well documented set of
  Perl modules
• A bioinformatics toolkit for:
     •   Format conversion
     •   Report processing
     •   Data manipulation
     •   Sequence analyses
     •   and more!
• Written in object-oriented Perl
Bioperl Overview
• The last exercise most likely did not work
  (unless you have BioPerl installed)
• So let’s install it…
          How to install modules

• This class is using the ActiveState version of Perl,
  which comes with a program called ppm (Perl
  Package Manager)

• At the command prompt type

      >ppm

and follow the instructions in Exercise 15
 How to install modules (without ppm)

• If you are not using ActiveState Perl,
  you can also install modules from CPAN
  using:

  >perl -MCPAN -e "install 'Some::Module'"

• Module dependency is taken care of
  automatically
• You’ll (usually) need to be root to install a
  module successfully
                                 Exercise 15
Install bioperl
1. At the command line prompt type

  >ppm
2. Then at the ppm prompt type
  ppm> search bioperl

3. Then type
  ppm> install bioperl

Try this exercise at home. Installing
   libraries is not possible at UiO
   computers.
                What are objects?
  • Examples of objects in real life:
     – My car, my dog, my dishwasher…
  • Objects have ATTRIBUTES and METHODS
Some attributes of my dog Fido:
•Color of fur = brown
•Height = 20 cm
•Owner’s Name = Ian
•Weight = 2 kg
•Tail position = up

Some methods of my dog Fido:
•Bark
•Walk
•Run
•Eat
•Wag tail
                  What is a class?
  • A class is a type of object in the real world:
     – Cars, dogs, dishwashers…
  • Classes have ATTRIBUTES and METHODS
Some attributes of a dog:
•Color of fur
•Height
•Owner’s Name
•Weight
•Tail position

Some methods of a dog:
•Bark
•Walk
•Run
•Eat
•Wag tail

[Picture: the concept of a “dog”]
So an object is an instance of a class

[Diagram: the class is the concept of “dog”; the object is Fido]
Objects have unique names called
“references” and classes have names too.

[Diagram: the class has the class name Dog; the object has the reference Fido]
By convention, classes have a method called new that
is used to create objects.

     Fido = new Dog();

[Diagram: the reference Fido points to a new object of the class Dog]
A reference to an object can be used to
access its properties or methods.

     print Fido->bark();

     woof

[Diagram: the class is Dog; the object is referenced by Fido]
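The Fido example is written in a simplified notation. In real Perl, a minimal version could look like the sketch below. The class `Dog`, its attributes and its methods are invented for this illustration; they are not part of BioPerl.

```perl
use strict;
use warnings;

# Dog is a hypothetical class used only to illustrate the vocabulary
package Dog;

# Constructor: stores the attributes in a hash and turns it into an object
sub new {
    my ($class, %attributes) = @_;
    my $self = { %attributes };
    return bless $self, $class;
}

# Method: something every Dog object can do
sub bark {
    return "woof";
}

# Method: read one attribute of this particular object
sub name {
    my ($self) = @_;
    return $self->{name};
}

package main;

# $fido is the reference to one object (instance) of the class Dog
my $fido = Dog->new(name => 'Fido', fur_color => 'brown', owner => 'Ian');

print $fido->bark(), "\n";
print $fido->name(), " belongs to $fido->{owner}\n";
```

Writing `Dog->new(...)` is equivalent to the `new Dog(...)` form on the slide and is the more common spelling in current Perl code.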
A reference to an object can be used to
access its properties or methods.

   $refseq = new Bio::DB::RefSeq;

   $molecule = $refseq->get_seq_by_acc("NP_01014");

   $molecule = Some sequence record

[Diagram: the class is Bio::DB::RefSeq; the object is referenced by $refseq]
       Putting it all together
So now that you understand (sort of)
Classes
Objects
Attributes and
Methods

What remains is learning what the different classes are
that are available in BioPerl and what you can do with them.

For the next exercise, use the BioPerl documentation*
to figure out what the following code does…

*see www.bioperl.org/wiki/HOWTOs and
doc.bioperl.org (then click on bioperl-live)
Exercise 16
#!/usr/local/bin/perl
use strict;
use warnings;

# Create and run a program which creates a Seq object and manipulates it:

# make the Bio::Seq class available to my program
use Bio::Seq;

# initiation of Seq object:
# create a new Bio::Seq object and initialize some attributes
my $seq = Bio::Seq->new('-seq' => 'CGGCGTCTGGAACTCTATTTTAAGAACCTCTCAAAACGAAACAAGC',
              '-desc' => 'An example',
              '-display_id' => 'NM_005476',
              '-accession_number' => '6382074',
              '-moltype' => 'dna');

# sequence manipulations
my $aa = $seq->moltype();          # one of 'dna','rna','protein'
my $ab = $seq->subseq(5,10);       # part of the sequence as string

my $ac = $seq->revcom;             # returns an object of the reverse complemented sequence
my $ac1 = $ac->seq();

my $ad = $seq->translate;          # returns an object of the sequence translation
my $ad1 = $ad->seq();

my $ae = $seq->translate(undef,undef,1); # returns an object of the sequence translation (using frame 1) (0,1,2 can be used)
my $ae1 = $ae->seq();

print "Molecule Type: $aa\n";
print "Sequence from 5 to 10: $ab\n";
print "Reverse complemented sequence: $ac1\n";
print "Translated sequence: $ad1\n";
print "Translated sequence (using frame 1): $ae1\n";
                                     Exercise 17

Check out the code of several examples
using BioPerl at:

http://bip.weizmann.ac.il/course/prog2/perlBioinfo/
           More Bioperl modules
• Bio::SeqIO: Sequence Input/Output
  – Retrieve sequence records and write to files
  – Convert sequence records from one format
    to another
• Bio::Seq: Manipulating sequences
  –   Get subsequences ($seq->subseq($start, $end))
  –   Find the length of the sequence ($seq->length)
  –   Reverse complement a DNA sequence
  –   Translate a DNA sequence          ….etc.
• Bio::Annotation: Annotate a sequence
  – Assign journal references to a sequence, etc.
  – Bio::Annotation is associated with an entire
    sequence record and not just part of a
    sequence (see also Bio::SeqFeature)
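As an illustration of format conversion with Bio::SeqIO, the sketch below reads records in one format and writes them in another. It assumes BioPerl is installed; the file names are placeholders, and `convert_format` is a hypothetical helper name. Bio::SeqIO is loaded at run time inside the function so that the script still starts on a machine without BioPerl.

```perl
use strict;
use warnings;

# convert_format() is a hypothetical helper: copies every record from
# one file to another, changing the format, and returns the record count
sub convert_format {
    my ($in_file, $in_format, $out_file, $out_format) = @_;
    require Bio::SeqIO;    # loaded here so a missing BioPerl is reported below

    my $in  = Bio::SeqIO->new(-file => $in_file,     -format => $in_format);
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => $out_format);

    my $count = 0;
    while (my $seq = $in->next_seq) {    # one Bio::Seq object per record
        $out->write_seq($seq);
        $count++;
    }
    return $count;
}

# Placeholder file names, assumed for this example only
my $input  = 'example.gb';
my $output = 'example.fasta';

if (-e $input) {
    my $n = eval { convert_format($input, 'genbank', $output, 'fasta') };
    print defined($n) ? "converted $n record(s)\n" : "conversion failed: $@";
}
else {
    print "no file called $input here, nothing to convert\n";
}
```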
     Some more Bioperl modules
• Bio::SeqFeature: Associate feature annotation to
  a sequence
   – “features” describe specific locations in the
     sequence
   – E.g. 5’ UTR, 3’ UTR, CDS, SNP, etc.
   – Using this object, you can add feature
     annotations to your sequences
   – When you parse a GenBank file using Bioperl,
     the “features” of a record are stored as
     SeqFeature objects
• Bio::DB::GenBank, GenPept, EMBL and
  Swissprot: Remote Database Access
   – You can retrieve a sequence from remote
     databases (through the Internet) using these
     objects
      Even more Bioperl modules
• Bio::SearchIO: Parse sequence database search
  reports
   – Parse BLAST reports (make custom reports)
   – Parse HMMer, FASTA, SIM4, WABA, etc.
   – Custom reports can be output to various
     formats (HTML, Table, etc.)
• Bio::Tools::Run::StandAloneBLAST: Run
  Standalone BLAST through Perl
   – By combining this and SearchIO, you can
     automate and customize BLAST searches
• Bio::Graphics: Draw biological entities (e.g. a gene,
  an exon, BLAST alignments, etc.)
                Bioperl Summary
• For Online documentation:
   – For this workshop:
     http://doc.bioperl.org/releases/bioperl-1.4/
   – Tutorial: http://www.bioperl.org/wiki/HOWTO:Beginners
   – HOWTOs: http://www.bioperl.org/wiki/HOWTOs
   – Modules:
     http://www.bioperl.org/wiki/Category:Core_Modules
• Literature:
   – Stajich et al., The Bioperl toolkit: Perl modules for the
     life sciences. Genome Res. 2002 Oct;12(10):1611-8.
     PMID: 12368254
• Bioperl mailing list: bioperl-l@bioperl.org
   – Best way to get help using Bioperl
   – Very active list (upwards of 10 messages a day)
• Use with caution: things change fast and without
  warning (unless you are on the mailing list…)
                 Perl Documents
• In-line documentation
   – POD = Plain Old Documentation
   – Read POD by typing perldoc <module name>
   – E.g. perldoc perl, perldoc Bio::SeqIO
• On-line documentation
   – http://www.cpan.org
   – http://www.perl.com
   – http://www.bioperl.org
• Books
   – Learning Perl (the best way to learn Perl if you know a bit
     about programming already)
   – Beginning Perl for Bioinformatics (example based way to
     learn Perl for Bioinformatics)
   – Programming Perl (THE Perl reference book – not for the
     faint of heart)
            Additional Book References


• Perl Cookbook 2nd edition (quick solutions to 80%
  of what you want to do)
• Learning Perl Objects, References & Modules (for
  people who want to learn objects, references and
  modules in Perl)
• Perl in a Nutshell (an okay quick reference)
• Perl CD Bookshelf, Version 4.0 (electronic version
  of the above books – best value, searchable, and
  kills fewer trees)
• Mastering Perl for Bioinformatics (more example
  based learning)
• CGI Programming with Perl (rather outdated
  treatment of the subject... Not really
  recommended)
• Perl Graphics Programming (if you want to
  generate graphics using Perl; side note – Perl is
  probably not the best tool for generating
  graphics)
Answer 12
#!/usr/bin/perl
use strict;
use warnings;

#TASK: demonstrate the use of "my" in setting the
#scope of a variable
my $some_variable = 100;

#body of the main program with the function call
print "the value of some_variable is: $some_variable\n";
subroutine1();
print "but here, some_variable is still: $some_variable\n";

#subroutine using $some_variable
sub subroutine1{
    my $some_variable = 0;
    print "in subroutine1, some_variable is: $some_variable\n";
}


#what happens if you comment out "use strict" and
#remove "my" from lines 7 and 16 (the two declarations of $some_variable)?
Answer 13
#!/usr/bin/perl
use strict;
use warnings;

#TASK: check your answers to the regex exercise

#open input file (stop with a message if it cannot be opened)
open(my $in, '<', 'myanswers.txt') or die "Cannot open myanswers.txt: $!";


#read the input file line-by-line
#for each line test if it matches a regular expression
while (my $line = <$in>){
    chomp $line;
    my $is_correct = does_it_match($line);
    if ($is_correct){
        print "$line is a match\n";
    }
    else{
        print "$line is NOT a match\n";
    }
}

#close input file and exit
close($in);
exit();


#does it match
sub does_it_match{
    my ($answer) = @_;
    my $is_correct = 0;
    if ($answer =~ m/^ATG?C*[ATCG]+?A{3,10}$/){
        $is_correct = 1;
    }
    return $is_correct;
}

								