Perl Programming for Biology by bzs12927



      Perl Programming for Biology

               The Bioinformatics Unit
           G.S. Wise Faculty of Life Science
             Tel Aviv University, Israel
                     March 2009

           Eyal Privman and Dudu Burstein

      Why biologists need computers?
         Collecting and managing data
         Searching databases
         Interpreting data
            Protein function prediction - http://smart.embl-
            Gene expression -
            Browsing genomes -

      Why biologists need to program?
          (or: why are you here?)
      Why biologists need to program?
            A real life example
 Proto-oncogene activation by retroviral insertional mutagenesis
      c-Myc: an example for transformation caused by translocation,
      over- or misexpression.
      (In w.t. cells c-Myc is expressed only during the G1 phase).
                             A real life example

      Run BLAST:
      Click “Reformat these results”, choose “Show alignment as
      plain text”, click “view report” and save it to a text file:

                                                                   Score     E
  Sequences producing significant alignments:                      (bits) Value
  ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic...    186   1e-45
  ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c...     38   0.71
  ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c...     36   2.8
  ref|NT_039462.4|Mm8_39502_34 Mus musculus chromosome 8 genomic c...     36   2.8
  ref|NT_039234.4|Mm3_39274_34 Mus musculus chromosome 3 genomic c...     36   2.8
  ref|NT_039207.4|Mm2_39247_34 Mus musculus chromosome 2 genomic c...     36   2.8

  >ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic contig, strain C57BL/6J
            Length = 64849916

      Score = 186 bits (94), Expect = 1e-45
      Identities = 100/102 (98%)
      Strand = Plus / Plus

  Query: 1        taggaagactgcggtaagtcgtgatctgagcggttccgttacagctgctaccctcggcgg 60
                  ||||||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||
  Sbjct: 23209391 taggaagactgcggtgagtcgtgatctgagcggttccgtaacagctgctaccctcggcgg 23209450
           A Perl script can do it for you
      Shmulik writes a simple Perl script to parse blast results and find all hits
      that are in the myc locus, or up to 10kbp from it:

      • Use the BioPerl package SearchIO
      • Open and read file “mice.blast”
      • Iteration – for each blast result:
              • If we hit the genomic sequence “Mm15_39661_34”
              • in the coordinates of the Myc locus
                (23,198,120 .. 23,223,004)
              • then print this hit (hit number and position in locus)
1.7Use the BioPerl package SearchIO                      Open file “mice.blast”
            Perl all blast results
         A Iterate overscript can do it for you
    Shmulik writes a simple Perl script to For each blast hit –and find all hits genomic
                                           parse blast results ask if we hit the
                                                sequence “Mm15_39661_34” in the
    that are in the myc locus, or up to 10kbp from it:
      If so – print hit name                       coordinates of the Myc locus
    use Bio::SearchIO;
           and position                                23,198,120..23,223,004
    my $blast_report = new Bio::SearchIO ('-format'=>'blast',
                                          '-file' =>'mice.blast');
    while (my $result = $blast_report->next_result)
       print "Checking query ", $result->query_name, "...\n";
       my $hit = $result->next_hit();
       my $hsp = $hit->next_hsp();
       if ($hit->name() =~ m/Mm15_39661_34/
           && $hsp->hit->start() > 23198120
           && $hsp->hit->end()   < 23223004)
          print "   hit ", $hit->name();
          print " (at position ", $hsp->hit->start(), ")\n";
              A Perl script can do it for you

      Checking query tumor1...
      hit ref|NT_039621.4|Mm15_39661_34 (at position 23209391)
      Checking query tumor2...
      Checking query tumor3...
      Checking query tumor4...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23211826)
      Checking query tumor5...
      Checking query tumor6...
      Checking query tumor7...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23210877)
      Checking query tumor8...
      Checking query tumor9...
      Checking query tumor10...
      Checking query tumor11...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23213713)
      Checking query tumor12...
                             What is Perl ?

      • Perl was created by Larry Wall.
            (read his forward to the book “Learning Perl”)
       Perl = Practical Extraction and Report Language
            (or: Pathologically Eclectic Rubbish Lister)

      • Perl is an Open Source project
      • Perl is a cross-platform programming language.
                          Why Perl ?

       • Perl is an Open Source project
       • Perl is a cross-platform programming language.

   • Perl is a very popular programming language,
     especially for bioinformatics
   • Perl allows a rapid development cycle
   • Perl is strong in text manipulation
   • Perl can easily handle files and directories
   • Perl can easily run other programs
                     Perl & biology

      BioPerl: “An international association of developers of
       open source Perl tools for bioinformatics, genomics and
       life science research”
      Many smaller projects, and millions of little pieces of
       biological Perl code (which you should use as references
       – google and find them!)
                         This course
      Intended for students with no experience in programming, and
       for those who have taken one programming course in the past
       (e.g. A course in C or Pascal classes in high school)
      Requires more hours than your average seminar…
      Oriented towards programming tasks for molecular biology
                  Some formalities…
      Please read the course web page:
       Presentations will be available on the morning of the class.
      There will be 5-6 exercises, amounting to 30% of your grade.
       You get full points if you do the whole exercise, even if your
       answers are wrong.
      Exercises are for individual practice. DO NOT submit exercises
       in pairs or copy exercises from anyone.
      Submit your exercises by email to your metargel (either Eyal or Dudu and we will
       reply with our feedback.
      There will be a final exam on the computers in the computer
              Email list for the course
      Everybody please send us an email ( and and write that you’re taking the course (even
       if you are not enrolled yet)
                 Example exercises
      Ex. 1: Write a script that prints "I will submit my
       homework on time" 100 times
       (by the end of this lesson!  )
      Ex. 3: Read a GenBank file and print coordinates of
      Ex. 5: Write a module of functions for reading
       sequence files and identification of palindromes
                          A first Perl script
  print "Hello world!";

  A Perl statement must end with a semicolon “;”
  The print function outputs some information to the terminal screen
                                    Data types
  Data Type                    Description
  scalar                 A single number or string value
       9   -17        3.1415       "hello"
  array                  An ordered list of scalar values

  associative_array      Also known as a “hash”. Holds an unordered list of
                         key-value couples.
       ('eyal' => '',
       'dudu' => '')

       Scalar Data
                                Scalar values
  A scalar is either a string or a number.

  Numerical values
       3              -20              3.14152965
       1.3e4 (= 1.3 × 104 = 1,300)
       6.35e-14 ( = 6.35 × 10-14)
                         Scalar values
 Double-quoted strings                       Single-quoted strings
 print "hello world";                        print 'hello world';
 hello world                                 hello world
 print "hello\tworld";                       print 'a backslash-t: \t ';
 hello world                                 a backslash-t: \t

 print "a backslash: \\ ";
 a backslash: \
 print "a double quote: \" ";
 a double quote: "

       Backslash is an             Construct          Meaning
       “escape” character that          \n         Newline
       gives the next character         \t         Tab
       a special meaning:               \\         Backslash
                                        \”         Double quote
  An operator takes some values (operands), operates on them, and produces a
  new value.
  Numerical operators:         + - * /
                               ** (exponentiation)
                               ++ -- (autoincrement)
   print 1+1;
   print ((1+1)**3);
  An operator takes some values (operands), operates on them, and produces a
  new value.
  String operators:        .   (concatenate)
                           x   (replicate)
   print ('swiss'.'prot');
   print (('swiss'.'prot')x3);
                          String or number?
  Perl decides the type of a value depending on its context:
  (9+5).'a'                          (9x2)+1
  14.'a'                             ('9'x2)+1
  '14'.'a'                           '99'+1
  '14a'                              99+1
  Warning: When you use parentheses in print make sure to put one pair of
  parantheses around the WHOLE expression:
  print (9+5).'a';             # wrong
  print ((9+5).'a');           # right
  You will know that you have such a problem if you see this warning:
  print (...) interpreted as function at line 3.
  Scalar variables can store scalar values.
  Variable declaration                my $priority;
  Numerical assignment                $priority = 1;
  String assignment                   $priority = 'high';
  Assign the value of variable $b to $a
                             $a = $b;
  Note: Here we make a copy of $b in $a.
                      Variables - notes and tips
  • Give meaningful names to variables: e.g. $studentName is better than $n
  • Always use an explicit declaration of the variables using the my function

  Note: Variable names in Perl are case-sensitive. This means that the following
  variables are different (i.e. they refer to different values):
  $varname = 1;
  $VarName = 2;
  $VARNAME = 3;

  Note: Perl has a long list of scalar special variables ($_, $1, $2,…)
  So please don’t use them!
                   Variables - always use strict!
  Always include the line:
       use strict;
  as the first line of every script.
  • “Strict” mode forces you to declare all variables by my.
  • This will help you avoid very annoying bugs, such as spelling mistakes in the
  names of variables.
       Interpolating variables into strings
            $a = 9.5;
            print "a is $a!\n";
               a is 9.5!

            print 'a is $a!\n';
               a is $a!\n
                             Reading input
       <STDIN> allows us to get input from the user:

       print "What is your name?\n";
       my $name = <STDIN>;
       print "Hello $name!";

       Here is a test run:

         What is your name?
         Hello Shmulik

  $name: "Shmulik\n"
                             Reading input
       Use the chomp function to remove the “new-line” from
       the end of the string (if there is any):

       print "What is your name?\n";
       my $name = <STDIN>;
       chomp $name;   # Remove the new-line
       print "Hello $name!";

       Here is a test run:

         What is your name?
         Hello Shmulik!

  $name: "Shmulik\n"
                      Built-in Perl functions:
                       The length function
  The length function returns the length of a string:
    print length("hi you");
                         The substr function
  The substr function extracts a substring out of a string.
  It receives 3 arguments:    substr(EXPR,OFFSET,LENGTH)

  For example:
  $str = "university";
  $sub = substr ($str, 3, 5);
  $sub is now "versi", and $str remains unchanged.

  Note: If length is omitted, everything to the end of the string is returned.
  You can use variables as the offset and length parameters.
  The substr function can do a lot more, google it and you will see…

To top