Perl for Bioinformatics Part 2


      Stuart Brown

  NYU School of Medicine


•  Beginning Perl for Bioinformatics

  –  James Tisdall, O’Reilly Press, 2000

•  Using Perl to Facilitate Biological Analysis 
  in Bioinformatics: A Practical Guide (2nd Ed.)

  –  Lincoln Stein, Wiley-Interscience, 2001

•  Introduction to Programming and Perl

  –  Alan M. Durham, Computer Science Dept., Univ. of São Paulo, Brazil


•  Hopefully you were lucky enough to have
   some bugs in your programs from the first
   Perl exercise.

•  Test each line as you write 

  –  insert extra print statements to check on the values of your variables

         Perl Debugging Help

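•  Add -w on the first line of your programs:

 #!/usr/local/bin/perl -w

  –  provides ‘warnings’

  –  adjust the path to wherever perl is installed on your system

•  Add use strict; as the 2nd line of your programs

  –  enforces proper variable names: every variable must be declared (with my) before it is used

  –  with -w, Perl also warns when a variable is used before it has a value

     (set it to some initial value such as 0 or an empty string)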
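
A minimal sketch of a script that starts with both debugging aids. The variable names are illustrative only, and the path on the first line is an assumption; use the location of perl on your own system.

```perl
#!/usr/bin/perl -w
use strict;

# With use strict, every variable is declared with "my" before it is used.
my $count    = 0;     # numbers can start at 0
my $sequence = '';    # strings can start as an empty string

$count    = $count + 1;
$sequence = $sequence . 'ATG';   # the . operator joins two strings

print "count is $count and sequence is $sequence\n";
```

If you remove the word my from one of the declarations, Perl refuses to run the script and names the undeclared variable, which is usually how a mistyped variable name gets caught.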

        Variable “Interpolation”

•  A variable holds a value: $value = 6;

•  When you print the variable, Perl gives the value
   rather than the name of the variable.

print $value;

6

•  If you put a variable inside double quotes, Perl
   substitutes the value (this is called variable interpolation)

print "The result is $value\n";


The result is 6

•  If you use single quotes, the variable name is used
   (interpolation is not used) 

    print 'The result is $value\n';


The result is $value\n
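
A short sketch comparing the two kinds of quotes side by side. It also shows one related detail that is not on the slide: when text follows a variable name directly, curly braces tell Perl where the name ends. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $value = 6;

my $double = "Length: $value\n";       # interpolated: the 6 is substituted
my $single = 'Length: $value\n';       # literal: $value and \n stay as typed
my $braces = "Length: ${value}kb\n";   # braces mark the end of the variable name

print $double;
print $single, "\n";
print $braces;
```

Without the braces, Perl would look for a variable called $valuekb, which does not exist.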


•  A Perl program can take input from the keyboard

  –  The angle bracket operator (<>) takes input from the keyboard (standard input)

  –  Usually this is assigned to a variable

     print "Please type a number: ";

     $num = <>;

     print "Your number is $num\n";


•  When data is entered from the keyboard, Perl waits for the
   Enter key to be typed

•  But the string which is captured includes a newline (carriage
   return) at its end

•  Perl uses the function chomp to remove the newline

   print "Enter your name: ";

   $name = <>;

   print "Hello $name, happy to meet you!\n";   # newline is still attached to $name

   chomp $name;

   print "Hello $name, happy to meet you!\n";   # newline has been removed
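
To see what chomp does without typing anything, the sketch below uses a fixed string in place of keyboard input. The name is made up, and length (which counts the characters in a string) is used only to make the removed newline visible.

```perl
use strict;
use warnings;

# Stand-in for a line typed at the keyboard: the text plus the newline from Enter.
my $name = "Rosalind\n";

my $length_before = length($name);   # the newline is counted too
chomp $name;                         # removes the trailing newline, if there is one
my $length_after = length($name);

print "Before chomp: $length_before characters\n";
print "After chomp:  $length_after characters\n";
print "Hello $name, happy to meet you!\n";
```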

       Working with Text Files

•  To do real work, Perl has to read data out of
   text files and write results into output files

•  This is done in two steps

•  First, you must give the file a name within
   the script - this is known as a filehandle

•  Use the open command:

  open FILE1, '/u/schmoj01/Seqs/protein1.seq';

            Read From the File

•  Once the file is open, you can read from it using
   the <> operator 

   –  (put the filehandle between the angle brackets)

•  Perl reads files one line at a time; each time you
   input data from the file, the next line is read:

      open FILE1, '/u/prot1.seq';

      $line1 = <FILE1>;

      chomp $line1;

      $line2 = <FILE1>;


              Write to a File

•  Writing to a file is similar to reading from it

•  Use the > operator to open a file for writing:

open FILE1, '>/u/prot1.seq';

•  This creates a new file with that name, or
   overwrites an existing file

•  Use >> to append text to an existing file

•  print to the file using the filehandle:

print FILE1 $data1;
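
The sketch below puts the three modes together: create a file, append to it, then read it back. The file name is hypothetical and the file is created in the current directory just for the demonstration. Each open is followed by or die so that the script stops with a message if the file cannot be opened; that check is an addition, not part of the examples above.

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this demonstration.
my $file = 'demo_output.txt';

# '>' creates the file, or overwrites it if it already exists.
open FILE1, ">$file" or die "Cannot write to $file: $!\n";
print FILE1 "first line\n";
close FILE1;

# '>>' adds to the end of the file instead of overwriting it.
open FILE1, ">>$file" or die "Cannot append to $file: $!\n";
print FILE1 "second line\n";
close FILE1;

# Read the two lines back to check what was written.
open FILE1, $file or die "Cannot read $file: $!\n";
my $line1 = <FILE1>;
my $line2 = <FILE1>;
close FILE1;

print $line1, $line2;

unlink $file;   # remove the demonstration file
```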

          Making Decisions

•  Useful programs must be able to make some
   decisions on their own

•  The if statement is very powerful

•  It is generally used together with numerical
   or string comparison operators

  numerical: ==, !=, >, <, >=, <=

  string: eq, ne, gt, lt, ge, le


•  Perl relies on the concept of True/False

•  A comparison is true if the relation it states holds (6 > 3 is true, 6 < 3 is false)

•  The not operator ! reverses it 

print "positive number" if ! ($a < 0);
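
Choosing between the numerical and the string operators matters, because the same two values can compare differently. A small sketch with made-up values:

```perl
use strict;
use warnings;

my $x = '10';
my $y = '10.0';

my $same_number = 'no';
my $same_string = 'no';
my $nine_after_ten = 'no';

$same_number = 'yes' if $x == $y;          # as numbers, 10 equals 10.0
$same_string = 'yes' if $x eq $y;          # as text, '10' and '10.0' differ
$nine_after_ten = 'yes' if '9' gt '10';    # as text, '9' sorts after '1'

print "Same number? $same_number\n";
print "Same string? $same_string\n";
print "Is '9' gt '10'? $nine_after_ten\n";
```

As a rule of thumb, use ==, <, > and their relatives for numbers such as lengths and counts, and eq, lt, gt and their relatives for text such as sequence names.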

              Conditional Blocks

•  An if test can be used to control multiple lines
   of commands:

      print "Enter your age: ";

      $age = <>;

      chomp $age;

      if ($age < 21) {

         print "You are too young for this kind of ...\n";

         die "too young";

      }

      print "You are old enough to know better!\n";

•  If the test is true, execute all the command lines inside
   the {} brackets. If not, then go on past the closing } to
   the statements below.

•  if evaluates the condition in parentheses
   (the result is treated as true or false)

•  Note: conditional block is indented

   –  Perl doesn’t care about indents, but it makes your
      code more human readable

•  die is a special function - stops your script
   and prints its message

   –  Often used to test if keyboard input data is valid
      or if an input file exists. 

                   Else & Elsif

•  Instead of just letting the script go on if it fails the if
   test, you can designate a second block of code for
   the “or else” condition 

•  You can also perform multiple tests using elsif
   (note the spelling: Perl has no elseif)

       if ($A == 10) {

          print "yadda yadda";   # do some stuff

       } elsif ($A > 10) {

          print "yowsa yowsa";   # do different stuff

       } elsif ($A < 10) {

          print "do this other stuff";

       } else {

          print "if it ain't =, >, or <, then I'm stumped";

          die "not a number";

       }
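
A compact version that can be run as is. Perl spells the keyword elsif, every condition goes in parentheses, == tests equality (a single = would assign a value), and else takes no condition. The starting value is arbitrary.

```perl
use strict;
use warnings;

my $A = 12;     # illustrative value; try 10 or 3 as well
my $result;

if ($A == 10) {
    $result = 'equal to 10';
} elsif ($A > 10) {
    $result = 'greater than 10';
} else {
    $result = 'less than 10';
}

print "$A is $result\n";
```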



•  OK, we’ve got variables, input & output and
   decisions. Now we need Loops.

•  Loops test a condition and repeat a block of
   code based on the result

    –  while loops repeat while the condition is true

             $count = 1;

             while ($count <= 10) {

                print "$count bottles of pop\n";

                $count = $count + 1;

             }

             print "POP!\n";

[Try this program yourself]
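
The same loop pattern can accumulate a result instead of printing on every pass. A minimal sketch, with variable names chosen for the example:

```perl
use strict;
use warnings;

my $count = 1;
my $total = 0;

while ($count <= 10) {
    $total = $total + $count;   # add the current number to the running total
    $count = $count + 1;        # without this line the loop would never end
}

print "The sum of 1 to 10 is $total\n";
```

When the loop finishes, $count is 11: that is the first value for which the condition is false.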

        Read a File: line by line

  open FILE1, '/u/doej01/prot1.seq';

  while ($line = <FILE1>) {

     $my_sequence = $my_sequence . $line;

  }

  close FILE1;

•  Dumps the whole file into the variable $my_sequence
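
A runnable sketch of the same loop. It first writes a small demonstration file (the file name and its contents are made up) so that there is something to read, and it adds a chomp inside the loop so that the newlines do not end up in the sequence; whether you want that depends on what you do with the text afterwards.

```perl
use strict;
use warnings;

# Hypothetical demonstration file, written first so the loop has input.
my $file = 'demo_dna.seq';
open FILE1, ">$file" or die "Cannot write to $file: $!\n";
print FILE1 "ATGCGT\n";
print FILE1 "TTAGGC\n";
close FILE1;

my $my_sequence = '';   # start with an empty string
my $line_count  = 0;

open FILE1, $file or die "Cannot read $file: $!\n";
while (my $line = <FILE1>) {
    chomp $line;                           # drop the newline at the end of the line
    $my_sequence = $my_sequence . $line;   # add this line to the growing sequence
    $line_count  = $line_count + 1;
}
close FILE1;

print "Read $line_count lines: $my_sequence\n";

unlink $file;   # remove the demonstration file
```

The loop stops by itself at the end of the file, because <FILE1> then returns nothing and the condition becomes false.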


•  It is awkward to store a large DNA sequence in
   one variable, or to create many variables for a list
   of numbers

•  Perl has a type of variable called an “array” that
   can store a list of data

   –  multiple lines of a text file

   –  a list of numbers

   –  a list of words

•  Array variables are referred to with an “@”

       @numbers = (1,2,45,234,11);

     Bioinformatics Uses Arrays

•  Bioinformatics data often comes in the form of

   –  tab delimited lists

   –  multi-line text files

•  Arrays are handy because the entries are indexed

   –  You can grab any number directly by its index

       @numbers = (1, 2, 45, 234, 11);

       print "$numbers[3]\n";     # prints 234

#Note - the index starts with zero, so $numbers[3] is the fourth number
        and the third number is $numbers[2]
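
A short sketch of zero-based indexing on the same list. It also uses two things not shown above: $#numbers, which is the index of the last element, and scalar, which gives the number of elements.

```perl
use strict;
use warnings;

my @numbers = (1, 2, 45, 234, 11);

my $first  = $numbers[0];           # index 0 is the first element
my $third  = $numbers[2];           # index 2 is the third element
my $last   = $numbers[$#numbers];   # $#numbers is the index of the last element
my $length = scalar @numbers;       # how many elements the array holds

print "first=$first third=$third last=$last length=$length\n";
```

Note that a single element is written with $ rather than @, because one element is a single (scalar) value.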

      Read a File into an Array

•  Rather than read a file one line at time into a
   scalar variable, it is often helpful to read the
   entire file into an array

   open FILE1, '/u/doej01/prot1.seq';

   @DNA = <FILE1>;
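
A runnable sketch of reading a whole file into an array. The demonstration file and its contents are made up. Each line becomes one array element, newline included, so the sketch applies chomp to the whole array at once.

```perl
use strict;
use warnings;

# Hypothetical demonstration file with three lines.
my $file = 'demo_lines.seq';
open FILE1, ">$file" or die "Cannot write to $file: $!\n";
print FILE1 "ATGCGT\nTTAGGC\nCCATGA\n";
close FILE1;

open FILE1, $file or die "Cannot read $file: $!\n";
my @DNA = <FILE1>;   # assigned to an array, <FILE1> returns every line
close FILE1;

chomp @DNA;          # chomp on an array strips the newline from every element

my $line_count = scalar @DNA;
print "The file has $line_count lines\n";
print "The second line is $DNA[1]\n";

unlink $file;        # remove the demonstration file
```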

         join & substr

•  join combines the elements of an array into
   a single scalar variable (a string)

     $DNA = join('', @DNA);

     –  first argument: the spacer placed between elements (empty here)

     –  second argument: which array to join

•  substr takes characters out of a string

  $letter = substr($DNA, $position, 1);

     –  first argument: which string

     –  second argument: where in the string (positions count from zero)

     –  third argument: how many letters to take
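
One way join, substr and a while loop can work together is to walk along a sequence one letter at a time. The sketch below counts the letter A in two made-up lines; the same pattern works for any other letter.

```perl
use strict;
use warnings;

my @DNA = ('ATGCGT', 'TTAGGC');   # made-up lines, newlines already removed
my $DNA = join('', @DNA);         # one string: ATGCGTTTAGGC

my $position = 0;
my $count_A  = 0;

while ($position < length($DNA)) {
    my $letter = substr($DNA, $position, 1);      # one letter at this position
    $count_A = $count_A + 1 if $letter eq 'A';    # eq, because letters are text
    $position = $position + 1;
}

print "Sequence $DNA has $count_A letters A\n";
```

The loop condition uses < rather than <= because the last valid position is one less than the length of the string.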

•  Read a DNA sequence from a text file

•  Calculate the %GC content

•  What about non-DNA characters in the file?

  –  carriage returns and blank spaces

  –  N’s or X’s or unexpected letters

•  Write the output to the screen and to a file 

  –  use append so that the file will grow as you run
     this program on additional sequences