Perl Programming for Biology by bzs12927

VIEWS: 17 PAGES: 31

									1.1



      Perl Programming for Biology

               The Bioinformatics Unit
           G.S. Wise Faculty of Life Science
             Tel Aviv University, Israel
                     March 2009


           Eyal Privman and Dudu Burstein
1.2

      Why biologists need computers?
         Collecting and managing data
            http://www.ncbi.nlm.nih.gov/
         Searching databases
            http://www.ncbi.nlm.nih.gov/BLAST/
         Interpreting data
            Protein function prediction - http://smart.embl-
             heidelberg.de/
            Gene expression - http://www.bioconductor.org/
            Browsing genomes - http://genome.ucsc.edu/
1.3




      Why biologists need to program?
          (or: why are you here?)
1.4
      Why biologists need to program?
            A real life example
 Proto-oncogene activation by retroviral insertional mutagenesis
      c-Myc: an example for transformation caused by translocation,
      over- or misexpression.
      (In w.t. cells c-Myc is expressed only during the G1 phase).
1.5
                             A real life example
  >tumor1
  TAGGAAGACTGCGGTAAGTCGTGATCTGAGCGGTTCCGTTACAGCTGCTA
  CCCTCGGCGGGGAGAGGGAAGACGCCCTGCACCCAGTGCTG
  ...
  >tumor157

      Run BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
      Click “Reformat these results”, choose “Show alignment as
      plain text”, click “view report” and save it to a text file:

                                                                   Score     E
                                                                                              Shmulik
  Sequences producing significant alignments:                      (bits) Value
  ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic...    186   1e-45
  ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c...     38   0.71
  ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c...     36   2.8
  ref|NT_039462.4|Mm8_39502_34 Mus musculus chromosome 8 genomic c...     36   2.8
  ref|NT_039234.4|Mm3_39274_34 Mus musculus chromosome 3 genomic c...     36   2.8
  ref|NT_039207.4|Mm2_39247_34 Mus musculus chromosome 2 genomic c...     36   2.8

  >ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic contig, strain C57BL/6J
            Length = 64849916

      Score = 186 bits (94), Expect = 1e-45
      Identities = 100/102 (98%)
      Strand = Plus / Plus

  Query: 1        taggaagactgcggtaagtcgtgatctgagcggttccgttacagctgctaccctcggcgg 60
                  ||||||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||
  Sbjct: 23209391 taggaagactgcggtgagtcgtgatctgagcggttccgtaacagctgctaccctcggcgg 23209450
  ...
  ...
1.6
           A Perl script can do it for you
      Shmulik writes a simple Perl script to parse blast results and find all hits
      that are in the myc locus, or up to 10kbp from it:

      • Use the BioPerl package SearchIO
      • Open and read file “mice.blast”
      • Iteration – for each blast result:
              • If we hit the genomic sequence “Mm15_39661_34”
              • in the coordinates of the Myc locus
                (23,198,120 .. 23,223,004)
              • then print this hit (hit number and position in locus)
1.7Use the BioPerl package SearchIO                      Open file “mice.blast”
            Perl all blast results
         A Iterate overscript can do it for you
    Shmulik writes a simple Perl script to For each blast hit –and find all hits genomic
                                           parse blast results ask if we hit the
                                                sequence “Mm15_39661_34” in the
    that are in the myc locus, or up to 10kbp from it:
      If so – print hit name                       coordinates of the Myc locus
    use Bio::SearchIO;
           and position                                23,198,120..23,223,004
    my $blast_report = new Bio::SearchIO ('-format'=>'blast',
                                          '-file' =>'mice.blast');
    while (my $result = $blast_report->next_result)
    {
       print "Checking query ", $result->query_name, "...\n";
       my $hit = $result->next_hit();
       my $hsp = $hit->next_hsp();
       if ($hit->name() =~ m/Mm15_39661_34/
           && $hsp->hit->start() > 23198120
           && $hsp->hit->end()   < 23223004)
       {
          print "   hit ", $hit->name();
          print " (at position ", $hsp->hit->start(), ")\n";
       }
    }
1.8
              A Perl script can do it for you

      Checking query tumor1...
      hit ref|NT_039621.4|Mm15_39661_34 (at position 23209391)
      Checking query tumor2...
      Checking query tumor3...
      Checking query tumor4...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23211826)
      Checking query tumor5...
      Checking query tumor6...
      Checking query tumor7...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23210877)
      Checking query tumor8...
      Checking query tumor9...
      Checking query tumor10...
      Checking query tumor11...
         hit ref|NT_039621.4|Mm15_39661_34 (at position 23213713)
      Checking query tumor12...
1.9
                             What is Perl ?

      • Perl was created by Larry Wall.
            (read his forward to the book “Learning Perl”)
       Perl = Practical Extraction and Report Language
            (or: Pathologically Eclectic Rubbish Lister)

      • Perl is an Open Source project
      • Perl is a cross-platform programming language.
1.10
                          Why Perl ?

       • Perl is an Open Source project
       • Perl is a cross-platform programming language.

   • Perl is a very popular programming language,
     especially for bioinformatics
   • Perl allows a rapid development cycle
   • Perl is strong in text manipulation
   • Perl can easily handle files and directories
   • Perl can easily run other programs
1.11
                     Perl & biology

      BioPerl: “An international association of developers of
       open source Perl tools for bioinformatics, genomics and
       life science research”
       http://bioperl.org/
      Many smaller projects, and millions of little pieces of
       biological Perl code (which you should use as references
       – google and find them!)
1.12
                         This course
      Intended for students with no experience in programming, and
       for those who have taken one programming course in the past
       (e.g. A course in C or Pascal classes in high school)
      Requires more hours than your average seminar…
      Oriented towards programming tasks for molecular biology
1.13
                  Some formalities…
      Please read the course web page:
       http://bioinfo.tau.ac.il/~perluser/perl/tashsat
       Presentations will be available on the morning of the class.
      There will be 5-6 exercises, amounting to 30% of your grade.
       You get full points if you do the whole exercise, even if your
       answers are wrong.
      Exercises are for individual practice. DO NOT submit exercises
       in pairs or copy exercises from anyone.
      Submit your exercises by email to your metargel (either Eyal
       privmane@tau.ac.il or Dudu davidbur@tau.ac.il) and we will
       reply with our feedback.
      There will be a final exam on the computers in the computer
       classroom.
1.14
              Email list for the course
      Everybody please send us an email (privmane@tau.ac.il and
       davidbur@tau.ac.il) and write that you’re taking the course (even
       if you are not enrolled yet)
1.15
                 Example exercises
      Ex. 1: Write a script that prints "I will submit my
       homework on time" 100 times
       (by the end of this lesson!  )
      Ex. 3: Read a GenBank file and print coordinates of
       ORFs
      Ex. 5: Write a module of functions for reading
       sequence files and identification of palindromes
1.16
                          A first Perl script
  print "Hello world!";


  A Perl statement must end with a semicolon “;”
  The print function outputs some information to the terminal screen
1.17
                                    Data types
  Data Type                    Description
  scalar                 A single number or string value
       9   -17        3.1415       "hello"
  array                  An ordered list of scalar values
       (9,-15,3.5)


  associative_array      Also known as a “hash”. Holds an unordered list of
                         key-value couples.
       ('eyal' => 'privmane@tau.ac.il',
       'dudu' => 'davidbur@tau.ac.il')
1.18




       Scalar Data
1.19
                                Scalar values
  A scalar is either a string or a number.

  Numerical values
       3              -20              3.14152965
       1.3e4 (= 1.3 × 104 = 1,300)
       6.35e-14 ( = 6.35 × 10-14)
1.20
                         Scalar values
                              Strings
 Double-quoted strings                       Single-quoted strings
 print "hello world";                        print 'hello world';
 hello world                                 hello world
 print "hello\tworld";                       print 'a backslash-t: \t ';
 hello world                                 a backslash-t: \t

 print "a backslash: \\ ";
 a backslash: \
 print "a double quote: \" ";
 a double quote: "

       Backslash is an             Construct          Meaning
       “escape” character that          \n         Newline
       gives the next character         \t         Tab
       a special meaning:               \\         Backslash
                                        \”         Double quote
1.21
                               Operators
  An operator takes some values (operands), operates on them, and produces a
  new value.
  Numerical operators:         + - * /
                               ** (exponentiation)
                               ++ -- (autoincrement)
   print 1+1;
      2
   print ((1+1)**3);
      8
1.22
                               Operators
  An operator takes some values (operands), operates on them, and produces a
  new value.
  String operators:        .   (concatenate)
                           x   (replicate)
  e.g.
   print ('swiss'.'prot');
      swissprot
   print (('swiss'.'prot')x3);
      swissprotswissprotswissprot
1.23
                          String or number?
  Perl decides the type of a value depending on its context:
  (9+5).'a'                          (9x2)+1
  14.'a'                             ('9'x2)+1
  '14'.'a'                           '99'+1
  '14a'                              99+1
                                     100
  Warning: When you use parentheses in print make sure to put one pair of
  parantheses around the WHOLE expression:
  print (9+5).'a';             # wrong
  print ((9+5).'a');           # right
  You will know that you have such a problem if you see this warning:
  print (...) interpreted as function at ex1.pl line 3.
1.24
                                 Variables
  Scalar variables can store scalar values.
  Variable declaration                my $priority;
  Numerical assignment                $priority = 1;
  String assignment                   $priority = 'high';
  Assign the value of variable $b to $a
                             $a = $b;
  Note: Here we make a copy of $b in $a.
1.25
                      Variables - notes and tips
  Tips:
  • Give meaningful names to variables: e.g. $studentName is better than $n
  • Always use an explicit declaration of the variables using the my function

  Note: Variable names in Perl are case-sensitive. This means that the following
  variables are different (i.e. they refer to different values):
  $varname = 1;
  $VarName = 2;
  $VARNAME = 3;

  Note: Perl has a long list of scalar special variables ($_, $1, $2,…)
  So please don’t use them!
1.26
                   Variables - always use strict!
  Always include the line:
       use strict;
  as the first line of every script.
  • “Strict” mode forces you to declare all variables by my.
  • This will help you avoid very annoying bugs, such as spelling mistakes in the
  names of variables.
1.27
       Interpolating variables into strings
            $a = 9.5;
            print "a is $a!\n";
               a is 9.5!


            Reminder:
            print 'a is $a!\n';
               a is $a!\n
1.28
                             Reading input
       <STDIN> allows us to get input from the user:

       print "What is your name?\n";
       my $name = <STDIN>;
       print "Hello $name!";

       Here is a test run:

         What is your name?
         Shmulik
         Hello Shmulik
         !


  $name: "Shmulik\n"
1.29
                             Reading input
       Use the chomp function to remove the “new-line” from
       the end of the string (if there is any):

       print "What is your name?\n";
       my $name = <STDIN>;
       chomp $name;   # Remove the new-line
       print "Hello $name!";

       Here is a test run:

         What is your name?
         Shmulik
         Hello Shmulik!


  $name: "Shmulik\n"
         "Shmulik"
1.30
                      Built-in Perl functions:
                       The length function
  The length function returns the length of a string:
    print length("hi you");
         6
1.31
                         The substr function
  The substr function extracts a substring out of a string.
  It receives 3 arguments:    substr(EXPR,OFFSET,LENGTH)

  For example:
  $str = "university";
  $sub = substr ($str, 3, 5);
  $sub is now "versi", and $str remains unchanged.

  Note: If length is omitted, everything to the end of the string is returned.
  You can use variables as the offset and length parameters.
  The substr function can do a lot more, google it and you will see…

								
To top