Introduction to Programming Perl for Biologists

Document Sample
Introduction to Programming Perl for Biologists Powered By Docstoc
					Introduction to Programming: Perl for Biologists
Timothy M. Kunau
Center for Biomedical Research Informatics
Academic Health Center
University of Minnesota
kunau@umn.edu

Bioinformatics Summer Institute 2007




                                                   1
Introduction to Programming: Day two
Timothy M. Kunau
Center for Biomedical Research Informatics
Academic Health Center
University of Minnesota
kunau@umn.edu

Bioinformatics Summer Institute 2007




                                             2
Day I

•Art and Programming
•Getting Started
•Biology and Computer Science
•Bioinformatics Data
•Perl basics:
  •Strings and Variables
  •Math and Logic
  •Looping, operators, and functions

                                       3
Day II

•Assignment discussion

•Data from outside the program

•Writing out data

•Data into arrays and hashes

  •Array operations

•Scope and Good practices

•RegEx

                                 4
Day I: assignment review.

1. Calculate the reverse complement of a DNA strand using the tr/// operation.

2. Read about file handling. (Safari on-line documentation is available.)

3. Read about Regular Expressions (regex). (Safari)

4. Find CPAN.ORG and locate a module that would be useful to you as a biologist.

5. Read about that module and email me (kunau@umn.edu) the following details:

   1. Name of the module.

   2. The name of the person who wrote it.

   3. What it does.

   4. How it would be useful to you?



                                                                                   5
Day I: assignment review.

1. Calculate the reverse complement of a DNA
 strand using the tr/// operation.




                                               6
The tr/// operator (translate)

 • Match and replace what is in the first section,
   in order, with what is in the second.

 • $dna   =~ tr/[A-Z]/[a-z]/;     # lowercase

 • $dna   =~ tr/[A-Z]/[B-ZA]/;    # shift cipher

 • $dna   =~ tr/[ACGT]/[TGCA]/;   # revcom

 • $dna   = reverse($dna);


                                                    7


                                                        7
             s/// operator (substitute)

• Allows you to substitute whatever is matched in
  first section with value in the second section. (See
  m//.)

• $sport   =~ s/football/soccer/g;

• $tdfwinner   =~ s/Lance Armstrong/Ivan Basso/g;




                                                    8


                                                        8
Reverse compliment of a DNA strand

#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

print "Here is the starting DNA:\n\n$DNA\n\n";

# Calculate the reverse complement
my $revcom = reverse $DNA;

# The Perl translate/transliterate command is just what we need:
$revcom =~ tr/ACGTacgt/TGCAtgca/;

print "Here is the reverse complement DNA:\n\n$revcom\n";




                                                                   9
CPAN




       10
Day I: assignment review, CPAN modules


1. Name of the module.

2. The name of the person who wrote it.

3. What it does.

4. How it would be useful to you.




                                          11
Getting Data from Files

open(HANDLE, "contig2_MT.fa") || die $!;

while (defined($line = <HANDLE>)) {
  if( $line =~ /^\>/ ) {
        print $line, "\n";
  }
}

close(HANDLE);

% ./file-handles.pl
>ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1
                                                                    12


                                                                         12
Getting Data from Files
open(HANDLE, "contig2_MT.fa") || die $!;

while (<HANDLE>) {

    if( $_ =~ /^\>/ ) { # tests first line
          print $_, "\n"; # prints first line
    }
}

close(HANDLE);

% ./file-handlesII.pl
>ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1
                                                                    13


                                                                         13
Getting Data from Files

open(HANDLE, "contig2_MT.fa") || die $!;


@slurp = <HANDLE>;


print @slurp;         % ./file-handlesIII.pl
                      >ContigId:Contig2 AssemblyProcessId:MtSC AssemblyProcessVersion:1
                      GGGTATACTTCCTCCTCCATTGTTTGAGATATCACAAGACTTGAAATTGA
                      GCACGACCCATATTCTACTTCAAGGCGTTGAAGCAAAAACTCACCATGGG
                      AAACTAAACAGGTTAGTAAGTAGGCATCACCATCATTTTATATCGATATG
close(HANDLE);        GATAATAATGCACAAGACTTTCAAAGTTATCTTCAGATTCTTCCCCCTGT
                      TGAGTTTGCTTGCGTTTATGGATCATCTCTTCATCCAACCAATCATGACA
                      AGACAACCATGGTTGATTATATTCTTGGAGTTTCTGACCCTATACAATGG
                      CATTCTGAGAATCCGAAAATGAATAAGCATCACTATGCGTCATGGATGGT
                      GCACCTTGGTGGAGAGAGGCTGATTACCGCAGATGCAGATAAAATTGGTG
                      TGGGAGTACATTTCAACCCTTTTG



                                                                                          14


                                                                                               14
Pass data into a program

while(<STDIN>) {

    print “stdin read: $_”;

}




                              15


                                   15
Pass data into a program

open(GREP, “grep ‘>’ $filename”) || die $!;

my $i = 0;

while(<GREP>) {
  $i++;
}

close(GREP);

print “$i sequences in file\n”;

                                              16


                                                   16
Writing out data


open(OUT, “>outname”) || die $!;

print OUT “sequence report\n”;

close(OUT);




                                   17


                                        17
Writing out data


# appending with >>

open(OUT, “>>outname”) || die $!;

print OUT “append this\n”;

close(OUT);



                                    18


                                         18
Filehandles as variables


my $var = \*STDIN;




                           19


                                19
Filehandles as variables


open($fh, “>report.txt”) || die $!;


print $fh “line 1\n”;




                                      20


                                           20
Filehandles as variables


open($fh2, “report”) || die $!;

$fh = $fh2;


while(<$fh>) {

    something interesting goes here;

}
                                       21


                                            21
Zero based economy...

•The first element is ‘0’ for an index or first character in a string
 •computer scientists like it this way
 •as do most programming languages, including Perl
•Biologists often number first base in a sequence as ‘1’
 •GenBank
 •BioPerl
 •Interbase coordinates (Kent-UCSC, Chado-GMOD)
                                                                  22


                                                                       22
Coordinate systems

• Zero based, interbase coordinates
  A A T G G G T A G A
  0 1 2 3 4 5 6 7 8 9


• 1 based coordinates
     A T G G G T A G A
     1 2 3 4 5 6 7 8 9


                                      23


                                           23
Arrays as Lists

• Lists are sets of items
• Can be mixed types of scalars (numbers, strings, floats)
• Perl uses lists extensively
• Variables are prefixed by @



                                                            24


                                                                 24
List operations

 • reverse # reverse list order
 • $list[$n] # get the $n-th item
 • $two = $list[2]; # get which item?




                                        25


                                             25
List operations

 • reverse # reverse list order
 • $list[$n] # get the $n-th item
 • $three = $list[2]; # get the third   item




                                               26


                                                    26
List operations

 • scalar # get length of array
  • $len = scalar @list;
  • $last_index = $#list;
 • delete $list[10]; # delete entry




                                      27


                                           27
Autovivication
 • Autovivify : to bring oneself to life.
 • Automatically allocates space for an array item element:
    $array[0] = ‘apple’;

    $array[4] = ‘elephant’;
    $array[25] = ‘zebra’;

    delete $array[25];




                                                              28


                                                                   28
29
pop,push,shift,unshift
# remove last item
$last = pop @list;

# remove first item
$first = shift @list;

# add to end of list
push @list, $last;

# add to beginning of list
unshift @list, $first;
                             30


                                  30
splicing an array

splice ARRAY,OFFSET,LENGTH,LIST

splice ARRAY,OFFSET,LENGTH

splice ARRAY,OFFSET

splice ARRAY



                                  31


                                       31
splicing an array



@list = (‘alice’,’chad’,’rod’);

($x,$y) = splice(@list,1,2);

splice(@list, 1,0,(‘marvin’,’alex’));




                                    32


                                         32
Sorting with sort


@list = (‘tree’,’frog’, ‘log’);

@sorted = sort @list;

# reverse order
@sorted = sort { $b cmp $a } @list;




                                      33


                                           33
Sorting with arrays of numbers


@list = (25,21,12,17,9,8);


# sort based on numerics
@sorted = sort { $a <=> $b } @list;

# reverse order of sort
@revsorted = sort { $b <=> $a } @list;



                                         34


                                              34
                         #!/usr/bin/perl -w
LAB: files                #
                         # Reading protein sequence data file.

% pico files2arrays.pl
                         # File containing the sequence data
                         my $fastafilename = 'contig2_MT.fa';

                         # First we have to "open" the file
                         open(FASTAFILE, $fastafilename);

                         # Read the fastafrom file, and store it
                         # into the array variable @protein
                         @fasta = <FASTAFILE>;

                         # Print the protein onto the screen
                         print @fasta;

                         # Close the file.
                         close FASTAFILE;

                         exit;




                                                                   35
                         #!/usr/bin/perl -w
LAB: files                #
                         # Reading protein sequence data file.

% pico files2arrays.pl
                         # File containing the sequence data
                         my $fastafilename = 'contig2_MT.fa';

                         # First we have to "open" the file
                         open(FASTAFILE, $fastafilename) || die $!;

                         # Read the fastafrom file, and store it
                         # into the array variable @protein
                         @fasta = <FASTAFILE>;

                         # Print the protein onto the screen
                         print @fasta;

                         # Close the file.
                         close FASTAFILE;

                         exit;




                                                                      36
LAB: get a file in FASTA format

          http://www.ncbi.nlm.nih.gov/




                                         37
LAB: navigate to GenBank




                           38
LAB: search for your favorite protein




                                        39
LAB: favorite protein entries, change display




                                                40
LAB: change display to FASTA




                               41
LAB: we return to our      #!/usr/bin/perl -w
program, already in        #
progress                   # Reading protein sequence data file.

% pico kinase.fa           # File containing the sequence data
                           my $fastafilename = 'kinase.fa';

% pico files2arrays.pl     # First we have to "open" the file
                           open(FASTAFILE, $fastafilename) || die $!;

Add the name of the        # Read the fastafrom file, and store it
FASTA file you created to   # into the array variable @protein
the program.               @fasta = <FASTAFILE>;

Run the program.           # Print the protein onto the screen
                           print @fasta;

                           # Close the file.
                           close FASTAFILE;

                           exit;




                                                                        42
LAB: break it.

 What happens when?:

 1.   You added the file?

 2.   Did the error message go away?

 3. How would you protect your user from an error
   like this?

 Did you think that was harder than it needed to be?




                                                       43
                         #!/usr/bin/perl -w
                         # Reading data from a file using a loop
LAB: a safer method
                         # File containing the sequence data
% pico files2arrays.pl   my $fastafilename = 'kinase.fa';

% ./files2arrays.pl
                         open(FASTAFILE, $fastafilename) || die $!;

                         # Read file one line at a time and print
Run the program.         while ($protein = <FASTAFILE>) {
                                 print $protein;
                         }

                         close FASTAFILE;


                         exit;




                                                                      44
LAB:               #!/usr/bin/perl -w
breaking it.       # Reading data from a file using a loop

                   # File containing the sequence data
                   my $fastafilename = 'kinase.fa';
 Why is this
 more safe
 than reading      open(FASTAFILE, $fastafilename) || die $!;

 the file into an   # Read file one line at a time and print
 array?            while ($protein = <FASTAFILE>) {
                           print $protein;
                   }

                   close FASTAFILE;


                   exit;




                                                                45
A brief break




                46
Scope
                                          TM proctor & gamble




• Section or subsection of a program where a
  variable is valid.

• Defined by braces { }
• Use ‘my’ to declare variables.
• use strict;       # mandates declaration of
  variables.

• use   warnings;      # or ‘-w’ on shebang line

                                                                47


                                                                     47
Good practices

•   ‘my’ operator declares a variable or a list of variables to be local (private)
    to the enclosed block, subroutine, or file. It will also be recognized in
    blocks contained by that region.


•   The region in which the private variable is recognized is called its scope,
    variables declared with ‘my’ are called lexically scoped variables.


•   Lexical (private) variables are not recognized outside of their scope.


•   A private variable of a function will not be recognized in another function
    called by that function. If you want that to happen, declare the variable as
    ‘local’.


•   It is recommended that you declare all of your variables with ‘my’.
                                                                                     48


                                                                                          48
Someone else’s code

@list = (‘aardvark’, ‘baboon’, ‘cat’,
         ‘dog’,’lamb’,’kangaroo’);

for $animal ( @list )   {
  if( length($animal)   <= 3 ) {
    print “$animal is   noisy\n”;
  } else {
    print “$animal is   quiet\n”;
  }
}

                                        49


                                             49
Made more safe.

use warnings;
use strict;
my @list = (‘aardvark’, ‘baboon’, ‘cat’,
            ‘dog’,’lamb’,’kangaroo’);

for my $animal ( @list ) {
  if( length($animal) <= 3 ) {
    print “$animal is noisy\n”;
  } else {
    print “$animal is quiet\n”;
  }
}
                                           50


                                                50
Associative arrays or Hashes


     Array                                                     Hash   apple    12
                                                                      pear     3
          0        1         2         3       4        5
                                                                      cherry   30
                                                                      lemon    2
                            ‘aaron’
                  ‘steve’
         ‘john’



                                      ‘max’
                                              ‘juan’
                                                       ‘sue’
                                                                      peach    6
                                                                       kiwi    3




                                                                                    51

                                                                                     51
Associative arrays or Hashes


 • Like arrays, but instead of numbers as indices
   hashes use strings.

   my @array = (‘john’, ‘steve’, ‘aaron’,
   ‘max’, ‘juan’, ‘sue’);

   my %fruithash = ( ‘apple’ => 12, ‘pear’ =>
   3, ‘cherry’ =>30, ‘lemon’ => 2, ‘peach’ =>
   6, ‘kiwi’ => 3);




                                                    52

                                                     52
Using hashes
 • { } operator
 • Set a value
     $fruithash{‘cherry’} = 10;

 • Access a value
     print $fruithash{‘cherry’}, “\n”;

 • Remove an entry
     delete $fruithash{‘cherry’};
                                         53


                                              53
Get the Keys

• ‘keys’ function will return a list of the hash keys
 my @keys = keys %fruithash;

 for my $key ( keys %fruithash ) {
   print “$key => $hash{$key}\n”;
 }

• produces: ‘apple’, ‘pear’, ...
• Order of keys is NOT guaranteed!

                                                        54


                                                             54
Get just the values

• Similarly:
   # creates an array of hash values
   my @fruitcnt = values %fruithash;

   for my $itemcount ( @fruitcnt ) {
     print “val is $itemcount\n”;
   }


                                       55


                                            55
Iterate through a set

 • Order is not guaranteed!
 while( my ($key,$value) = each %fruithash){
     print “$key => $value\n”;
   }




                                               56


                                                    56
References

• Are “pointers” to the data object instead of object
  itself.

• A shorthand to refer to a variable and pass it
  around.

• Must “dereference” whatever is pointed at to get its
  actual value, the “reference” is just a location in
  memory.



                                                         57


                                                              57
Reference Operators
 •   \ in front gets its memory location

       my $ptr = \@vals;

 • Pointers can be assigned directly:
     •[   ] for arrays, { } for hashes

     my $ptr = [ (‘owlmonkey’, ‘lemur’)];

     my $hashptr = { ‘cdrom’ => ‘III’,     ‘start’ => 23};



                                                             58


                                                                  58
Dereferencing

• Need to cast reference back to datatype:
   my @list = @$ptr;

   my %hash = %$hashref;

• Can also use ‘{ }’ to clarify
   my @list = @{$ptr};

   my %hash = %{$hashref};
                                             59


                                                  59
Really not so hard...



   my @list = (‘fugu’, ‘human’, ‘worm’, ‘fly’);

   my $list_ref = \@list;

   my $list_ref_copy = [@list];

   for my $item ( @$list_ref ) {
     print “$item\n”;
   }


                                                  60


                                                       60
Why use references?

• Simplify argument passing to subroutines
• Allows updating data without making multiple copies.
• What if we wanted to pass in 2 arrays to a
  subroutine?

  sub func { my (@v1,@v2) = @_; }

• How do we know when one stops and another
  starts?

                                                         61


                                                              61
Why use references?

 • Passing in two arrays to intermix.
   sub func {
    my ($v1,$v2) = @_;
    my @mixed;

     while( @$v1 || @$v2 ) {
       push @mixed, shift @$v1 if @$v1;
       push @mixed, shift @$v2 if @$v2;
     }
     return \@mixed;
   }
                                          62


                                               62
References also allow Arrays of Arrays

   my @lst;
   push @lst, [‘milk’, ‘butter’, ‘cheese’];
   push @lst, [‘wine’, ‘sherry’, ‘port’];
   push @lst, [‘bread’, ‘bagels’, ‘croissants’];

   my @matrix = [ [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1] ];




                                                   63


                                                        63
Hashes of arrays


  $hash{‘dogs’} = [‘beagle’, ‘shepherd’, ‘lab’];
  $hash{‘cats’} = [‘calico’, ‘tabby’, ‘siamese’];
  $hash{‘fish’} = [‘gold’,’beta’,’tuna’];

  for my $key (keys %hash ) {
    print “$key => “, join(“\t”, @{$hash{$key}}), “\n”;
  }




                                                          64


                                                               64
Subroutines

 • Set of code that can be
   reused.

 • Can also be referred to as
   procedures and functions.

 • Often theand refining your
   factoring
             result of re-

   solution.

 • Have little to do with
   submarines.
                                65


                                     65
Defining a subroutine

•   sub routine_name { }       # declaring a subroutine


•   Calling the routine:

      routine_name;

      &routine_name;       # & is optional




                                                          66


                                                               66
Passing data to a subroutine
 • Pass in a list of data
      &dosomething($var1,$var2);

      sub dosomething {
        my ($v1,$v2) = @_;
      }


      sub dosomethingelse {
        my $v1 = shift @_;
        my $v2 = shift;
      }
                                   67


                                        67
Returning data from a subroutine

• The last line of the routine sets the return value.
   sub dothis {
     my $c = 10 + 20;
   }

   print dothis(), “\n”;

• Better to specify return value and/or a condition to leave
   routine early.


                                                           68


                                                                68
Subroutine returns true (1) if codon is a stop codon
(standard genetic code)

sub is_stopcodon {
  my $val = shift @_;

    if( length($val) != 3 ) {
      return -1;
    } elsif( $val eq ‘TAA’ || $val eq ‘TAG’ ||
        $val eq ‘TGA’ ) {
      return 1;
    } else {
      return 0;
    }
}

                                                       69


                                                            69
                        #!/usr/bin/perl -w
                        # A program with a subroutine to append AAAAT to DNA

LAB: subroutines        # The original DNA
                        $dna = 'CGACGTCTTCTCAGGCGA';

 % pico subroutine.pl   # The call to the subroutine "addPOLYA".
                        # argument passed in is $dna; result is $longer_dna
                        $longer_dna = addPOLYA($dna);

                        print "I added AAAAT to $dna and got $longer_dna\n\n";


                        # Here is the definition for subroutine "addPOLYA"
                        sub addPOLYA {
                            my($dna) = @_;

                            $dna .= 'AAAAT';
                            return $dna;
                        }

                        exit;




                                                                                 70
LAB: break it.

 Can you?:

 1.   Create better variable names?

 2.Find a potential problem with subroutines and variable
  scope?

 3.   Get it to work with GLOBAL variables?

 4.   Explain why this might be a problem?




                                                            71
LAB: add to it.

 Can you?:

 1.   Find another way to concatenate the strings?

 2.Add a subroutine that provides a reverse transcription
  service?

 3.Test for a poly-A tail before adding a poly-A tail and add
  one only if it isn’t already there?

 4.Create a file of FASTA entries and run them through your
  program?



                                                                72
Funny operators

  my @bases = qw(C A G T);


  my $msg = <<EOF
  In his return from the ship to New York, he was
  discovered by the enemy as he passed near Governors
  Island, They took chase and in an effort to escape,
  Ezra Lee cast off the timed mine, as he imagined it
  retarded him in the heavy swells of the harbor. He was
  then spotted by his men waiting for his return on the
  shore and was safely retrieved. The freed magazine,
  which was set to go off at one hour, “drifted past
  Governors Island into the East River where it exploded
  with great violence, throwing large columns of water
  and pieces of wood high in the air.”

  EOF
  ;                                                        73


                                                                73
Regular Expressions (reg’-ex)

• Part of “amazing power” of Perl
• Considered by some to be the heart and soul of Perl.
• Provide a set of very powerful and flexible facilities for
   parsing and manipulating text.

• Syntax can be tricky.
• Worth the effort to learn!
• Do not be afraid.
                                                              74


                                                                   74
Regular Expressions: the secret

• Regular Expressions represent a small, nearly
   unrelated, programming language within the Perl
   programming language.

• ‘Regexes’ are symbiotic DNA.
• A state machine operating on strings.
• Do not be afraid.

                                                     75


                                                          75
A simple regex

 if( $fruit eq ‘apple’ ||
     $fruit eq ‘Apple’ ||
     $fruit eq ‘pear’) {
   print “ matched fruit $fruit\n”;
 }

 # becomes this

 if( $fruit =~ /[Aa]pple|pear/ ){
   print “matched fruit $fruit\n”;
 }

                                      76


                                           76
Regular Expression syntax

 • use the =~ operator to match
 •   if( $var =~ /pattern/ ) {}       # scalar context

 •   my ($a,$b) = ( $var =~ /(\S+)\s+(\S+)/ );

 •   if( $var !~ m// ) { }      # true if pattern doesn’t

 •   m/REGEXPHERE/   # match

 •   s/REGEXP/REPLACE/   # substitute

 •   tr/VALUES/NEWVALUES/      # translate



                                                            77


                                                                 77
DNA ambiguity chars: (reverse compliment)

 • aMino - {A,C}, Keto - {G,T}
 • puRines - {A,G}, prYmidines - {C,T}
 • Strong - {G,C}, Weak - {A,T}
 • H (Not G)- {ACT}, B (Not A), V (Not T), D(Not C)
   $str =~ tr/
   acgtrymkswhbvdnxACGTRYMKSWHBVDNX/
   tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;

                                                      78


                                                           78
m// operator (match)

• Search a string for a pattern match
• If no string is specified, will match $_
• Pattern can contain variables which will be interpolated (and
   pattern recompiled)

      while (<>) {
              print if /$pat/;
          }

      while (<>) {
              print if /$pat/o;
          }

                                                                  79


                                                                       79
Pattern extras: suffixes


 •   /i # case insensitive


 •   /g # global match (more than one)


 •   /x # extended regex (comments and whitespace)


 •   /o # compile regex once




                                                     80


                                                          80
Regex Operators

 \     escape character - used to a metacharacter like a period, brackets, etc.
 .     (period) match any character except newline
 x!    match any instance of x
 ^x !  match any character except x
 [x] ! match any instance of x in the bracketed range - [abxyz] will match any
          instance of a, b, x, y, or z
 |     (pipe) an OR operator - [x|y] will match an instance of x or y
 () ! used to group sequences of characters or matches
 {} ! used to define numeric quantifiers
 {x} ! match must occur exactly x times
 {x,} !match must occur at least x times
 {x,y} !match must occur at least x times, but no more than y times
 ? ! preceding match is optional or one only, same as {0,1}
 * ! find 0 or more of preceding match, same as {0,}
 + ! find 1 or more of preceding match, same as {1,}
 ^ ! match the beginning of the line
 $ ! match the end of a line

                                                                                  81
Regex: Character Operators

 \d !matches a digit, same as [0-9]
 \D !matches a non-digit, same as [^0-9]
 \s ! matches a whitespace character (space, tab, newline, etc.)
 \S ! matches a non-whitespace character
 \w !matches a word character
 \W !matches a non-word character




                                                                   82
Regex: POSIX Operators

 [:alnum:] !alphabetic and numeric characters
 [:alpha:] ! alphabetic characters
 [:blank:] !     space and tab
 [:cntrl:] ! control characters
 [:digit:] ! digits
 [:graph:] ! non-blank (not spaces and control characters)
 [:lower:] ! lowercase alphabetic characters
 [:print:] ! any printable characters
 [:punct:] ! punctuation characters
 [:space:] ! all whitespace characters (includes [:blank:], newline, carriage return)
 [:upper:] ! uppercase alphabetic characters
 [:xdigit:] ! digits allowed in a hexadecimal number (i.e. 0-9, a-f, A-F)




                                                                                        83
Regex: Additional Modules, 180+ found
 POSIX::Regex                                   Regexp::Common::number
 OO interface for the gnu regex engine          provide regexes for numbers
 POSIX-Regex-0.89 - 18 Aug 2006 - Paul Miller   Regexp-Common-2.120 - 15 Mar 2005 - Abigail

 Regexp::Common                                 Regexp::Common::profanity
 Provide commonly requested regular             provide regexes for profanity
 expressions                                    Regexp-Common-2.120 - 15 Mar 2005 - Abigail
 Regexp-Common-2.120 - 15 Mar 2005 - Abigail
                                                Regexp::English
 Regexp::Common::CC                             Perl module to create regular expressions
 provide patterns for credit card numbers.      more verbosely
 Regexp-Common-2.120 - 15 Mar 2005 - Abigail    Regexp-English-1.00 - 10 Jul 2005 - chromatic

 Regexp::Common::IRC                            Regexp::Ethiopic
 provide patterns for parsing IRC messages      Regular Expressions Support for Ethiopic
 Regexp-Common-IRC-0.02 - 18 Dec 2005 -         Script.
 Chris Prather                                  Regexp-Ethiopic-0.15 - 22 Nov 2006

 Regexp::Common::URI
 provide patterns for URIs.
 Regexp-Common-2.120 - 15 Mar 2005 - Abigail



                                                                                                84
Simple regex
my $line = “aardvark”;


if( $line =~ /aa/ ) {
   print “has a double aa\n”
}
if( $line =~ /(a{2})/ ) {
  print “has double aa\n”
}
if( $line =~ /(a+)/ ) {
  print “has 1 or more a\n”
}

                               85


                                    85
Matching gene names
# YFL001C YAR102W - yeast ORF names
# let-1, unc-7 - worm names

# ENSG000000101 - human Ensembl gene names

while(<IN>) {

    if( /^(Y([A-P])(R|L)(\d{3})(W|C)(\-\w)?)/ ) {

    printf “yeast gene %s, chrom %d,%s arm, %d %s strand\n”,

     $1, (ord($2)-ord(‘A’))+1, $3, $4;

    } elsif( /^(ENSG\d+)/ ) { print “human gene $1\n” }
    elsif( /^(\w{3,4}\-\d+)/ ) { print “worm gene $1\n”; }

}
                                                               86


                                                                    86
Regex GenBank record into FASTA components

my ($anno, $dna) = ($rec =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);




LOCUS appears at the beginning of the GenBank record,
followed by any number of characters including newlines
with .*, followed by the string ORIGIN, followed by possibly
some whitespace with \s*, followed by a newline \n.

This matches the annotation part of the GenBank record.



                                                                   87


                                                                        87
Putting it together


 A parser for output from a gene
 prediction program




                                   88


                                        88
                   GlimmerM (Version 3.0)
                Sequence name: BAC1Contig11
                 Sequence length: 31797 bp

                   Predicted genes/exons

Gene Exon Strand Exon                    Exon Range     Exon
  #    #         Type                                 Length

   1    1   +   Initial          13907        13985      79
   1    2   +   Internal         14117        14594     478
   1    3   +   Internal         14635        14665      31
   1    4   +   Internal         14746        15463     718
   1    5   +   Terminal         15497        15606     110

   2    1   +   Initial          20662        21143     482
   2    2   +   Internal         21190        21618     429
   2    3   +   Terminal         21624        21990     367

   3    1   -   Single           25351        25485     135

   4    1   +   Initial          27744        27804      61
   4    2   +   Internal         27858        27952      95
   4    3   +   Internal         28091        28576     486
   4    4   +   Internal         28636        28647      12
                            89
   4    5   +   Internal         28746        28792      47
   4    6   +   Terminal         28852        28954     103    89


   5    3   -   Terminal         29953        30037      85
Putting it together
  while(<>) {
   if(/^(Glimmer\S*)\s+\((.+)\)/ {
    $method = $1; $version = $2;
   } elsif( /^(Predicted genes)|(Gene)|(\s+\#)/ ||
            /^\s+$/ ) { next
   } elsif(                 # glimmer 3.0 output
            /^\s+(\d+)\s+    # gene num
              (\d+)\s+       # exon num
              ([\+\-])\s+    # strand
              (\S+)\s+       # exon type
              (\d+)\s+(\d+) # exon start, end
              \s+(\d+)       # exon length!    /ox ) {


  my ($genenum,$exonnum,$strand,$type,$start,$end,
      $len) = ( $1,$2,$3,$4,$5,$6,$7);
    }
  }
                                                         90


                                                              90
Day II: assignment.

1. Modify one of your existing programs to do something useful
 using a Regular Expression. (see the last lab)

2. Read about Perl DBI. (Safari on-line documentation is available.)

3. Read about BioPerl. (Safari and CPAN)

4. Write a paragraph describing what you hope to do with Perl in
 your BSI project and email it to me. (kunau@umn.edu)




                                                                       91
If you remember nothing else

•Biology is hard and messy: better tools will help.

•The key problems are social.
•Together we are smarter than any one of us.
•Technology is easy by comparison.




                                                      92
Questions?




             93
Thank You.




             94