Advanced Perl for Bioinformatics

     Lecture 5
  Regular expressions - review
• You can put the pattern you want to match between //,
  bind the pattern to the variable with =~, then use it within
  a conditional:

       if    ($dna =~ /GAATTC/) {print "EcoRI\n";}

• Square brackets within the match expression allow for
  alternative characters:

       if    ($dna =~ /CAG[AT]CAG/) {print "Found CAGACAG or CAGTCAG\n";}

• A vertical line means “or”; it allows you to look for either of
  two completely different patterns:

     if     ($dna =~ /GAAT|ATTC/) {print "Found GAAT or ATTC\n";}
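
The three kinds of pattern above can be tried together in one short script. This is only a sketch: the test sequence is made up so that the first two patterns match and the third does not.

```perl
use strict;
use warnings;

# Hypothetical test sequence, chosen so that some patterns have something to find.
my $dna = "TTGAATTCAACAGTCAGGG";

my @found;

# Literal pattern: the EcoRI recognition site.
push @found, "EcoRI" if $dna =~ /GAATTC/;

# Character class: [AT] matches exactly one character, either A or T.
push @found, "CAG[AT]CAG" if $dna =~ /CAG[AT]CAG/;

# Alternation: either GGATCC or AAGCTT may match (neither is present here).
push @found, "BamHI or HindIII" if $dna =~ /GGATCC|AAGCTT/;

print "Matched: $_\n" for @found;
```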
Reading and writing files, review
• Open a file for reading:

open INPUT,"/home/class30/input.txt";

• Or writing

open OUTPUT,">/home/class30/output.txt";

• Make sure you can open it!
open INPUT, "input.txt" or die "Can't open file: $!\n";
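
A minimal sketch of a full write-then-read round trip, assuming a throwaway temporary directory instead of the class directory. The file name and the two sequence lines are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory so the example does not touch real files.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/output.txt";    # hypothetical file name

# ">" opens the file for writing (and empties it if it already exists).
open OUTPUT, ">$path" or die "Can't open $path for writing: $!\n";
print OUTPUT "ATGGAATTCTAA\n";
print OUTPUT "GGATCCAAGCTT\n";
close OUTPUT;

# No prefix opens the file for reading.
open INPUT, $path or die "Can't open $path for reading: $!\n";
my @lines;
while (my $line = <INPUT>) {
    chomp $line;             # remove the trailing newline
    push @lines, $line;
}
close INPUT;

print scalar(@lines), " lines read\n";
```

Newer Perl code often uses the three-argument form of open with a lexical filehandle instead; the bareword style is kept here only to match the slides.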
Test time

Last one…

Perl has another super useful data structure
called a hash, for want of a better name.

A hash is an associative array, i.e. it
is a collection of values, each of which is
stored and looked up under its own key.
         Making a hash of it
• You can think of a hash just as if it were a
  set of questions and answers

my %matthash = ("first_name" => "Matt",
               "surname" => "Hudson",
               "age" => "secret",
               "height" => 187, #cm
               "hairstyle" => "D minus");
         Getting the hash back
my %matthash = ("first_name" => "Matt",
                    "surname" => "Hudson",
                    "age" => "secret",
                    "height" => 187, #cm
                    "hairstyle" => "D minus");

print "my name is ", $matthash{first_name};
print " ", $matthash{surname}, "\n";

You can store a lot of information and recover it easily
   and quickly without knowing in what order you added it,
   unlike an array.
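
As a further sketch of storing and recovering values by key, the script below uses a small codon table. The variable names are chosen for this example only.

```perl
use strict;
use warnings;

# Codon => amino acid (three-letter code).
my %amino_acid = (
    "ATG" => "Met",
    "TGG" => "Trp",
);

# Add a new key-value pair after the hash has been created.
$amino_acid{"TAA"} = "Stop";

# Look a value up by its key; the order of insertion does not matter.
print "ATG codes for $amino_acid{ATG}\n";

# exists tests whether a key is present in the hash.
my $has_ggg = exists $amino_acid{"GGG"} ? "yes" : "no";
print "Is GGG in the hash? $has_ggg\n";

# In scalar context, keys gives the number of entries.
my $count = keys %amino_acid;
print "$count codons stored\n";
```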
        Hashes as an array

• You can get the “keys” of the hash and
  use them like an array:

foreach my $info (keys %matthash){
  print "$info = $matthash{$info}\n";
}
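
One thing to keep in mind is that keys returns the keys in no particular order. A small sketch, using a made-up hash of amino acid codes, of sorting the keys when the output order matters:

```perl
use strict;
use warnings;

# One-letter => three-letter amino acid codes.
my %three_letter = (
    "G" => "Gly",
    "A" => "Ala",
    "C" => "Cys",
);

# Sort the keys so the report always comes out in the same order.
my @report;
foreach my $letter (sort keys %three_letter) {
    push @report, "$letter = $three_letter{$letter}";
}
print "$_\n" for @report;
```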
 Why are hashes useful? Exercise.
• Many of you might have noticed, in the exercise
  on restriction sites, that there was no way to
  keep track of which sites were which.

• Modify your script using a hash like this one:

my %enzymehash = (
"EcoRI" => "GAATTC",
"BamHI" => "GGATCC",
"HindIII" => "AAGCTT");
                     (an) answer
foreach my $name (keys %enzymehash){
       if ($sequence =~ /$enzymehash{$name}/) {
           print "I found a site for $name, $enzymehash{$name}\n";
       }
}
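
A complete, runnable variant of this answer might look like the sketch below. The test sequence is invented, and reporting the position of the site (via the built-in @- array) is an addition that goes beyond the exercise.

```perl
use strict;
use warnings;

my %enzymehash = (
    "EcoRI"   => "GAATTC",
    "BamHI"   => "GGATCC",
    "HindIII" => "AAGCTT",
);

# Hypothetical test sequence containing one EcoRI and one HindIII site.
my $sequence = "ATGCGAATTCTTAGAAGCTTCC";

my %position;    # enzyme name => 1-based start of its first site
foreach my $name (sort keys %enzymehash) {
    if ($sequence =~ /$enzymehash{$name}/) {
        # $-[0] holds the 0-based offset where the last match started.
        $position{$name} = $-[0] + 1;
        print "I found a site for $name, $enzymehash{$name}, at position $position{$name}\n";
    }
}
```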
          Putting data in a hash
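my %hash;

# Either capture the key and the value with a regex...
while (<FILE>) {
       if (/stuff(important stuff) more stuff (best stuff)/) {
              $hash{$1} = $2;
       }
}

# ...or split each tab-delimited line into its columns
while (my $line = <FILE>) {
       chomp $line;
       my @tmp = split /\t/, $line;
       $hash{$tmp[0]} = $tmp[1];
}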
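
To try the split approach without creating a file, the sketch below reads from a string through an in-memory filehandle. The tab-delimited layout (name, tab, site) is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical tab-delimited data: enzyme name, a tab, recognition site.
my $data = "EcoRI\tGAATTC\nBamHI\tGGATCC\nHindIII\tAAGCTT\n";

# Opening a reference to a string gives a filehandle that reads from memory,
# which stands in for a real file here.
open FILE, "<", \$data or die "Can't open in-memory file: $!\n";

my %hash;
while (my $line = <FILE>) {
    chomp $line;                          # drop the newline before splitting
    my ($name, $site) = split /\t/, $line;
    next unless defined $site;            # skip lines with no second column
    $hash{$name} = $site;
}
close FILE;

print "$_ cuts at $hash{$_}\n" for sort keys %hash;
```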
              Advanced regex
• The fun isn’t over yet.

• You can match precise numbers of characters
• Any number of characters
• Positions in a line
• Precise formatting (spaces, tabs etc)
• You can get bits of the string you matched out and
  store them in variables
• You can use regexes to substitute or to translate
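
Each item in this list can be sketched in a line or two of Perl. The sequence and the annotation line below are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGAAAAAACCCGGGTAA";    # hypothetical sequence

# {6} matches a precise number of repeats: six A's in a row.
my $has_polya = $dna =~ /A{6}/ ? 1 : 0;

# ^ and $ anchor the pattern to the start and end of the string;
# .* matches any number of characters in between.
my $looks_like_orf = $dna =~ /^ATG.*TAA$/ ? 1 : 0;

# \s+ matches one or more whitespace characters (spaces or tabs),
# and each pair of parentheses captures part of the match.
my $line = "AT1g34399\t  1250 bp";
my ($id, $length) = $line =~ /^(\S+)\s+(\d+) bp$/;

# s/// substitutes: here every T becomes U (DNA to RNA).
(my $rna = $dna) =~ s/T/U/g;

# tr/// translates character by character: here, complement each base.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

print "poly-A: $has_polya, ORF-like: $looks_like_orf\n";
print "$id is $length bp long\n";
print "RNA:        $rna\n";
print "Complement: $complement\n";
```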
        Grabbing bits of the regex
  my $blastline = "Query= AT1g34399 gene CDS";
  if ($blastline =~ /Query= (.+) gene/) {
    my $atgnumber = $1;
    print "The accession number is $atgnumber\n";
  }

Whatever matches the part of the regex inside round brackets (parentheses)
  is stored in the special variable $1, which you can then use for other stuff.
  If you put another pair of brackets in, its contents will be stored in $2.
  Only rely on $1 and $2 after a successful match.
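
A short sketch with two pairs of brackets, reusing the same example line. The variable names $accession and $feature are chosen for this example only.

```perl
use strict;
use warnings;

my $blastline = "Query= AT1g34399 gene CDS";

# Two pairs of parentheses: the first fills $1, the second fills $2.
my ($accession, $feature);
if ($blastline =~ /Query= (\S+) gene (\w+)/) {
    ($accession, $feature) = ($1, $2);
    print "Accession $accession, feature type $feature\n";
}
else {
    print "No match, so \$1 and \$2 should not be trusted here\n";
}
```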
                  Using modules
• You can use other people's modules, including
  those that come with Perl. These provide extra
  commands, or change the way your Perl script
  behaves. E.g.

use strict;
use warnings;
use Bio::Perl;

You will see these stacked up at the beginning of more complicated
Perl scripts. Some modules come with Perl (strict, warnings);
others you need to download and add in yourself.
#man perlmod
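
As a small sketch of what a module adds, the script below uses List::Util, which is distributed with Perl, to get two extra commands. The fragment sizes are made up.

```perl
use strict;      # forces variables to be declared with my
use warnings;    # reports likely mistakes, such as using an undefined value
use List::Util qw(max sum);    # a module that comes with Perl

my @fragment_sizes = (1200, 350, 4800);    # hypothetical sizes in bp

# max and sum are not built in; they are imported from List::Util.
my $largest = max @fragment_sizes;
my $total   = sum @fragment_sizes;

print "Largest fragment: $largest bp\n";
print "Total length: $total bp\n";
```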
          A last exercise?...
• So: how might hashes help you solve this?

• Open up a BLAST output file

• Spit out the name of the query sequence,
  the top hit, and how many hits there were.
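
One possible starting point, offered as a sketch rather than the answer. The report below is made up and heavily simplified; real BLAST output has more sections and its layout varies between versions, so the patterns are assumptions you would need to adapt. A hash keyed by hit name keeps each score and also gives the number of hits.

```perl
use strict;
use warnings;

# A made-up, simplified report standing in for a real BLAST output file.
my $report = <<'END';
Query= AT1g34399 gene CDS

Sequences producing significant alignments:      (bits)  Value

hitA  hypothetical protein                          210   2e-54
hitB  hypothetical protein                          180   1e-45
hitC  hypothetical protein                           95   3e-20

>hitA hypothetical protein
END

# Read from the string as if it were a file.
open BLAST, "<", \$report or die "Can't read report: $!\n";

my ($query, $top_hit);
my %score;          # hit name => bit score
my $in_hits = 0;    # true while reading the summary table

while (my $line = <BLAST>) {
    chomp $line;
    if ($line =~ /^Query= (\S+)/) {
        $query = $1;
    }
    elsif ($line =~ /^Sequences producing significant alignments/) {
        $in_hits = 1;
    }
    elsif ($line =~ /^>/) {
        $in_hits = 0;    # the alignments start here, so the table is over
    }
    elsif ($in_hits and $line =~ /^(\S+)\s.*\s(\d+)\s+\S+$/) {
        $top_hit = $1 unless defined $top_hit;    # first row is the best hit
        $score{$1} = $2;
    }
}
close BLAST;

my $hit_count = keys %score;
print "Query: $query\nTop hit: $top_hit\nHits: $hit_count\n";
```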
       Programming projects
• Now it’s time to think of your programming project

• Hopefully you have some ideas – we’ll discuss
  how feasible they are in the time available

• If not, here are some suggestions
    Suggested program functions
•   Translate a cDNA into protein, and then check it against the Pfam database
    for HMM hits.

•   Make a real restriction map of a DNA sequence, with predicted fragment sizes

•   Align proteins of a favorite family, open the alignment and find residues that
    are totally conserved.

•   Perform BLAST against the latest version of the database files for a particular
    organism: the program checks whether the user has the latest files and, if not,
    downloads them.

•   Design PCR primers, to make a fragment size chosen by the user, for a
    sequence input from a fasta file.

•   Check whether primer sites are unique in a sequenced, or partially
    sequenced, genome, and give an “electronic PCR” result.

•   Output an XML formatted version of a BLAST or HMMER text file.

•   Analyze codon usage in a protein coding DNA sequence and calculate the
    Ka/Ks ratio
