Advanced Perl for Bioinformatics
Shared by: cuiliqing
Posted: 8/19/2012


Lecture 5
Regular expressions - review
• You can put the pattern you want to match between //,
bind the pattern to the variable with =~, then use it within
a conditional:
if ($dna =~ /GAATTC/) {print "EcoRI\n";}
• Square brackets within the match expression allow for
alternative characters:
if ($dna =~ /CAG[AT]CAG/)
• A vertical line means “or”; it allows you to look for either of
two completely different patterns:
if ($dna =~ /GAAT|ATTC/)
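As a minimal sketch, the three kinds of pattern reviewed above can be tried on one sequence. The sequence and the helper name find_sites are made up for illustration; they are not part of the lecture.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a label for each pattern that matches $dna.
sub find_sites {
    my ($dna) = @_;
    my @found;
    # GAATTC is the EcoRI recognition sequence
    push @found, 'literal GAATTC'        if $dna =~ /GAATTC/;
    # [AT] accepts either A or T at that one position
    push @found, 'class CAG[AT]CAG'      if $dna =~ /CAG[AT]CAG/;
    # | accepts either of two complete patterns
    push @found, 'alternation GAAT|ATTC' if $dna =~ /GAAT|ATTC/;
    return @found;
}

my $dna = 'TTCAGTCAGAAGAATTCGG';    # made-up test sequence
print "$_\n" for find_sites($dna);
```

Wrapping the tests in a subroutine is only a convenience for trying several sequences; the if statements on the slide work the same way.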
Reading and writing files, review
• Open a file for reading:
open INPUT, "/home/class30/input.txt";
• Or for writing:
open OUTPUT, ">/home/class30/output.txt";
• Make sure you can open it!
open INPUT, "input.txt" or die "Can't open file: $!\n";
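A rough sketch of a full write-then-read round trip follows. It uses the three-argument form of open with lexical filehandles ($out, $in), a common alternative to the bareword handles on the slide, and a scratch file from the core File::Temp module so that no real path is assumed. The file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the example does not depend on any real path.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open for writing: the mode '>' is passed as its own argument.
open my $out, '>', $path or die "Can't open $path for writing: $!\n";
print $out "ATGCGT\n";
print $out "GGATCC\n";
close $out or die "Can't close $path: $!\n";

# Open for reading and collect the lines.
open my $in, '<', $path or die "Can't open $path for reading: $!\n";
my @lines;
while (my $line = <$in>) {
    chomp $line;              # remove the trailing newline
    push @lines, $line;
}
close $in;

print scalar(@lines), " lines read\n";
```

Including $! in the die message reports why the open failed (for example a missing file or a permissions problem).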
Test time
Last one…
Hashes
Perl has another super useful data structure
called a hash, for want of a better name.
A hash is an associative array, i.e. a
collection of values, each stored under and
looked up by a unique key (a string) rather
than by a numeric position.
Making a hash of it
• You can think of a hash just as if it were a
set of questions and answers
my %matthash = ("first_name" => "Matt",
                "surname" => "Hudson",
                "age" => "secret",
                "height" => 187, #cm
                "hairstyle" => "D minus"
                );
Getting the hash back
my %matthash = ("first_name" => "Matt",
                "surname" => "Hudson",
                "age" => "secret",
                "height" => 187, #cm
                "hairstyle" => "D minus"
                );
print "my name is ", $matthash{first_name};
print " ", $matthash{surname}, "\n";
You can store a lot of information and recover it easily
and quickly by name, without needing to know the order in
which you added it, unlike with an array.
Hashes as an array
• You can get the “keys” of the hash and
use them like an array:
foreach my $info (keys %matthash){
    print "$info = $matthash{$info}\n";
}
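A few more everyday hash operations, as a minimal sketch. The key "office" is a hypothetical addition for illustration only. Note that keys returns the keys in no guaranteed order, so the sketch sorts them to get stable output.

```perl
use strict;
use warnings;

my %matthash = (
    first_name => 'Matt',
    surname    => 'Hudson',
    height     => 187,    # cm
);

# Add or overwrite an entry by assigning to its key.
$matthash{office} = 'unknown';    # hypothetical key, for illustration only

# exists tests for a key without creating it.
my $has_age = exists $matthash{age} ? 1 : 0;

# delete removes a key together with its value.
delete $matthash{office};

# Sort the keys so the output order is the same on every run.
my @sorted = sort keys %matthash;
foreach my $info (@sorted) {
    print "$info = $matthash{$info}\n";
}
```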
Why are hashes useful? Exercise.
• Many of you might have noticed in the exercise
on restriction sites that, using arrays, there was
no way to keep track of which site belonged to
which enzyme
• Modify your script using a hash like this one:
my %enzymehash = (
    "EcoRI" => "GAATTC",
    "BamHI" => "GGATCC",
    "HindIII" => "AAGCTT");
(an) answer
foreach my $name (keys %enzymehash){
    if ($sequence =~ /$enzymehash{$name}/) {
        print "I found a site for $name, $enzymehash{$name}\n";
    }
}
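One possible extension of this answer, sketched under assumptions: record where each site occurs, not just whether it occurs. The helper name map_sites and the test sequence are made up. The built-in array @- holds match offsets, so $-[0] is the 0-based start of the most recent match.

```perl
use strict;
use warnings;

my %enzymehash = (
    EcoRI   => 'GAATTC',    # GAATTC is the EcoRI recognition sequence
    BamHI   => 'GGATCC',
    HindIII => 'AAGCTT',
);

# Hypothetical helper: returns enzyme name => list of 1-based start positions.
sub map_sites {
    my ($sequence, $enzymes) = @_;
    my %hits;
    foreach my $name (sort keys %$enzymes) {
        my $site = $enzymes->{$name};
        # /g in a while loop steps through every match in turn
        while ($sequence =~ /\Q$site\E/g) {
            push @{ $hits{$name} }, $-[0] + 1;
        }
    }
    return %hits;
}

my $sequence = 'AAGAATTCTTGGATCCAAGAATTC';    # made-up sequence
my %hits = map_sites($sequence, \%enzymehash);
foreach my $name (sort keys %hits) {
    print "$name ($enzymehash{$name}): @{ $hits{$name} }\n";
}
```

Enzymes with no site in the sequence simply never get a key in %hits, so the final loop prints only the enzymes that cut.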
Putting data in a hash
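my %hash;
while (<FILE>) {
    # placeholder pattern: the two pairs of parentheses capture the key and the value
    if (/stuff(important stuff) more stuff (best stuff)/) {
        $hash{$1} = $2;    # only store when the match succeeded
    }
}
Or….
while (my $line = <FILE>) {
    chomp $line;                    # drop the trailing newline
    my @tmp = split /\t/, $line;    # split the line on tabs
    $hash{$tmp[0]} = $tmp[1];
}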
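The split version can be tried without a real file. In this sketch an in-memory filehandle stands in for FILE, and the tab-delimited data is made up; the checks for blank or incomplete lines are additions, not part of the slide.

```perl
use strict;
use warnings;

# Made-up tab-delimited data; note the blank line in the middle.
my $data = "EcoRI\tGAATTC\nBamHI\tGGATCC\n\nHindIII\tAAGCTT\n";
open my $fh, '<', \$data or die "Can't open in-memory data: $!\n";

my %hash;
while (my $line = <$fh>) {
    chomp $line;                   # keep the newline out of the value
    next unless $line =~ /\S/;     # skip blank lines
    my ($key, $value) = split /\t/, $line;
    next unless defined $value;    # skip lines without a second column
    $hash{$key} = $value;
}
close $fh;

print "$_ => $hash{$_}\n" for sort keys %hash;
```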
Advanced regex
• The fun isn’t over yet.
• You can match precise numbers of characters
• Any number of characters
• Positions in a line
• Precise formatting (spaces, tabs etc)
• You can get bits of the string you matched out and
store them in variables
• You can use regexes to substitute or to translate
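Each item in this list can be sketched in a line or two. The sequence and the sample line below are made up for illustration.

```perl
use strict;
use warnings;

my $dna = 'ATGAAAAAACCCGGGTTTTAA';    # made-up sequence

# Precise number of characters: exactly six A's in a row.
my $has_six_a = $dna =~ /A{6}/ ? 1 : 0;

# Any number of characters: .* matches zero or more of anything.
my $atg_then_taa = $dna =~ /ATG.*TAA/ ? 1 : 0;

# Positions in a line: ^ anchors to the start, $ to the end.
my $starts_with_atg = $dna =~ /^ATG/ ? 1 : 0;
my $ends_with_stop  = $dna =~ /(?:TAA|TAG|TGA)$/ ? 1 : 0;

# Precise formatting: \s+ matches one or more spaces or tabs.
# Parentheses pull out the parts of the line that matched.
my $line = "gene1 \t  42";
my ($name, $count) = $line =~ /^(\S+)\s+(\d+)$/;

# Substitution: s/// replaces matched text (here DNA to RNA).
(my $rna = $dna) =~ s/T/U/g;

# Translation: tr/// maps characters one for one; combined with
# reverse it gives the reverse complement.
(my $revcomp = scalar reverse $dna) =~ tr/ACGT/TGCA/;

print "six A's: $has_six_a, ATG...TAA: $atg_then_taa\n";
print "starts with ATG: $starts_with_atg, ends with stop: $ends_with_stop\n";
print "name=$name count=$count\n";
print "RNA:     $rna\n";
print "revcomp: $revcomp\n";
```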
Grabbing bits of the regex
my $blastline = "Query= AT1g34399 gene CDS";
if ($blastline =~ /Query= (.+) gene/) {
    my $atgnumber = $1;
    print "The accession number is $atgnumber\n";
}
The part of the string matched by the section of the regex inside
parentheses is stored in the special variable $1, which you can then use
for other things. If you add a second pair of parentheses, its match will be stored in $2.
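A short sketch with two pairs of parentheses, using the same example line. The variable names $feature, $id and $type are made up. It also shows two habits worth adopting: test that the match succeeded before reading $1 and $2, or capture straight into variables by matching in list context.

```perl
use strict;
use warnings;

my $blastline = 'Query= AT1g34399 gene CDS';

# Test the match first: after a failed match, $1 and $2 still hold
# whatever an earlier successful match left in them.
my ($atgnumber, $feature);
if ($blastline =~ /^Query=\s+(\S+)\s+gene\s+(\S+)/) {
    ($atgnumber, $feature) = ($1, $2);
}

# In list context a match returns its captures directly.
my ($id, $type) = $blastline =~ /^Query=\s+(\S+)\s+gene\s+(\S+)/;

print "The accession number is $atgnumber, feature $feature\n";
print "Same values from list context: $id, $type\n";
```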
Using modules
• You can use other people's modules, including
those that come with Perl. These provide extra
commands, or change the way your Perl script
behaves. E.g.
use strict;
use warnings;
use Bio::Perl;
You will see these stacked up at the beginning of more complicated
Perl scripts. Some modules come with Perl (strict, warnings);
others you need to download and install yourself.
#man perlmod
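As a small sketch of a module providing extra commands, the following uses List::Util, which ships with Perl, so nothing needs to be downloaded. The fragment lengths are made-up values.

```perl
use strict;
use warnings;
use List::Util qw(sum max);    # core module: imports sum() and max()

my @fragment_lengths = (1200, 350, 4800, 75);    # made-up values, in bp
my $total   = sum(@fragment_lengths);
my $longest = max(@fragment_lengths);

print "total $total bp, longest fragment $longest bp\n";
```

The qw(sum max) list names the functions to import; without it, List::Util exports nothing by default.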
A last exercise?...
• So: how might hashes help you solve this?
• Open up a BLAST output file
• Spit out the name of the query sequence,
the top hit, and how many hits there were.
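One way to start on this, as a rough sketch only. The report fragment below is invented and merely laid out like a plain-text BLAST report; real reports vary with BLAST version and options, so the patterns are assumptions to check against your own file. A hash gives lookup by hit name, and an array alongside it keeps the order in which hits were listed.

```perl
use strict;
use warnings;

# Invented report fragment (an assumption, not real BLAST output).
my $report = <<'END';
Query= AT1g34399 gene CDS

Sequences producing significant alignments:                      (bits) Value

hitA hypothetical protein                                          200   1e-50
hitB hypothetical protein                                          150   2e-35
hitC hypothetical protein                                           90   3e-18

>hitA hypothetical protein
END

open my $fh, '<', \$report or die "Can't open report: $!\n";

my ($query, @order, %score);
my $in_hits = 0;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^Query=\s+(\S+)/) {
        $query = $1;
    }
    elsif ($line =~ /^Sequences producing significant alignments/) {
        $in_hits = 1;       # the hit table starts after this line
    }
    elsif ($in_hits && $line =~ /^>/) {
        last;               # alignments begin, so the hit table is over
    }
    elsif ($in_hits && $line =~ /^(\S+)\s.*\s(\d+)\s+\S+\s*$/) {
        push @order, $1;    # the array keeps the report order
        $score{$1} = $2;    # the hash gives lookup by hit name
    }
}
close $fh;

print "Query: $query\n";
print "Top hit: $order[0] ($score{$order[0]} bits)\n";
print "Number of hits: ", scalar(@order), "\n";
```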
Programming projects
• Now it’s time to think of your programming
projects.
• Hopefully you have an idea; we'll discuss
how feasible it is in the time available
• If not, here are some suggestions
Suggested program functions
• Translate a cDNA into protein, and then check it against the Pfam database
for HMM hits.
• Make a real restriction map of a DNA sequence, with predicted fragment sizes
• Align proteins of a favorite family, open the alignment and find residues that
are totally conserved.
• Perform BLAST against the latest version of the database files for a particular
organism, first checking whether the user has the latest files and downloading
them if not
• Design PCR primers, to make a fragment size chosen by the user, for a
sequence input from a FASTA file.
• Check whether primer sites are unique in a sequenced, or partially
sequenced, genome, and give an "electronic PCR" result.
• Output an XML formatted version of a BLAST or HMMER text file.
• Analyze codon usage in a protein coding DNA sequence and calculate the
Ka/Ks ratio