Perl Project by ygq15756

VIEWS: 0 PAGES: 12

									Perl Project




               Matt Gonwa
                REU 2006
        Because of my interest in applying mathematics and computation to biology, I
asked Dr. Robert Keenan, Assistant Professor in the Department of Biochemistry and
Molecular Biology, for advice in selecting a project. He pointed out that computer
programming, which facilitates the analysis of large biological data sets, is an invaluable
tool for the mathematical biologist. Thus he recommended that I learn Perl, a
programming language that is widely used in the bioinformatics community, in order to
develop some programs for working with DNA and protein sequences.

        He provided me with a folder of unprocessed DNA sequence files which had been
returned to him by the Cancer Research Center DNA Sequencing and Genotyping Core
Facility. I have developed programs to complete the following tasks:
    (1) Remove the whitespace from all DNA sequence files in a folder (Fig. 1, Fig.2).
    (2) Remove the “noise” which appears before the start codon and after the stop
        codon ( Fig.3).
    (3) Translate DNA sequences into protein sequences (Fig. 3).
    (4) Separate easily translatable protein sequences from protein sequences that require
        more complex translation algorithms.
    (5) Compare the amino acid composition of protein sequences as a function of
        position.

NNNNNNNNNNNNNNTCCCNTCTANANNATTTTGTTTAACTTTAAGAAGGAGATATACCATGGGCAGCAGCCATCATCA
TCATCATCACAGC
AGCGGCGAAAACCTG
TACTTCCAGGGTCATATGATTGAAGTCAAACCAATAAACGCGGAAGATACGTATGACATCAGGCACCGCATTCTCCGG
CCGAATCAGCC
GCTTGAAGCAAGTAT
GTATGAAACCGATTTGACCGGGAGTGCGTTTCACCTCGGTGGATTTTACCGGGGCAAGCTGATCAGCATCGCTTCCTTT
CATAAAGCCGA
ACATTCAGAGCTTCA
AGGCGAAGAACAGTATCAGCTGAGAGGGGTGGCGACGCTTGAAGGATACCGTGAGCAAAAAGCGGGAAGCACGCTCG
TCAAGCATG
CCGAAGAGATTCTT
CGGAAAAAGGGGGCAGACCTTATATGGAGCAATGCCAGGACATCTGTGAGCGGCTACTATGAAAAGCTCGGCTTCAGC
GAACAGGGC
GAAGTCTACGACA
CACCGCCGATCGGACCTCATATTTTGATGTATAAGAAATTGACGTAATCTCGAGCACCACCACCACCACNNCTGAGNTN
CNGNTGCTAAC
AAANNCNGAAAG
GAAGCTGAGTTGGCTGCTGCCACCGCTGANCAATANCTAGCANAACCCCCNTGGGNCTCTAAGCGGGTCTTGAGGGNT
TTTTGCTGAAAG
AGGAACTATNNNN
NGANNGNCNNNNGGGACGCGNCCTGNANNGCGCANNNANCGCGGCGGGTGTGGTGGTTACNCNCACGNGACCGCTNC
ACTTGGCAGCNN
CCTANCGCCCGCT
CCTTGCGCTTTCGTTCNNTTCCTTTCTCGCCGNNNNCGCCGGCTTTCCCCGTNNAGCTNTNAATCGNGNGCTCGNTGNNA
GGGTTCCGATTNA
TGCNNACGNCAC
CNNNACGCCANNAANTTGATNANGTNNNNNTTCAGTTANNGNCANCGGCNNNNNNACNNNNATCCGNCCTTGGGANN
NNAANNCNANNTNC
TNNAANAGGANN
CNAGNTNCGAANCGNGGNNNNNNNCTNNNACGCANNNTGNACNNNANNNNTTTGGAANNNTANANNNNNANNTTNN
CNNNNNTNNNGG
NNNNNNGGGGNNCAAANANNNNNNNN

Figure 1 – DNA sequence file with white space.
          The following program (format.pl) proceeds as follows:
    (1)    Accepts an input from the user. The user enters the directory where sequence
           files are stored.
    (2)    format.pl then opens the directory entered by the user.
    (3)    For each .seq file in the directory, the program opens the file and stores the
           contents of the file into an array then converts the array into a scalar, and
           substitutes no space („‟) for all new line characters.
    (4)    This scalar is then saved to the same file which was opened.

------------------------------------------------

#!/usr/bin/perl

print 'What is the directory where your sequence files are located (be
precise!)';
my $dirs = <STDIN>;                 #STDIN is user input
chomp $dirs;                        #chomp removes a new line character at end of
                                    #the variable
my $string = $dirs;                 #the directory name is saved as “$string”
opendir (DIR, $string) or print "FAIL"; #opens the directory or indicates
                                                     #failure by printing “fail” to user
                                                     #interface
         foreach (readdir DIR) {                     #a loop which runs through all files
                  if ($_ =~ m/.seq/) {               #in the directory selecting for
                  $_ = $string . $_;                 #.seq files
                  open(FILE,"$_") or print ($_, "\n");             #opens all .seq files
                  my @cow = <FILE>;                  #contents of file saved into @cow array
                  my $sow = join('', @cow); #array converted into scalar
                  $sow =~ s/(\n)//g; #substitute no space for new line characters
                  close FILE;                #file is closed
                  open (STYLE, ">$_") or die;               #same file name is opened and
                  select STYLE;                             #contents of $sow variable are
                  print $sow;                #written to file
                  }}
exit;
------------------------------------------------

NNNNNNNNNNNNNNTCCCNTCTANANNATTTTGTTTAACTTTAAGAAGGAGATATACCATGGGCAGCAGCCATCATCA
TCATCATCACAGCAGCGGCGAAAACCTGTACTTCCAGGGTCATATGATTGAAGTCAAACCAATAAACGCGGAAGATAC
GTATGACATCAGGCACCGCATTCTCCGGCCGAATCAGCCGCTTGAAGCAAGTATGTATGAAACCGATTTGACCGGGAG
TGCGTTTCACCTCGGTGGATTTTACCGGGGCAAGCTGATCAGCATCGCTTCCTTTCATAAAGCCGAACATTCAGAGCTTC
AAGGCGAAGAACAGTATCAGCTGAGAGGGGTGGCGACGCTTGAAGGATACCGTGAGCAAAAAGCGGGAAGCACGCTC
GTCAAGCATGCCGAAGAGATTCTTCGGAAAAAGGGGGCAGACCTTATATGGAGCAATGCCAGGACATCTGTGAGCGGC
TACTATGAAAAGCTCGGCTTCAGCGAACAGGGCGAAGTCTACGACACACCGCCGATCGGACCTCATATTTTGATGTATA
AGAAATTGACGTAATCTCGAGCACCACCACCACCACNNCTGAGNTNCNGNTGCTAACAAANNCNGAAAGGAAGCTGA
GTTGGCTGCTGCCACCGCTGANCAATANCTAGCANAACCCCCNTGGGNCTCTAAGCGGGTCTTGAGGGNTTTTTGCTGA
AAGAGGAACTATNNNNNGANNGNCNNNNGGGACGCGNCCTGNANNGCGCANNNANCGCGGCGGGTGTGGTGGTTAC
NCNCACGNGACCGCTNCACTTGGCAGCNNCCTANCGCCCGCTCCTTGCGCTTTCGTTCNNTTCCTTTCTCGCCGNNNNC
GCCGGCTTTCCCCGTNNAGCTNTNAATCGNGNGCTCGNTGNNAGGGTTCCGATTNATGCNNACGNCACCNNNACGCCA
NNAANTTGATNANGTNNNNNTTCAGTTANNGNCANCGGCNNNNNNACNNNNATCCGNCCTTGGGANNNNAANNCNAN
NTNCTNNAANAGGANNCNAGNTNCGAANCGNGGNNNNNNNCTNNNACGCANNNTGNACNNNANNNNTTTGGAANNN
TANANNNNNANNTTNNCNNNNNTNNNGGNNNNNNGGGGNNCAAANANNNNNNNN

Figure 2 – DNA sequence file with white space removed (after format.pl).

        After formatting the .seq files, we proceed by operating on all .seq files. We first
eliminate all sequence data before the „TGG‟ of the subsequence „CCATGG‟. „TGG‟ is
the “start codon”, that is, the part of the gene where translation into protein begins. We
then eliminate all sequence data following the “stop codon”, that is, the „TAA‟ codon of
the subsequence „TAATCTCGAG‟. The resulting DNA sequence is then translated into
protein sequence.

MGSSHHHHHHSSGENLYFQGHMIEVKPINAEDTYDIRHRILRPNQPLEASMYETDLTGSAFHLGGFYRGKLISIASFHKAEHS
ELQGEEQYQLRGVATLEGYREQKAGSTLVKHAEEILRKKGADLIWSNARTSVSGYYEKLGFSEQGEVYDTPPIGPHILMYKK
LT_

Figure 3 – The result of “translation”, namely, a protein sequence.

    The program dir.pl accomplishes the above by performing the following:
    (1) Prompts the user for the directory where all sequence files are stored.
    (2) Opens this directory and, for each .seq file in the directory, follows a somewhat
        complicated procedure which cuts off all sequence prior to the start codon and
        after the stop codon.
    (3) Applies a translation subroutine to the resulting shortened DNA sequence.
    (4) Saves the translated product into a .txt file with the same name as the .seq file.

------------------------------------------------

#!/usr/bin/perl -w
                     #dir.pl
print 'What is the directory where your sequence files are located (be
precise!)';          #prompts user for directory name

$dirs = <STDIN>;    #saves input to $dirs variable
chomp $dirs;        #cuts off the trailing new line from user input
$string = $dirs;
opendir (DIR, $string);    #opens the directory specified by the user
@a = readdir DIR;          #saves all files from the directory into the @a array
chomp @a;                  #cuts trailing new lines out of array

foreach (@a)               {  #runs a loop over all files in the user specified
                              #directory which are .seq files
         if ($_ =~ m/.seq/) {        #$_ is the default scalar which runs through
         $wow = $_;           #the @a array
         open (FILE, $_);     #opens each .seq file with a “filehandle”
         @sequence = <FILE>;         #saves the .seq file contents into an array
         chomp @sequence;            #cuts trailing new lines out of array
         $sequence = join ('', @sequence); #the array is converted to a scalar
         @seq = split ('CCATGG', $sequence);       #scalar split into two
      $seq[1] = 'ATGG' . $seq[1]; #element array, split at „CCATGG‟ substring
      $cutbeg = $seq[1]; #the sequence following the start codon is preceded
                           #by the actual start codon substring
      @cutend = split ('TAATCTCGAG', $cutbeg);         #the same procedure
                                         #is followed for the stop codon
      $cutend[0] = $cutend[0] . 'TAA';
      $gene = $cutend[0];
      my $protein = '';    #protein variable is set to empty string
      my $codon; #iterates over all three character substrings of the string
                    #held in the $gene scalar
                    #the $protein scalar is appended with the translation of
                    #each three character substring until we run out of
                    #characters
             for(my $i=0; $i < (length($gene) - 2) ; $i += 3){
             $codon = substr($gene,$i,3);
             $protein .= codon2aa($codon);
             }
      @sow = split ('', $wow);
      pop @sow;
      pop @sow;            #this code essentially pops off the .seq from the
      pop @sow;            #filename and adds .txt then saves $protein to
      pop @sow;            #a .txt file with the same name as the original
      $wow = join ('', @sow);     #.seq file
      $wow = $wow . ".txt";
      $add = ">$wow";
      open (WILE, $add);
      select WILE;
      print $protein;
      close WILE;
      }

close FILE;
}

closedir DIR;
exit;

sub codon2aa {                   #this is the translation subroutine
  my($codon) = @_;

    if ( $codon =~ /GC./i)     { return 'A' } # Alanine
  elsif ( $codon =~ /TG[TC]/i)    { return 'C' } # Cysteine
  elsif ( $codon =~ /GA[TC]/i)    { return 'D' } # Aspartic Acid
  elsif ( $codon =~ /GA[AG]/i)     { return 'E' } # Glutamic Acid
  elsif ( $codon =~ /TT[TC]/i)    { return 'F' } # Phenylalanine
  elsif ( $codon =~ /GG./i)     { return 'G' } # Glycine
   elsif ( $codon =~ /CA[TC]/i)   { return 'H' } # Histidine
   elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine
   elsif ( $codon =~ /AA[AG]/i)   { return 'K' } # Lysine
   elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine
   elsif ( $codon =~ /ATG/i)     { return 'M' } # Methionine
   elsif ( $codon =~ /AA[TC]/i)   { return 'N' } # Asparagine
   elsif ( $codon =~ /CC./i)    { return 'P' } # Proline
   elsif ( $codon =~ /CA[AG]/i)   { return 'Q' } # Glutamine
   elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine
   elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine
   elsif ( $codon =~ /AC./i)    { return 'T' } # Threonine
   elsif ( $codon =~ /GT./i)    { return 'V' } # Valine
   elsif ( $codon =~ /TGG/i)     { return 'W' } # Tryptophan
   elsif ( $codon =~ /TA[TC]/i)   { return 'Y' } # Tyrosine
   elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop
   else { return '!'}

}
------------------------------------------------


       After translation, we proceed by separating the protein files (.txt) into two
categories. The first category contains all .txt files that are „well translated‟, that is, they
have no ! characters and only one _ character. This is accomplished by the dirhelp.pl
program which proceeds, roughly, as follows:
    (1) Prompts the user for directory where protein files are located.
    (2) Creates a subdirectory called “badcodons” in the directory where the protein
        sequence files are located.
    (3) Looks through all the .txt files in directory, if it finds a protein sequence with the
        ! character or multiple _ characters, the protein sequence file is moved into the
        „badcodons‟ directory. The DNA sequence file which the protein sequence was
        derived from is all moved into the „badcodons‟ subdirectory. These files are
        deleted from the parent directory.

------------------------------------------------

#!/usr/bin/perl

print 'What is the directory where your sequence files are located (be
precise!)';

$dirs = <STDIN>;           #prompts the user for directory name and saves this name
chomp $dirs;               #into the $dirs variable.
$string = $dirs;

opendir (DIR, $string);             #opens the directory which the user selected
@a = readdir DIR;
chomp @a;

$x = 1;

chdir $string or die;       #changes the working directory to that which the
mkdir "badcodons",0755;                    #user entered, creates the „badcodons‟
                            #subdirectory in this directory
       foreach $gert (@a) {         #this loop iterates over all .txt files in the
                                           #directory entered by the user
              if ($gert =~ /.txt/) {
              $hurt = $gert;               #the protein sequence filename is in
              $hurt =~ s/txt/seq/g;        #$gert variable while the corresponding
                                           #DNA sequence filename is in the $hurt
                                           #variable
              open (FILE, $gert) or die;
              open (HILE, $hurt) or die; #opens both of the files listed above
              $gertcont = <FILE>;
              $hurtcont = <HILE>;          #saves the contents of these files to
              close FILE;                  #variables and closes the filehandles
              close HILE;

                           if ($gertcont =~ m/!|(_.*_)/) { #if the ! character or
                                                #more than one _ characters are found
                           chdir ($string . 'badcodons') or die;    #in the variable
                                  #corresponding to the file, then this file is
                                  #transferred to badcodons subdirectory as is DNA
                                  #the DNA sequence file
                           open(GFILE, ">$gert") or die;
                           select GFILE;
                           print $gertcont;
                           close GFILE;

                           open(HFILE, ">$hurt") or die;
                           select HFILE;
                           print $hurtcont;
                           close HFILE;

                           chdir $string or die;     #and the original files in the
                           unlink $gert or die;      #user selected directory are
                           unlink $hurt or die;      #deleted
                           }
                  }
         }
exit;
------------------------------------------------
        After moving all the poorly translated protein (and corresponding DNA)
sequences into the „badcodons‟ subdirectory, we move all the well translated protein (and
corresponding DNA) sequences into a „goodcodons‟ subdirectory with a program called
aftdirhelp.pl.

------------------------------------------------

#!/usr/bin/perl

#<aftdirhelp.pl>
#This program takes what is left in the (textfile, sequence directory)
#and it moves these good codon files into a subfolder of the above
#directory
#so now we have all the bad codon files in badcodons subdirectory
#and all the good codon files in the goodcodons subdirectory
#note all these files have been unlinked from the original parent
#directory

print 'What is the directory where your sequence files are located (be
precise!)';

$dirs = <STDIN>;           #prompts user for name of parent directory where
chomp $dirs;               #sequence files were originally located
$string = $dirs;

chdir $string or die;               #changes working directory to the user entered
mkdir 'goodcodons';                 #directory
                                    #creates a „goodcodons‟ subdirectory
opendir (DIR, $string);
@a = readdir DIR;
chomp @a;

foreach $bowow (@a){
      if ($bowow =~ m/.txt|.seq/) {                 #moves all .txt or .seq files in the
      open(FILE, $bowow);                           #parent directory into the goodcodons
      $contents = <FILE>;                           #subdirectory
      close FILE;
      chdir ($string . 'goodcodon/') or            die;
      open(NEWFILE, ">$bowow");
      select NEWFILE;
      print $contents;
      close NEWFILE;
      chdir $string or die;
      unlink $bowow;
         }
}

exit;
------------------------------------------------


       We proceed to operate on the goodcodons subdirectory, with a program called
separate.pl. This program separates all protein sequence files (.txt) files into a
subdirectory called proteins.

------------------------------------------------

#!/usr/bin/perl

print 'What is the directory where your good codon files are located (be
precise!)';   #prompts for name of „goodcodon‟ directory

$dirs = <STDIN>;
chomp $dirs;
$string = $dirs;

chdir $string or die; #changes working directory to the goodcodons directory
mkdir 'proteins';     #creates a subdirectory called proteins

opendir (DIR, $string);
@a = readdir DIR; #a loop which iterates over all files in the goodcodons
chomp @a;           #parent directory and selects for protein sequence (.txt)
                    #files. These files are moved to the proteins subdirectory
foreach $protein (@a){     #and the original file in the goodcodons subdirectory
      if ($protein =~ m/.txt/){ #is unlinked (deleted).
      open(FILE, $protein);
      $protcont = <FILE>;
      close FILE;
      unlink $protein;

         chdir ($string . 'proteins/') or die;
         open(NEWFILE, ">$protein");
         select NEWFILE;
         print $protcont;
         close NEWFILE;
         chdir $string or die;
         }
}

exit;
------------------------------------------------

        Now we proceed to operate on the well translated protein files, examining each
amino acid position (168 positions) in order to see whether the amino acid is conserved at
that position or whether the amino acid varies at that position. We created a program
called align.pl to accomplish this task. It operates as follows:
    (1) Prompts the user for the directory name where the well translated protein files are
        located.
    (2) Changes the working directory to the directory mentioned above.
    (3) Opens the directory and reads all files in it, taking from each file the amino acid
        at a certain position.
    (4) The amino acid is entered into a string made up of nth position amino acids.
    (5) This string is then printed so the user can see all amino acids at the nth position in
        all protein sequence files.


------------------------------------------------

#!/usr/bin/perl

print 'What is the directory where your good codon protein sequence files are
located (be precise!)';
print "\n", 'is it c:\Documents and
Settings/Matt/Desktop/HT/HTsequences/goodcodons/proteins/', "\n?";
        #prompts the user for goodcondons/proteins directory
$dirs = <STDIN>;
chomp $dirs;
$directory = $dirs;

chdir $directory or die;   #changes the working directory to the proteins
                           #directory
opendir (DIR, $directory) or die; #opens the proteins directory and reads
@allfiles = readdir DIR;          #through all files
chomp @allfiles;

$i = 0;
@sequence = ();            #creates an empty list called @sequence

         foreach $protein (@allfiles){             #a loop which iterates over all
                                                   #files in the proteins
                                                   #directory
                  if($protein =~ m/.txt/){         #a condition selects for .txt (that is)
                                                   #protein sequence files
                  open(FILE, $protein);            #the protein sequence file is opened
                  $contents = <FILE>;              #and saved to a scalar variable
                  close FILE;                      #the protein sequence file is closed
             push (@sequence, $contents);    #adds the $contents
             }           #variable onto the end of the @sequence list

      }             #@sequence is a list with each element being a protein
                    #sequence file

@array = ();         #create a new list called @array
foreach $i (0...(length($contents)-1)){ #a loop which runs over all positions of
                                         #the protein sequence

      foreach $aa (@sequence) {         #a nested loop which runs over all
      $array[$i] .= substr($aa, $i, 1); #elements of the @sequence list
      }                                 #such all amino acids at a particular
}                                       #position are saved into the @array list
                                        #for example the n position amino
                                        #acids (from all the sequence files) are
                           #saved into the n position of the @array list

#@array[0-168] now holds a scalar with all the positional aas

$count = 0;
@ident = ();
@notident = ();
foreach $position (@array){      #a loop iterates over all elements of
                                 #@array list
      $posaa= substr($position, 1, 1);

             if ($position =~ m/[^"$posaa"]/) { #selects for elements of the
                            #@array list which are not all comprised of the
                            #same amino acid
             push(@notident, $count); #adds the aa position to @notident list
             push(@notident, $position);        #adds the n position aas to list
             }

             else {
             push(@ident, $count);             #same for position with no
             push(@ident, $position);          #amino acid variation
             }
$count++;
}

print "The identicals\n", "@ident";
print "The non-identicals\n", "@notident";
                                  #prints something that looks
      #like “The identicals
      #1 MMMMMMMMMMMMMMMMM 2 NNNNNNNNNNNN 5 GGGGGGGGGG etc.
         # “The non-identicals
         # 3 HHHHHHHHHHRHHHRR 4 TTTTSSTSTSSSTTSS etc.
exit;
------------------------------------------------

         Of course, a more thorough analysis of positional amino acid variation could (and
will) be made. It is also obvious to see that any amino acid insertion or deletion will
render the align.pl program useless. Fortunately, however, in the library of sequence
files that we analyzed (approximately 85 sequence files) none of the amino acid
sequences had insertions or deletions. Further sequence analysis will indicate whether
this is a general occurrence. If so, more complicated alignment algorithms (such as
BLAST) will not be needed.
         Furthermore, it is imperative to mention that a significant fraction of the protein
sequence files were not „well translated‟, that is, they contained the ! character or
multiple stop codons. This is probably due to insertion or deletion of nucleotides in the
DNA sequence. Further analysis, will reveal the source of „poor translation‟ and more
complicated algorithms will be developed to deal with the „badcodons‟ protein
sequences.

								
To top