EECS 368                                                   Perl Project: 120 points
February 27, 2008                                          Due date: see last page

                      Object-Oriented Flat File DB Interface
This Perl program finds and extracts the segment number or the set of patterns (the
lexicon) for a given x-coordinate (the nucleotide number) of a chromosome (see
definitions below). The project introduces Perl modules and emphasizes encapsulation,
the use of parameters to functions, and program structure in general. The goal is to
create a simple interface to a flat file database: a collection of flat files containing
text information about segments of chromosomes (long text strings). Here is a picture of
the file structure:




In our project, there is only one species (Arabidopsis thaliana) and five chromosomes. In
~twclark/class/perlproj/etc there is an index file for each chromosome. Please ignore
lexiconfiles.txt.




The data files (*.out) in the directories contain a pattern-based description of segments
of the chromosome. The representation for the chromosome has the hierarchy:
chromosome, segment, patterns.

An analogy: suppose a book has two sections and 15 chapters. The picture above
represents the 7 chapters in section 1 of the book (the 8 chapters in section 2 are not
shown). Chapter 2 starts on page 12, Chapter 3 starts on page 30, and so on. Suppose the
words for each chapter are organized together and stored in a database.

Our data: the book is Arabidopsis (one species), and there are 5 sections (rather than
the 2 sections of the book), each of which is a chromosome. The 7 chapters in the diagram
correspond to 7 segments in a hypothetical chromosome. The x-coordinate is the
nucleotide number (the page number in the book analogy), and the segment is the chapter.

The program you write will retrieve the lexicon for a given segment, where a lexicon is
just a list of words and their frequencies. Here, the lexicon is made and stored for each
segment. Each segment covers a contiguous range of nucleotides (as a chapter covers a
contiguous range of pages). The data files are at
~twclark/class/perlproj/arabidopsis/chr<n>/out

The files in the “out” directory are named p<n>.out. They contain blocks of data, one for
each segment, but are not in segment order (and some segments might be missing).
Here is an example for segment 47 with x-coordinate range 66199 through 66434 (the
last number can be ignored). (The .txt extension to 00047 is not meaningful to this
project, but could be useful in a regular expression.)

   >atchr2xx000047.txt 66199:66434 0.602000
   #sequence
   CGAACACTTTTCTCGATCCATCCCATCCGACGATCAGTCTGTCCGACCCGATCCGTCAGACG
   ATCGGCCTGTCCCATCCGACCCGT
   CCGACGATCGGTCTGTCCGATCCGTCTGGCCGATCCGATCCGTCTGACGATCGGTCTGGCCA
   ATCTGATCCAACGATCGGTCTACC
   AGATCCGATTCGTCTCTAAGACGAACGGTTTACCCTCTCCATCCTCTACACTCCATCGAACC
   GT
   #lexiconFreq
   C                        36     0.285714
   A                        22     0.174603
   T                        14     0.111111
   G                        12     0.0952381
   CGATC                     10     0.0793651
   CGA                       8      0.0634921
   GTCT                      7      0.0555556
   GTC                       5      0.0396825
   TCT                      4      0.031746
   GA                       3      0.0238095
   TCC                      3      0.0238095
   GATC                     2      0.015873
   #lexiconAlpha
   A                        22     0.174603
   C                        36     0.285714
   CGA                      8      0.0634921
   CGATC                    10     0.0793651
   G                        12     0.0952381
   GA                       3      0.0238095
   GATC                     2      0.015873
   GTC                      5      0.0396825
   GTCT                     7      0.0555556
   T                        14     0.111111
   TCC                      3      0.0238095
   TCT                      4      0.031746
   #parse
   CGA-A-C-A-C-T-T-T-TCT-CGATC-C-A-TCC-C-A-T-C-CGA-CGATC-A-GTCT-
   GTC-CGA-C-C-CGATC-C-GTC-A
   -GA-CGATC-G-G-C-C-T-GTC-C-C-A-T-C-CGA-C-C-C-GTC-CGA-CGATC-G-
   GTCT-GTC-CGATC-C-GTCT-G-G-
   C-CGATC-CGATC-C-GTCT-GA-CGATC-G-GTCT-G-G-C-C-A-A-TCT-GATC-C-A-
   A-CGATC-G-GTCT-A-C-C-A-G
   ATC-CGA-T-T-C-GTCT-C-T-A-A-GA-CGA-A-C-G-G-T-T-T-A-C-C-C-TCT-C-
   C-A-TCC-TCT-A-C-A-C-TCC-
   A-T-CGA-A-C-C-G-T
   > atchr2xx000060.txt 80124:81473 0.602000




    Etc.

The assignment is to write a program that extracts the lexicon from the
lexiconFreq section (in the segment 47 example above, the entries listed under
#lexiconFreq) for a given segment number, chromosome number, and species (we will
only use Arabidopsis in this project).
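
As an illustration of the extraction step, here is a minimal sketch that pulls the #lexiconFreq entries out of the text of one segment block. The function name parse_lexicon_freq and the [word, count, freq] layout are assumptions made for this sketch, not part of the assignment, and the sample data is an abbreviated copy of the segment 47 example with a truncated sequence line.

```perl
use strict;
use warnings;

# Hypothetical helper: given the text of one segment block from a p<n>.out
# file, return the entries of its #lexiconFreq section as [word, count, freq].
sub parse_lexicon_freq {
    my ($block) = @_;
    my @lexicon;
    my $in_section = 0;
    for my $line (split /\n/, $block) {
        if ($line =~ /^\s*#(\w+)/) {
            # A "#name" line starts a new section; only #lexiconFreq matters.
            $in_section = ($1 eq 'lexiconFreq');
            next;
        }
        next unless $in_section;
        # Entry lines look like: word  count  frequency
        if ($line =~ /^\s*([ACGT]+)\s+(\d+)\s+([\d.]+)\s*$/) {
            push @lexicon, [$1, $2, $3];
        }
    }
    return @lexicon;
}

# Abbreviated sample data in the layout shown above.
my $sample = <<'END';
>atchr2xx000047.txt 66199:66434 0.602000
#sequence
CGAACACTTTTCTCGATCC
#lexiconFreq
C                        36     0.285714
A                        22     0.174603
GATC                     2      0.015873
#lexiconAlpha
A                        22     0.174603
END

my @lex = parse_lexicon_freq($sample);
printf "%s %s %s\n", @$_ for @lex;
```

The sketch only handles a single block; locating the block for the requested segment inside a p<n>.out file is a separate step.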

The parts of the project are:

A. Implement the program backend.pl with the functionality described below, placing
the subroutines that access the data in an object-oriented module, LexiconDB.pm. This
module will contain two public functions: (i) one function returns the lexicon for a species,
chromosome number, and coordinate number (nucleotide number); (ii) the other function
returns the segment number for a species, chromosome, and coordinate. Everything
outside the public interface (the interface used by backend.pl) is to be implemented as
private functions in the module. The goals are to encapsulate related functionality, support reuse by other
programs, and enforce a public interface to the data and functionality. This module and
backend.pl will be used again during the course of the semester.

B. Write and document a test program to check the two public functions in
LexiconDB.pm by running backend.pl from the test program with the appropriate
command-line arguments, described below.

C. PLEASE SUBMIT BY EMAIL IN A TAR FILE to eecs368@ittc.ku.edu
      a. Copies of your source code, LexiconDB.pm, backend.pl, and test.pl
      b. External documentation
      c. Sample output

The Flatfile Database
The data are organized by chromosome in directories, each directory with numerous
files as shown on page 1. In addition, for each chromosome there is one index file (GI
file) in the etc directory to assist in finding files and answering queries. These GI files are
          arabidopsis_chr1.gi
          arabidopsis_chr2.gi
          arabidopsis_chr3.gi
          arabidopsis_chr4.gi
          arabidopsis_chr5.gi
To service a query for a chromosome, your program should open the appropriate GI file
and read its data into a Perl data structure of your design (a hash comes to mind). Each
line of the GI file has information for one chromosome segment. For example, here are
two consecutive lines from arabidopsis_chr1.gi:

arabidopsis      chr1    000017 37619 37929 ./arabidopsis/chr1/out/p3.out 1
arabidopsis      chr1    000018 37930 46569 ./arabidopsis/chr1/out/p42.out 1

Each line gives the species, the chromosome number, the segment number, the low x
coordinate, the high x coordinate, the location and name of the output file with this
coordinate range, and a final integer that is to be ignored.
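
One possible way to turn a GI line into a Perl data structure is sketched below. The helper name parse_gi_line and the hash keys are chosen for this sketch only; the handout leaves the design of the data structure to you.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GI-file line into a hash of named fields.
# The field names are chosen for this sketch; the handout does not name them.
sub parse_gi_line {
    my ($line) = @_;
    # split ' ' skips leading whitespace and splits on runs of spaces/tabs.
    my ($species, $chr, $segment, $low, $high, $file) = split ' ', $line;
    return unless defined $file;    # skip blank or short lines
    return {
        species => $species,
        chr     => $chr,
        segment => $segment,        # kept as a string to preserve leading zeros
        low     => $low,
        high    => $high,
        file    => $file,
    };
}

my $rec = parse_gi_line(
    'arabidopsis      chr1    000017 37619 37929 ./arabidopsis/chr1/out/p3.out 1');
print "$rec->{segment} covers $rec->{low}..$rec->{high} in $rec->{file}\n";
```

The seventh field is dropped on purpose, since it is to be ignored.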

Thus, to service a query: i) read the GI file and put it in a Perl data structure; ii) find the
appropriate line based on the query's segment number or x-coordinate; iii) either return
the segment id or, if the lexicon is requested, continue processing; iv) open the indicated
p<n>.out file; v) retrieve and return the lexicon from it for the required segment number.
At start up, backend.pl has the location of the GI files. All file access methods and
data structures should be in the object-oriented module used by backend.pl.
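
Step ii) can be sketched as a simple range lookup. The example below assumes the GI lines have already been read into an array of hash references with the hypothetical keys segment, low, high, and file; the function name find_segment is likewise an assumption. A linear scan is used for clarity, not because it is the required design.

```perl
use strict;
use warnings;

# Hypothetical lookup: return the record whose [low, high] range contains $x,
# or undef when no listed segment covers that coordinate.
sub find_segment {
    my ($records, $x) = @_;
    for my $rec (@$records) {
        return $rec if $x >= $rec->{low} && $x <= $rec->{high};
    }
    return;
}

# Records built by hand from the two sample GI lines for arabidopsis chr1.
my @records = (
    { segment => '000017', low => 37619, high => 37929,
      file => './arabidopsis/chr1/out/p3.out' },
    { segment => '000018', low => 37930, high => 46569,
      file => './arabidopsis/chr1/out/p42.out' },
);

my $hit = find_segment(\@records, 40000);
print defined $hit ? "segment $hit->{segment}\n" : "no segment\n";
```

An undefined result is the natural place to produce one of the coordinate error codes.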

Queries should not take more than a couple of seconds.

BACKEND.PL
Backend.pl provides an interface to the flatfile database through the command-line
arguments given to it. I.e., another program runs backend.pl to make a query to the
flatfile database through the command line arguments. An example query is
          Shell> backend.pl 1 arabidopsis chr2 1909
This query requests the lexicon (function code 1) for the segment containing nucleotide
1909 in chromosome 2. There are two function codes that the program will support,
function codes 1 and 2. These are described in the next paragraph.

The program backend.pl supports two functions for the flatfile database. These functions
and the corresponding command line arguments to backend.pl are:
1. Get Lexicon For X Coordinate
              input: function code, species, chromosome, coordinate
                           1       <species>      <chr<n>>         <x>
              output to STDOUT is the lexicon in descending frequency as follows:
error_code error_string number_words word_1 freq_1 count_1 … word_n freq_n count_n

2. Convert X Coordinate To Segment Id
           input: function code, species, chromosome, coordinate
                          2     <species>    <chr<n>>      <x>
           output to STDOUT contains the segment id padded with zeroes to length 5.
                   error_code error_string segment id

note: output string fields are delimited with one or more space or tab characters.
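
The two output formats could be produced along the following lines. This is a sketch under assumptions: the function names are hypothetical, tabs are used as the delimiter, and each lexicon entry is assumed to be stored as [word, count, freq] (the order in the data file), so it is reordered to word, freq, count on output.

```perl
use strict;
use warnings;

# Hypothetical formatter for function 1: error code, error string, word count,
# then word/frequency/count triples, joined with tabs.
sub format_lexicon_reply {
    my ($code, $message, @entries) = @_;    # each entry: [word, count, freq]
    my @fields = ($code, $message, scalar @entries);
    push @fields, $_->[0], $_->[2], $_->[1] for @entries;
    return join("\t", @fields) . "\n";
}

# Hypothetical formatter for function 2: segment id zero-padded to width 5.
sub format_segment_reply {
    my ($code, $message, $segment) = @_;
    return join("\t", $code, $message, sprintf('%05d', $segment)) . "\n";
}

print format_lexicon_reply(0, 'normal', ['C', 36, '0.285714'], ['A', 22, '0.174603']);
print format_segment_reply(0, 'normal', 47);
```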

For example, one can invoke backend.pl as
       backend.pl 1 arabidopsis chr2 18837
thereby requesting the lexicon for chromosome 2 of Arabidopsis at position 18837.

backend.pl hard-codes one directory: the directory in which the etc directory
is located (see page 1). backend.pl is the interface between the user program (client)
and the database, and it uses methods in the object-oriented module LexiconDB.pm to
implement the two functions above. LexiconDB.pm can use canonical Perl objects or
inside-out objects.
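
For readers unfamiliar with inside-out objects, here is a minimal sketch of the technique. The class name SketchDB and the attribute etc_dir are invented for illustration and are not a prescribed design for LexiconDB.pm. The object itself is an empty blessed scalar; its data lives in a lexical hash keyed by the object's address.

```perl
use strict;
use warnings;

{
    package SketchDB;    # hypothetical class name for this sketch
    use Scalar::Util qw(refaddr);

    # Inside-out storage: one lexical hash per attribute, keyed by the
    # object's address. Code outside this block cannot see the hash.
    my %etc_dir_of;

    sub new {
        my ($class, %args) = @_;
        my $self = bless \do { my $anon }, $class;    # object holds no data
        $etc_dir_of{ refaddr($self) } = $args{etc_dir};
        return $self;
    }

    sub etc_dir {
        my ($self) = @_;
        return $etc_dir_of{ refaddr($self) };
    }

    # Drop the attribute entry when the object goes away.
    sub DESTROY {
        my ($self) = @_;
        delete $etc_dir_of{ refaddr($self) };
    }
}

my $db = SketchDB->new(etc_dir => './etc');
print $db->etc_dir, "\n";
```

With a canonical hash-based object, a caller could read or change $db->{etc_dir} directly; here the attribute is reachable only through the accessor.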

Client that uses the backend.pl server
The client is another program that invokes backend.pl. The client calls backend.pl with
command line arguments to direct backend.pl to use one of two “public” functions; the
client in turn receives output from backend.pl on STDOUT. In this project, test.pl will test
the client interface by invoking backend.pl with the appropriate strings and reading the
return values. The test program can be short and is of your design. For example, the client can run
         my $command = "backend.pl 2 arabidopsis chr1 1001";
         open(CX, "$command |") or die "$command did not work: $!";
causing the STDOUT of backend.pl to be available through the file handle CX (on the EECS network, see
~twclark/class/examples/pipedopen.pl ).
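
A slightly fuller sketch of a client reading a reply through a pipe is shown below. So that it runs anywhere, it does not call backend.pl; a one-line Perl program stands in for it and prints a reply in the function 2 format. The helper name query_fields is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command, read its first line of STDOUT, and split
# it into whitespace-delimited fields.
sub query_fields {
    my (@command) = @_;
    open(my $cx, '-|', @command) or die "cannot run @command: $!";
    my $reply = <$cx>;
    close($cx);
    return () unless defined $reply;
    return split ' ', $reply;
}

# Stand-in for backend.pl: a one-line Perl program that prints a reply in the
# function-2 format. $^X is the path of the running perl.
my @stand_in = ($^X, '-e', 'print "0\tnormal\t00047\n"');

my ($code, $message, $segment) = query_fields(@stand_in);
print "code=$code message=$message segment=$segment\n";
```

The sketch uses the three-argument form of open with a lexical file handle, which avoids passing the command through the shell. Splitting on whitespace only works here because the error string is a single word.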


Example and Overview

The program backend.pl (the server) gets its input from the client in @ARGV; the client is
whatever runs backend.pl. The functions that access the flatfile database are
implemented in the Perl module LexiconDB.pm. Public functions will be called by
backend.pl (see BACKEND.PL above); other functions are to be private. The program will
enforce the public/private distinction through language features.
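
One language feature that can enforce privacy is a lexically scoped code reference, sketched below. The package name SketchModule and both function names are invented for this illustration; this is one option, not the required approach.

```perl
use strict;
use warnings;

{
    package SketchModule;    # hypothetical module name for this sketch

    # Private: a lexical code reference is invisible outside this block,
    # so callers cannot reach it by name.
    my $pad_segment = sub {
        my ($segment) = @_;
        return sprintf('%05d', $segment);
    };

    # Public: an ordinary named sub, callable as SketchModule::segment_id(...).
    sub segment_id {
        my ($segment) = @_;
        return $pad_segment->($segment);
    }
}

print SketchModule::segment_id(47), "\n";
print SketchModule->can('pad_segment') ? "reachable\n" : "not reachable\n";
```

A leading underscore on a sub name, by contrast, is only a naming convention: Perl does not stop outside code from calling such a sub.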

For example, the program backend.pl may begin as follows (initialization not shown):
if (@ARGV < 4) {
   print STDERR "usage: backend.pl <func> <species> <chr> <x>\n";
   print STDOUT "6\tincorrect argument string\n";   # for caller
   exit(1);
}
# Argument order follows the command-line format described above.
my $func    = $ARGV[0];
my $species = $ARGV[1];
my $chr     = $ARGV[2];
my $coord   = $ARGV[3];

if ($func eq '1') {
      GetLexiconForXCoordinate( … );
}
elsif ($func eq '2') {
      ConvertXCoordinateToSegmentId( … );
}
else {
   # unsupported function call; prepare an error for caller.
}

Error codes returned by the two functions above, passed from backend.pl to the
client (here, test.pl):

0: normal
1: cannot locate gi file
2: chromosome not in lexiconfiles.dat
3: species not in lexiconfiles.dat
4: invalid coordinate
5: coordinate segment missing
6: incorrect arguments to backend.pl
        Others may be added as needed.
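
The codes above fit naturally in a hash. The sketch below is one way to do this. The helper name error_prefix and the fallback text "unknown error" are assumptions of the sketch, not part of the specification.

```perl
use strict;
use warnings;

# Error codes and strings as listed above.
my %ERROR_STRING = (
    0 => 'normal',
    1 => 'cannot locate gi file',
    2 => 'chromosome not in lexiconfiles.dat',
    3 => 'species not in lexiconfiles.dat',
    4 => 'invalid coordinate',
    5 => 'coordinate segment missing',
    6 => 'incorrect arguments to backend.pl',
);

# Hypothetical helper: build the leading "code<TAB>string" part of a reply.
# The fallback text for unlisted codes is an assumption of this sketch.
sub error_prefix {
    my ($code) = @_;
    my $text = exists $ERROR_STRING{$code} ? $ERROR_STRING{$code} : 'unknown error';
    return "$code\t$text";
}

print error_prefix(4), "\n";
```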




Due Dates and Credit

March 14              Description of your design (10 points)
March 24              Completed project: 120 points (includes design points)
Extra credit          +20 points for use of inside-out objects
Late submission       -30 points, and a maximum of 10 points for extra credit


Enjoy!



