Introduction to UNIX and Perl by gqc20907


									                                                                                                 Definitions


                                                                Operating System
       Introduction to UNIX and Perl                               • provides a uniform interface between a computer’s
                                                                   hardware and user-level programs.
                                                                   • manages the low-level functionality of the hardware
                                                                   automatically.
                      Todd Scheetz
                                                                Programming Language
        Computational Methods in Molecular Biology
                                                                   • provides a formal structure/syntax for implementing
                      Feb. 18, 2003                                algorithmic procedures.




                   What is UNIX?                                               What is UNIX? (part 2)

Operating system developed at Bell Labs.                        Made available with source code at no cost
   • originally written in assembly code                          • could fix bugs, add features or just test alternative methods
   • the C programming language was designed to implement         • EXCELLENT for learning or teaching
   a more portable version of UNIX
                                                                Adopted by Berkeley to make BSD
Multi-user                                                         • virtual memory
Multi-tasking                                                      • paging
                                                                   • networking (TCP/IP)




                What is UNIX? (part 3)                                                   UNIX hierarchy
  By programmers, for programmers                                UNIX layers, top to bottom:
  • extensive facilities to allow people to work together and       Users
  share information in controlled ways                            --- User i/f ---
  • time sharing system                                             Std. Utility Programs (shell, editor, compiler)
                                                                  --- Library i/f ---
                                                                    Standard Libr. (open, close, fork, read, print, etc.)
  Basic Guidelines                                                --- System call i/f ---
     • Principle of least surprise                                  UNIX O/S (process mgmt, memory mgmt, file system, I/O, etc.)
     • every program should do one thing and do it well             Hardware (CPU, memory, disks, keyboard, etc.)


                                                                Adapted from Tanenbaum, p. 273




                       UNIX Basics                                                           UNIX Basics

User Accounts - required to log-on to the computer with             File Sharing - Regulated by three sets of permissions.
username and password.
                                                                        Permissions: read, write, execute
Groups - entity made up of one or more users.
                                                                        Subjects: owner, group, all                  -rwxr-xr-x      foo.pl
Sharing...                                                                                                           -r-xr-xr-x      bar.pl
                                                                     (table: R, W, X for User (u), Group (g), All (a))   -rw-------      secret
        Bob                      Diane                                                                               -rw-r--r--      public
                      Bill
                                  Mike
             Stacie
         group1              group2
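
The permission strings in the listing can be decoded mechanically. The following is a minimal Perl sketch, assuming the ten-character form printed by ls (one file-type character followed by three rwx triplets). The helper name parse_mode is hypothetical and is not part of the original slides.

```perl
use strict;
use warnings;

# Split an ls-style mode string such as "-rwxr-xr-x" into its three
# permission triplets. parse_mode is a hypothetical helper name.
sub parse_mode {
    my ($mode) = @_;
    my @subjects = ('owner', 'group', 'all');
    # Skip the leading file-type character, then take three characters at a time.
    my @triplets = substr($mode, 1) =~ m/(...)/g;
    my %perm;
    for my $i (0 .. 2) {
        $perm{ $subjects[$i] } = $triplets[$i];
    }
    return %perm;
}

my %foo    = parse_mode('-rwxr-xr-x');   # foo.pl from the listing
my %secret = parse_mode('-rw-------');   # secret from the listing

my $all_can_run_foo    = ($foo{all}      =~ /x/) ? 'yes' : 'no';
my $group_reads_secret = ($secret{group} =~ /r/) ? 'yes' : 'no';
print "all can execute foo.pl: $all_can_run_foo\n";
print "group can read secret: $group_reads_secret\n";
```

Each triplet is read left to right as read, write, execute; a dash means the permission is not granted to that subject.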




                       UNIX Basics                                                           UNIX Basics
                                                                       UNIX Filesystem Hierarchy
                                                                                                                /

      Super-user account
         complete access to all files
                                                                                    bin     dev       etc    lib     tmp    usr     var
      Required for system administration tasks
         add accounts/groups                                        Two shortcuts
         change permissions/owners of any file                       . - the current directory               bin      doc     lib         local
         change password of any account                              .. - the directory one level “up”
         shutdown a machine                                          /usr                                           bin     etc           lib     tmp
                                                                     /usr/bin
                                                                     /usr/local
                                                                     /usr/local/bin




                      What is UNIX?                                                        What is UNIX?
Processes                                                         grep - show every line from a file that matches a supplied pattern
                                                                      Ex. grep sub my_program.pl
Each program executes as a process                                    (would return every line in the file that contained the string ‘sub’)

A process provides encapsulation for the program                  ls - list files
                                                                       Ex. ls *.pl
Under UNIX, multiple processes can be running at the same time!        (would list all files in the current directory that end in ‘.pl’)

How to control processes:                                         head - list the first lines in a file
      ^C -- break                                                    Ex. head -20 my_program.pl
      ^Z -- stop                                                     (would show the first 20 lines from my_program.pl)
      & -- start in background
      ps -- show which processes are running                      sort - performs a lexical sorting of a file
      kill -- kill a process                                          Ex. sort my_program.pl




                        What is UNIX?                                       UNIX Basics
                                                          UNIX Command Summary

UNIX also provides a method for concatenating multiple    pwd - print working directory
programs together                                         cd - change directory
                                                          ls - list files
Pipes…                                                    mv - move a file (relocate/rename)
                                                          rm - remove a file
Ex.                                                       cp - copy a file
      head -20 *.pl | grep File | sort
                                                          mkdir - make a new directory
                        pipes                             rmdir - remove a directory
                                                          more - display the contents of a file (one screen at a time)

                                                          chmod - change the permissions on a file
                                                          chgrp - change the group associated with a file




                           UNIX Shell                                        UNIX Shell
        Shells                                           bash

        a.k.a. command interpreter                       prompt -- by default shows who you are, what machine the
        the primary user interface to UNIX               shell is running on, and what directory you are in.
        interpret and execute commands
                                                         PATH -- environment variable that defines where the shell
        1. Interactive use                               should look for the programs you are running.
        2. Customization of UNIX session (environment)      /bin
        3. programmability                                  /usr/bin
                                                            /usr/local/bin
        /bin/sh - Bourne shell                              /usr/X11R6/bin
        /bin/csh - C shell                                  /usr/sbin
        /bin/bash - Bourne again shell                      .
        /bin/tcsh - modified, updated C shell




                     Installing Software                               Mini-Tour of UNIX
        Pre-built vs. source
                                                            Go through the most common commands.
        RPM vs. “raw” binaries

        Process
           downloading
           extracting
           compiling
           installation
           configuration




                               Perl                                              Programming Languages
                                                                       Input/Output in Perl
   Basics of a Perl program under UNIX
                                                                       Reading in from the keyboard...
   Perl is an interpreted language
                                                                          $line = <STDIN>;
   The first line of a Perl program (in UNIX) is...
                                                                       Filehandles...
      #!/usr/bin/perl
                                                                       File:
   The # character is the comment character.
                                                                           open(FH,"filename");
                                                                           open(FH,">filename");
   All single-expression statements must end in a semi-colon.
                                                                           ...
       $area = $pi * $radius * $radius;
                                                                           $line = <FH>;
       while (CONDITION) {
                                                                           ...
           # some stuff
                                                                           close(FH);
       }
                                                                       DO HELLO WORLD WALK-THROUGH.
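
A minimal sketch of what the hello-world walk-through and the filehandle steps might look like together. It uses `use strict`, `my`, and a lexical filehandle, which are common modern style but differ from the bareword `FH` on the slide. An in-memory string stands in for a real file, purely so that the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Every single-expression statement ends in a semi-colon.
print "Hello, world!\n";

# An in-memory string stands in for a file on disk (assumption made only
# to keep this sketch runnable without external files).
my $text = "first line\nsecond line\n";

open(my $fh, '<', \$text) or die "Cannot open: $!";
my $line_count = 0;
while (my $line = <$fh>) {
    chomp $line;                      # strip the trailing newline
    $line_count++;
    print "$line_count: $line\n";
}
close($fh);
```

To read a real file, replace `\$text` with the file name; to read from the keyboard, read from `STDIN` instead of `$fh`.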




              Programming Languages                                              Programming Languages

Data Types                                                            Variables - Pieces of data stored within a program.
                                                                      (similar to variables in arithmetic)
   Integer - 0, 1, 2, …, 1000, 1001, …
   Floating Point - 0.0, 0.001, 0.0003, 3.14159265, …                 scalar variables are distinguished by the ‘$’ at their front.
   Character - a, b, c, d, …, 0, 1, 2, :, !, …
                                                                      Any name beginning with a letter is allowed
Different languages use different conventions. In Perl, a string is      $a
also a basic data type. A string is a sequence of 0 or more              $a1
characters.                                                              $alphabet_soup_is_OK_to_me




              Programming Languages                                              Programming Languages
               Arithmetic Operations                                              Arithmetic Operations
        Operator   Meaning                                            Numeric   String   Meaning
        +          Addition                                           ==        eq       Equality
        -          Subtraction                                        !=        ne       Inequality
        *          Multiplication                                     >                  Greater than
        /          Division                                           >=                 … or equal
        %          Modulo
        ++         Increment
        --         Decrement
        ||         Logical OR
        &&         Logical AND
        !          Logical Negation
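
The difference between the numeric and string comparison operators is easy to miss, so here is a short illustrative sketch. The variable names are chosen for this example only.

```perl
use strict;
use warnings;

my $sum       = 7 + 3;      # addition
my $remainder = 7 % 3;      # modulo: remainder after division
my $i = 5;
$i++;                       # increment: $i is now 6

# == compares values as numbers, eq compares them as strings.
my $numeric_equal = (10 == 10.0)       ? 1 : 0;
my $string_equal  = ("10" eq "10.0")   ? 1 : 0;

# Logical AND requires both conditions to hold.
my $both = ($sum > 5 && $remainder != 0) ? 1 : 0;

print "sum=$sum remainder=$remainder i=$i\n";
print "numeric_equal=$numeric_equal string_equal=$string_equal both=$both\n";
```

The two comparisons disagree because 10 and 10.0 are the same number but "10" and "10.0" are different strings.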




                 Programming Languages                                                Programming Languages
                       Statements
                                                                   Variable Types
  A program can be broken down into basic structures called
  statements. Statements are terminated by a semi-colon.
                                                                               Scalar - a single value
                                                                               Array - a list of values (indexed by sequential number)
      print "Hello, world!\n";
                                                                               Hash - a set of key,value pairs
  Assignment statements use a single '=' rather than the '==' of
                                                                   Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
  the equality operation.

      $pi = 3.1415926;                                                 Index      Value          Position         Value
      $area = $pi * $radius * $radius;                                 0          2              First            2
      $line = <STDIN>;                                                 1          3              Second           3
                                                                       2          5              Third            5
                                                                       3          7              Fourth           7
                                                                       …          …              …                …




                 Programming Languages                                                Programming Languages
                                                                   Hash - “associative array”
Arrays are good when the data is dense, and the algorithm uses        • array indices can be any unique set of “keys”
a linear access pattern.                                              • excellent for accessing in random patterns (in sparse data)
                                                                          (Ex. “is 19 a prime number?”)
Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
                                                                   0       1     2     3    4    5    6    7      8   9   10 11
0    1       2    3     4    5    6     7   8     9   10   11      1       2      3    5    7   11   13 17        19 23   29 31
      0      1     1    0    1    0     1    0    0   0    1
                                                                   2       3      5    7   11   13 17 19 23           29 31
0    1       2     3    4    5    6     7   8     9   10 11        1       1      1    1    1    1    1    1      1   1   1    1
2     3      5    7    11    13 17 19 23          29 31




                 Programming Languages                                                Programming Languages
    Scalar --
       $foo, $a1, $a2000                                             Hash --
                                                                        %hash, %env
    Array --                                                            to access the element with index of $i
       @array, @ii                                                           $hash{$i}
       to access the element at index $i
           $array[$i]                                                      to get a list of keys used in a hash
                                                                               @key_list = keys(%hash);
          the last index of an array is $#array
                                                                           to determine how many keys are in a hash
          the number of elements in an array is                                $num_elements = @key_list;
              $num_elements = $#array + 1;                                         OR
                 OR                                                            $num_elements = keys(%hash);
              $num_elements = @array;
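
The array and hash operations listed above can be tried out on the prime-number example. This is a small sketch; the names @primes and %is_prime are illustrative and do not come from the slides.

```perl
use strict;
use warnings;

# Array: dense data, indexed by sequential number starting at 0.
my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
my $third_prime  = $primes[2];     # element at index 2
my $last_index   = $#primes;       # last valid index
my $num_elements = @primes;        # array in scalar context gives its length

# Hash: keys are the primes themselves, so membership is a direct lookup.
my %is_prime = map { $_ => 1 } @primes;
my @key_list = keys(%is_prime);    # key order is not guaranteed
my $num_keys = keys(%is_prime);    # keys in scalar context gives the count

my $answer_19 = exists $is_prime{19} ? 'yes' : 'no';
my $answer_21 = exists $is_prime{21} ? 'yes' : 'no';

print "third prime: $third_prime, last index: $last_index, count: $num_elements\n";
print "is 19 a prime number? $answer_19\n";
print "is 21 a prime number? $answer_21\n";
```

Answering "is 19 a prime number?" with the array would require scanning it, while the hash answers with a single lookup.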




              Programming Languages                                                Programming Languages
                                                                      In many cases, a simple if statement is not sufficient, as
Control of Program Execution                                          multiple alternative outcomes need to be evaluated.

if -- executes a block of code, if the condition evaluates to TRUE    if($light eq "green") {
                                                                              continue_driving();
if($light eq "green") {                                               } else {
        continue_driving();                                                   stop_car();
}                                                                     }

if( ($light eq "green") && ($no_traffic) ) {                          if($light eq "green") {
         continue_driving();                                                  continue_driving();
}                                                                     } elsif($light eq "red") {
                                                                              stop_car();
                                                                      } else {
                                                                              go_fast_to_beat_the_yellow();
                                                                      }




              Programming Languages                                                Programming Languages
Control of Program Execution                                           Foreach Loop…

Sometimes you need to iterate through a statement multiple times...    foreach $var (@list) {
                                                                           do_stuff($var);
Looping constructs:                                                    }
   for (…) { … }
   foreach $var (@list) { … }                                          foreach $name (@name_list) {
   while (COND) { … }                                                      print "Name = $name\n";
                                                                       }

                                                                       foreach $name (@name_list) {
                                                                           if($hair_color{$name} eq "blond") {
                                                                               print "$name has blond hair.\n";
                                                                           }
                                                                       }




              Programming Languages                                                Programming Languages
for (INIT; COND; POST) {                                                while (COND) {
        do_stuff();                                                            do_stuff();
}                                                                       }

for ($i=0; $i < 50; $i++) {                                             while($line = <FILE_HANDLE>) {
        print "i = $i\n";                                                      print "$line";
}                                                                       }

for ($i=0; $i < 50; $i++) {                                             while($flag == 0) {
        if($prime{$i} == 1) {                                                  if($prime{$position} == 1) {
                print "$i is prime!\n";                                                $flag = 1;
        } else {                                                               } else {
                print "$i is not prime.\n";                                            $position++;
        }                                                                      }
}                                                                       }
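
The two prime-number loops above can be made runnable once %prime is filled in. The sketch below assumes a small hand-written hash and a starting position of 8; both are illustrative choices. It tests `$prime{$i}` for truth rather than `== 1`, because comparing a missing key numerically triggers a warning under `use warnings`.

```perl
use strict;
use warnings;

# Hypothetical lookup table: keys are primes, values are 1.
my %prime = map { $_ => 1 } (2, 3, 5, 7, 11, 13);

# for loop: classify every number in a fixed range.
my @report;
for (my $i = 0; $i < 8; $i++) {
    if ($prime{$i}) {
        push @report, "$i is prime!";
    } else {
        push @report, "$i is not prime.";
    }
}
print "$_\n" foreach @report;

# while loop: advance until the next prime at or after $position is found.
my $position = 8;
my $flag     = 0;
while ($flag == 0) {
    if ($prime{$position}) {
        $flag = 1;
    } else {
        $position++;
    }
}
print "next prime at or after 8: $position\n";
```

The while loop only terminates if a prime exists at or after the starting position, so the starting value must not exceed the largest key in the table.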




                                                                                 Review of Perl Concepts

                                                                    Data Types
                     Intermission                                      scalar
                                                                       array
                                                                       hash

                                                                    Input/Output
                                                                       open(FILEHANDLE,"filename");
                                                                       $line = <FILEHANDLE>;
                                                                       print "$line";

                                                                    Arithmetic Operations
                                                                        +, -, *, /, %
                                                                        &&, ||, !




            Review of Perl Concepts                                               Regular Expressions

                                                                    General approach to the problem of pattern matching
Control Structures
   if                                                               RE’s are a compact method for representing a set of possible
   if/else                                                          strings without explicitly specifying each alternative.
   if/elsif/else
                                                                    For this portion of the discussion, I will be using {} to
    foreach                                                         represent the scope of a set.
                                                                        {A}
    for                                                                 {A,AA}
    while                                                           Ø = the empty string; {Ø} = the set containing only the empty string




              Regular Expressions                                                 Regular Expressions
In addition, the [] will be used to denote possible alternatives.   Additional Regular Expression components
                                                                       * = 0 or more of the specified symbol
   [AB] = {A,B}                                                        + = 1 or more of the specified symbol

With just these semantics available, we can begin building          A+ = {A, AA, AAA, … }
simple Regular Expressions.                                         A* = {Ø, A, AA, AAA, … }

   [AB][AB] = {AA, AB, BA, BB}                                      AB* = {A, AB, ABB, ABBB, … }
   AA[AB]BB = {AAABB,AABBB}                                         [AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }




               Regular Expressions                                                Regular Expressions
What if we want a specific number of iterations?
                                                                    All of these operations are available in Perl
A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}                                  Several “shortcuts”
                                                                        Name             Definition               Code
                                                                        Whitespace       [space, tab, new-line]   \s
                                                                        Word character   [a-z…

What if we want any character except one?
[^A] = {B}

What if we want to allow any symbol?

. = {A, B}                                                          \d = {0, 2, 3, 4, 5, 6, 7, 8, 9}
.* = {Ø, A, B, AA, AB, BA, BB, … }                                  \w+\s\w+ = {…, Hello World, … }
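
The set notation above can be checked directly in Perl. In this sketch each pattern is anchored with ^ and $ so that it must describe the whole string, which mirrors the set definitions. The candidate strings are arbitrary test values.

```perl
use strict;
use warnings;

my @candidates = ("", "A", "AB", "ABB", "BA", "AAAAA");

# AB* = {A, AB, ABB, ...}: one A followed by zero or more B.
my @matches_ab_star = grep { /^AB*$/ } @candidates;

# A{2,4} = {AA, AAA, AAAA}: between two and four A.
my @matches_a_2_4 = grep { /^A{2,4}$/ } ("A", "AA", "AAAA", "AAAAA");

# [^A] over the alphabet {A, B} leaves only B.
my @matches_not_a = grep { /^[^A]$/ } ("A", "B");

# \w+\s\w+ : a word, one whitespace character, another word.
my $two_words = ("Hello World" =~ /^\w+\s\w+$/) ? 1 : 0;

print "AB*    : @matches_ab_star\n";
print "A{2,4} : @matches_a_2_4\n";
print "[^A]   : @matches_not_a\n";
print "two words matched: $two_words\n";
```

Without the anchors, each pattern would also succeed on any longer string that merely contains a matching substring.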




                 Pattern Matching                                                     Pattern Matching

Perl supports built-in operations for pattern matching,            Back references…
substitution, and character replacement
                                                                   if($line =~ m/(Rn\.\d+)/) {
Pattern Matching                                                           $UniGene_label = $1;
                                                                   }
if($line =~ m/Rn\.\d+/) {
        ...
}

In Perl, RE’s can be a part of the string rather than the whole
string.
        ^ - beginning of string
        $ - end of string
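
A short sketch of a back reference and the two anchors in use. The input line is a made-up example in the style of a UniGene record. The dot in the pattern is escaped so that it matches a literal period rather than any character.

```perl
use strict;
use warnings;

# Hypothetical input line; the label value is invented for illustration.
my $line = "ID          Rn.1234";

# Parentheses capture the matched text into $1.
my $UniGene_label;
if ($line =~ m/(Rn\.\d+)/) {
    $UniGene_label = $1;
}

# ^ ties the pattern to the beginning of the string, $ to the end.
my $starts_with_id  = ($line =~ m/^ID/)  ? 1 : 0;
my $ends_with_digit = ($line =~ m/\d$/)  ? 1 : 0;
my $starts_with_rn  = ($line =~ m/^Rn/)  ? 1 : 0;

print "label: $UniGene_label\n";
print "starts with ID: $starts_with_id, ends with digit: $ends_with_digit\n";
print "starts with Rn: $starts_with_rn\n";
```

The unanchored pattern finds the label in the middle of the line, while the anchored `^Rn` fails because the line does not begin with it.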




               Regular Expressions                                                    Pattern Matching
 $file = "my_fasta_file";                                         UniGene data file
 open(IN, $file);
 $seq_count = 0;                                                  ID                   Bt.1
 while($line = <IN>) {                                            TITLE                Cow casein kinase II alpha …
         if($line =~ m/^\>/) {                                    EXPRESS              ;placenta
                 $seq_count++;                                    PROTSIM              ORG=Caenorhabditis elegans; …
         }                                                        PROTSIM              ORG=Mus musculus; PROTGI=…
 }                                                                SCOUNT               2
 print "There are $seq_count FASTA sequences in $file.\n";        SEQUENCE             ACC=M93665; NID=g162776; …
                                                                  SEQUENCE             ACC=BF043619; NID=…
                                                                  //
                                                                  ID                   Bt.2
                                                                  TITLE                Bos taurus cyclin-dependent …
                                                                  ...




                 Pattern Matching                                                 Pattern Matching

Let’s write a small Perl program to determine how many          Now we’ll build a Perl program that can write an HTML file
clusters there are in the Bos taurus UniGene file.              containing some basic links based on the Bos taurus UniGene
                                                                clustering.

                                                                Important:


                                                                http://www.ncbi.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
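
One possible shape for both programs is sketched below. It counts clusters by counting lines that begin with ID, and builds one link per SEQUENCE line. Several things here are assumptions: the sample records are abbreviated, the Bt.2 SEQUENCE values are invented placeholders, and the number after `NID=g` is assumed to be the identifier that replaces GID_HERE in the URL.

```perl
use strict;
use warnings;

# Abbreviated sample records. The Bt.2 SEQUENCE values are hypothetical
# placeholders, not real accessions.
my $data = <<'END';
ID          Bt.1
SCOUNT      1
SEQUENCE    ACC=M93665; NID=g162776
//
ID          Bt.2
SCOUNT      1
SEQUENCE    ACC=X00000; NID=g1
//
END

my $url_prefix = 'http://www.ncbi.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=';
my $url_suffix = '&dopt=GenBank';

open(my $in, '<', \$data) or die "Cannot open: $!";
my $cluster_count = 0;
my $cluster       = '';
my @links;
while (my $line = <$in>) {
    if ($line =~ m/^ID\s+(\S+)/) {
        # A new cluster record starts at each ID line.
        $cluster = $1;
        $cluster_count++;
    } elsif ($line =~ m/^SEQUENCE\s+ACC=(\w+);\s*NID=g(\d+)/) {
        # Assumption: the digits after "NID=g" are the GID for the URL.
        my ($acc, $gid) = ($1, $2);
        push @links, "<a href=\"$url_prefix$gid$url_suffix\">$cluster: $acc</a><br>";
    }
}
close($in);

print "There are $cluster_count clusters.\n";
print "<html><body>\n", join("\n", @links), "\n</body></html>\n";
```

For the real data file, the in-memory string would be replaced by the UniGene file name, and the HTML would be printed to an output filehandle opened with ">".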




                      Substitution                                                    Substitution

Pattern matching is useful for counting or indexing items,       Substitution can take several different options.
but to modify the data, substitution is required.                   specified after the final slash

Substitution searches a string for a PATTERN and, if found,      The most useful are
replaces it with REPLACEMENT.                                       g - global (can substitute at more than one location)
                                                                    i - case insensitive matching
$line =~ s/PATTERN/REPLACEMENT/;
                                                                 $string = "One fish, Two fish, Red fish, Blue fish.";
Returns a value equal to the number of times the pattern was     $string =~ s/fish/dog/g;
found and replaced.                                              print "$string\n";

$result = $line =~ s/PATTERN/REPLACEMENT/;                       One dog, Two dog, Red dog, Blue dog.
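
The return value and the g and i options can be seen side by side in a short sketch. The variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "One fish, Two fish, Red fish, Blue fish.";

# Without g, only the first occurrence is replaced.
my $first_only = $string;
my $n_first    = ($first_only =~ s/fish/dog/);

# With g, every occurrence is replaced and the count is returned.
my $all   = $string;
my $n_all = ($all =~ s/fish/dog/g);

# With i, upper and lower case are treated alike.
my $mixed = "Fish and fish";
my $n_ci  = ($mixed =~ s/fish/dog/gi);

print "$first_only ($n_first replaced)\n";
print "$all ($n_all replaced)\n";
print "$mixed ($n_ci replaced)\n";
```

Each substitution modifies its variable in place, which is why copies of the original string are made first.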




                      Substitution                                           Character Replacement
                                                                  A similar operation to substitution is character replacement.
Example: Removing leading and trailing white-space
                                                                  $line =~ tr/a-z/A-Z/;
$line =~ s/^\s*(.*?)\s*$/$1/;
                                                                  $count_CG = $line =~ tr/CG/CG/;
a *? performs a minimal match…
    it will stop at the first point that the remainder of the     $line =~ tr/ACGT/TGCA/;
    expression can be matched.
                                                                         $line =~ s/A/T/g;
$line =~ s/^\s*(.*)\s*$/$1/;                                             $line =~ s/C/G/g;
    this statement will not remove trailing white-space,                 $line =~ s/G/C/g;
    instead the white space is retained by the .*                        $line =~ s/T/A/g;
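
Both points on these slides can be demonstrated in a few lines. The sketch below first compares the minimal and greedy trimming patterns, then compares tr with the four sequential substitutions. It reads the sequential version as an example of what goes wrong when characters are swapped one pass at a time; that reading is an assumption. The test strings are made up.

```perl
use strict;
use warnings;

# Trimming: minimal (.*?) versus greedy (.*).
my $padded = "   ACGT acgt   ";
(my $minimal = $padded) =~ s/^\s*(.*?)\s*$/$1/;   # trailing spaces removed
(my $greedy  = $padded) =~ s/^\s*(.*)\s*$/$1/;    # trailing spaces kept by .*

# Complementing a sequence: tr maps every character in a single pass.
my $seq = "AACG";
(my $complement = $seq) =~ tr/ACGT/TGCA/;

# Sequential substitutions re-convert characters changed by earlier steps.
my $sequential = $seq;
$sequential =~ s/A/T/g;
$sequential =~ s/C/G/g;
$sequential =~ s/G/C/g;
$sequential =~ s/T/A/g;

# tr also returns the number of characters it matched.
my $count_CG = ($seq =~ tr/CG/CG/);

print "minimal: [$minimal]\n";
print "greedy:  [$greedy]\n";
print "tr: $complement, sequential s///: $sequential, CG count: $count_CG\n";
```

The sequential result differs from the tr result because the later substitutions cannot tell original characters from ones produced by earlier steps.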




          Character Replacement                                                 Subroutines
# add each line's counts to running totals
while($line = <IN>) {                                       One of the most important aspects of programming is dealing
        $count_CG += ($line =~ tr/CG/CG/);                  with complexity. A program that is written in one large
        $count_AT += ($line =~ tr/AT/AT/);                  section is generally more difficult to debug. Thus a major
}                                                           strategy in program development is modularization.
$total = $count_CG + $count_AT;
$percent_CG = 100 * ($count_CG/$total);                     Break the program up into smaller portions that can each be
                                                            developed and tested independently.
print "The sequence was $percent_CG% CG-rich.\n";
                                                            Makes the program more readable, and easier to maintain and
                                                            modify.




                   Subroutines                                                  Subroutines

EXAMPLE:                                                    ISSUES:
  Reading in sequences from UniGene.all.seq file               1. Want to design and implement a usable program
                                                               2. Use subroutines where useful to reduce complexity.
Multiple FASTA sequences in a single file, each annotated      3. Minimize the memory requirements.
with the UniGene cluster they belong to.                           (human UniGene seqs > 2 GB)

GOAL:
  Make an output file consisting only of the longest
  sequence from each cluster.
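
A sketch of how the goal might be approached with two small subroutines. It reads one record at a time and keeps only the longest sequence seen so far for each cluster, so memory use grows with the number of clusters rather than the number of sequences. The header layout (`/ug=CLUSTER` in the FASTA header line) and all sample data are assumptions made for illustration, as are the subroutine names.

```perl
use strict;
use warnings;

# Hypothetical input: the "/ug=CLUSTER" header annotation is an assumption.
my $fasta = <<'END';
>seq1 /ug=Bt.1
ACGT
ACG
>seq2 /ug=Bt.1
ACGTACGTAC
>seq3 /ug=Bt.2
TTTT
END

# Return the cluster label from a FASTA header line, or undef if absent.
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\S+)} ? $1 : undef;
}

# Store a record only if it is the longest seen so far for its cluster.
sub keep_if_longest {
    my ($best, $header, $seq) = @_;
    my $cluster = cluster_of($header);
    return unless defined $cluster;
    if (!exists $best->{$cluster}
        || length($seq) > length($best->{$cluster}{seq})) {
        $best->{$cluster} = { header => $header, seq => $seq };
    }
}

open(my $in, '<', \$fasta) or die "Cannot open: $!";
my %best;
my ($header, $seq) = (undef, '');
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ m/^>/) {
        # A new header closes the previous record.
        keep_if_longest(\%best, $header, $seq) if defined $header;
        ($header, $seq) = ($line, '');
    } else {
        $seq .= $line;    # sequences may span several lines
    }
}
keep_if_longest(\%best, $header, $seq) if defined $header;   # final record
close($in);

foreach my $cluster (sort keys %best) {
    print "$best{$cluster}{header}\n$best{$cluster}{seq}\n";
}
```

If the sequences of each cluster are stored contiguously in the file, memory could be reduced further by printing a cluster's longest sequence as soon as the next cluster begins.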



