
Introduction to UNIX Commandline PERL


Computing Concepts for Bioinformatics
             To web or not to web is the
             Command line/shell review
             Command line exercise
             Introduction to EMBOSS
             EMBOSS input/output
             Programming Process
             Perl concepts
             Your first perl program
Which tools to employ and when!
Criteria            Web                 Local
User Interface      Simple/Easy         Not Intuitive
Availability        Almost Reliable*    Reliable
Restrictions        Many                Few
Speed               Good to Bad         Fair
Amount (# of seq)   Limited Number      High Throughput
Storage             Limited to None     Excellent
Update (Database)   Good                Fair*
Update (Programs)   Good                Fair*
Maintenance         Up to Provider      Good
Cost                None                $$$
Control             None                Excellent
Things to consider for web tools
   Web browsers limit how much you can copy/paste;
    use the file upload option for large sequences
   If the dataset to analyze is large, use an e-mail
    service or look for local resources [Super
    computer, Grid, BCF]
   Reproducibility: The underlying database or
    program may get updated while you are in the
    middle of analyzing your dataset
   Extensibility: If you have multiple steps in
    your analysis, rely on local resources
    unless you have customized web tools
Things to consider for web tools
   Availability and Stability
       Authors move and institutes take down
        websites [legal and political reasons]
       Especially for “bouquet analysis”, try to get the
        program and source from the author
   Security: Many private sector grants
    require that you DO NOT use public
   Output formats: Make sure the tool uses a
    standard or documented format [XML, GFF];
    otherwise you cannot extend/import your
    analysis into other programs
   Do not abuse web resources (by writing
    scripts to hog them)
Common Tasks

 Count files, space used/available
 Copy/Move many files [directories]
 Combine files
 Split files
 Maximize disk space
 Transferring files
  [between machines/ accounts]
Common Tasks: Space
   Each account is assigned specific amount
    of disk space(quota)
   To check how much disk space you have
    type: quota -v
   Output is in kilobytes [roughly 1000 KB = 1 MB]

   Always check availability of space before
    you start a large/intensive analysis
   Running out of space will lead to
    incomplete analysis/corrupt reports etc.
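The space check can also be scripted so that it runs before a long job starts. This is a minimal sketch, not the course's own method: the 1000 KB requirement is a made-up number, and it uses df (free space on the filesystem that holds the current directory) instead of quota, because quota is not configured on every system.

```shell
# Hypothetical requirement: the analysis needs at least 1000 KB of free space.
needed_kb=1000

# df -P -k prints one line per filesystem in 1024-byte units;
# the fourth column of the second line is the space available.
avail_kb=$(df -P -k . | awk 'NR==2 {print $4}')

if [ "$avail_kb" -ge "$needed_kb" ]; then
    echo "OK: $avail_kb KB available"
else
    echo "Not enough space: $avail_kb KB available, $needed_kb KB needed"
fi
```

Note that df reports what is free on the whole filesystem; a per-account quota can be smaller than that, so quota -v remains the authoritative check where it is available.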
Common Tasks: Files
     ls to list files [like dir in DOS]

     -ltr [long listing, sorted by time stamp, reverse order]
     Check the manual [man ls] for many other options
Common Tasks: File and Space
  du reports disk usage
  Per file/directory
  -k option gives output in kilobytes
  -s gives the total only [no individual entries]
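A small sketch of the difference between du -k and du -sk, run on a throwaway directory so it is safe to try anywhere. The file and directory names are invented for illustration.

```shell
# Build a small throwaway directory tree (names are made up for illustration).
workdir=$(mktemp -d)
mkdir "$workdir/sub"
printf 'ACGT\n' > "$workdir/a.txt"
printf 'TTGA\n' > "$workdir/sub/b.txt"

# One line per directory, sizes in kilobytes.
du -k "$workdir"

# -s collapses the report to a single total for the argument.
total_kb=$(du -sk "$workdir" | awk '{print $1}')
summary_lines=$(du -sk "$workdir" | wc -l | tr -d ' ')
echo "Total: $total_kb KB"

rm -r "$workdir"
```

The exact numbers depend on the filesystem's block size, so treat them as approximate.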
Common Tasks: File manip.
  Merge files using the concatenate command (cat)
  cat file1.txt file2.txt
  cat seq1.tfa seq2.tfa > big.tfa
  cat seq3.tfa seq4.tfa >> big.tfa
  Splitting file content can be done
   using the split or csplit command
   ..we will do most of this using
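The difference between > and >>, and a basic use of split, can be tried on toy files. This is a sketch with invented three-record data, not the course dataset; split -l is only one of several ways to cut a file.

```shell
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/seq1.tfa"
printf '>seq2\nGGCC\n' > "$workdir/seq2.tfa"
printf '>seq3\nTTAA\n' > "$workdir/seq3.tfa"

# > creates big.tfa (or overwrites it if it already exists).
cat "$workdir/seq1.tfa" "$workdir/seq2.tfa" > "$workdir/big.tfa"
# >> appends to the end of the existing big.tfa.
cat "$workdir/seq3.tfa" >> "$workdir/big.tfa"
merged_lines=$(wc -l < "$workdir/big.tfa" | tr -d ' ')

# split -l 2 cuts the file into pieces of 2 lines each,
# named part_aa, part_ab, part_ac, ...
split -l 2 "$workdir/big.tfa" "$workdir/part_"
pieces=$(ls "$workdir"/part_* | wc -l | tr -d ' ')
echo "$merged_lines lines merged, split into $pieces pieces"

rm -r "$workdir"
```

Splitting by line count keeps records intact only when every record has the same number of lines, as in this toy data; real sequence files usually need a record-aware tool.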
Common Tasks: Compression
 Sequence data is plain text and can be
  compressed by 60% to 75%
 Compressed files are easier to handle
  [less time to transfer/move]
 gzip filename [create filename.gz]
 gunzip expands [gzcat is ?]
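A round trip through gzip and gunzip on a generated file shows what each command does to the file on disk. The sketch uses gunzip -c to view compressed content without expanding the file; gzcat (zcat on many Linux systems) behaves the same way. The test data is deliberately repetitive, so the size reduction here is much larger than the 60% to 75% quoted for real sequence data.

```shell
workdir=$(mktemp -d)
# Write 200 identical lines; real sequence data is less repetitive
# and will not compress this well.
i=0
while [ "$i" -lt 200 ]; do
    printf 'ACGTACGTACGTACGTACGTACGTACGTACGT\n'
    i=$((i + 1))
done > "$workdir/seq.tfa"
before=$(wc -c < "$workdir/seq.tfa" | tr -d ' ')

# gzip replaces seq.tfa with seq.tfa.gz.
gzip "$workdir/seq.tfa"
after=$(wc -c < "$workdir/seq.tfa.gz" | tr -d ' ')

# gunzip -c writes the expanded text to standard output
# and leaves the .gz file in place.
first_line=$(gunzip -c "$workdir/seq.tfa.gz" | head -n 1)

# gunzip without -c restores seq.tfa and removes the .gz file.
gunzip "$workdir/seq.tfa.gz"
restored=$(wc -c < "$workdir/seq.tfa" | tr -d ' ')
echo "before=$before after=$after restored=$restored"

rm -r "$workdir"
```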
Common Tasks: File transfer
 scp Secure copy between machines
 scp file.txt user@machine:
   scp -r eeb/class2files/
    For directories/recursive from amadeus to
    my u.arizona account
   To transfer files between users
Common Tasks: File transfer
    Windows to amadeus:
        Download and install WinSCP
        Use windows explorer and mount your
         folder (More later)

    Macintosh (OS X) to amadeus:
        Download and Install Fugu (OS X)
        Use Finder to mount your folder
         (More later)
   Open a BioDesk session
   Open Xterminal
   Open Editor (nedit or your favorite)
   Command line (type into Xterminal)
       Remember to put a space between options
        cp -r /home/student/samples/ ./
        cd test (what does cd without arguments do ?)
        ls -al *.pl (what is a wild card ?)
        pwd (print working directory ..where are you
        now ? use cd if you are lost !)
       What is . and   ..
   When typing on the command line use the
    up and down arrow key to navigate between
    previous commands
   Use the right and left arrow key to move
    along the command line (to modify stuff)
   When typing a command use the
    “tab” key to autofill names
       e.g. cd pu<TAB> should fill in the rest
       If it does not ..provide a few more characters (you
        may have 2 directories starting with pu:
        public_html and put_results)
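The navigation commands above can be tried safely in a scratch directory. This is a sketch with invented file names; it shows a wild card (*.pl) selecting files by pattern, and the special names . (current directory) and .. (parent directory).

```shell
start=$(pwd)
workdir=$(mktemp -d)
mkdir "$workdir/test"
touch "$workdir/test/a.pl" "$workdir/test/b.pl" "$workdir/test/notes.txt"

cd "$workdir/test" || exit 1
here=$(pwd)                                # where are we now?
pl_count=$(ls *.pl | wc -l | tr -d ' ')    # *.pl matches a.pl and b.pl only

cd .                                       # . is the current directory: no move
same=$(pwd)
cd ..                                      # .. is the parent directory: up one level
parent=$(pwd)

echo "$pl_count Perl files in $here; parent is $parent"

# Return to where we started and clean up.
cd "$start" || exit 1
rm -r "$workdir"
```

Running cd with no arguments returns to your home directory, which is why it helps when you are lost.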
Exercise 1: Working with seq.
    I have approx ???? seq in ?? files
    Located in the directory
    Check your quota (quota -v)
1.   Copy them to your account
     cp -r /home/student/2003/eeb/exercise-1 ./
     (Use the tab key for auto fill and remember space
     between options)
2.   Count how many files we have ?
     (hint: use ls and wc) (use ls and go to correct dir.)
3.   Check your quota and disk space (du -k)
4.   Count how many sequences we have ?
     cat *.tfa | grep ">" | wc -l
Exercise 1: Cont
 1.   Gzip all files (gzip -v *.tfa)
 2.   -v provides progress/report
 3.   Check disk usage
 4.   Use ls | more to see files
 5.   Use gzcat filename | more
 6.   Count number of sequences in
 7.   Gunzip all files (gunzip *.gz)
 8.   Combine 1-bacteria.tfa 2-bacteria.tfa into
      new.tfa (hint use cat and >)
 9.   Count the number of sequences in new.tfa
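The counting and combining steps of the exercise can be rehearsed on toy data before touching the real files. This sketch builds two tiny files with invented sequences (only the names 1-bacteria.tfa and 2-bacteria.tfa come from the exercise) and assumes every record starts with a ">" header line.

```shell
workdir=$(mktemp -d)
# Toy data: 2 records in the first file, 1 in the second (sequences are invented).
printf '>b1 first\nACGT\n>b2 second\nGGCC\n' > "$workdir/1-bacteria.tfa"
printf '>b3 third\nTTAA\n' > "$workdir/2-bacteria.tfa"

# Count files, then count records by counting header lines.
file_count=$(ls "$workdir"/*.tfa | wc -l | tr -d ' ')
seq_count=$(cat "$workdir"/*.tfa | grep '>' | wc -l | tr -d ' ')

# Records can be counted without expanding the compressed files on disk.
gzip "$workdir"/*.tfa
gz_seq_count=$(gunzip -c "$workdir"/*.gz | grep -c '>')
gunzip "$workdir"/*.gz

# Combine both files into a new one and count again.
cat "$workdir/1-bacteria.tfa" "$workdir/2-bacteria.tfa" > "$workdir/new.tfa"
new_seq_count=$(grep -c '^>' "$workdir/new.tfa")

echo "files=$file_count sequences=$seq_count gz=$gz_seq_count new=$new_seq_count"
rm -r "$workdir"
```

grep -c counts matching lines directly, so it can stand in for the grep | wc -l pair.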
Setting up your Editor (Nedit)
 Set your preferences (syntax highlight,
  line number)
 Save default
 Exit
 Restart
Nedit (File dialog box)
    Ignore everything with .
    Double click on directory
     or select with mouse and
     use “enter” key
    What is . and ..
    Use filter if you have many
     files ( *.pl )
    Select the file to edit/open with
     mouse (should have black background)
     then click on OK
    Save (Control-s) and Save As
   European Molecular Biology Open Software Suite
   Free, open source software analysis package
    specially developed for the needs of the
    molecular biology community
   Provides a comprehensive set of sequence
    analysis programs (approximately 100)
EMBOSS (programs)
   Integrates other publicly available packages
   Can be accessed through BioPERL modules
    (easy automation)
   Sequence alignment
   Rapid database searching with sequence patterns
   Protein motif identification, including domain analysis
   Nucleotide sequence pattern analysis, for
    example to identify CpG islands or repeats.
   Codon usage analysis for small genomes
   Rapid identification of sequence patterns in
    large scale sequence sets.
   Database creation/indexing
Interacting with EMBOSS
   EMBOSS programs are run by typing them
    at the UNIX prompt (in your Xterminal)
    with or without parameters/options
   EMBOSS command syntax follows normal
    UNIX command conventions
   It will prompt you for parameters not
    provided when invoking the program
   When in doubt use:
    program_name -help (seqret -help)
    tfm program_name ( tfm seqret )
   Use wossname to search for a program by keyword
Sequence Formats
   Sequence Formats:
         •   FASTA
         •   GenBank
         •   EMBL
         •   SwissProt
         •   PIR
   FASTA format:
    >Seq_Name description and some other comment
   IDs and Accessions
         • IDs were human readable, and the name suggested function etc.
         • Accession numbers are database assigned
           (nowadays they are the same as the ID)
         • ID 'hsfau' is the 'Homo Sapiens FAU pseudogene';
           its accession # is X65923 (sometimes Accession.1 for version)
   Multiple sequence per file
   No connection between file name and ID
   GFF and Reports (Covered later on)
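Because the file name says nothing about the IDs inside, it is often useful to list the IDs a FASTA file actually contains. This is a sketch on a toy file: the description text, the sequences, and the second record are invented, and it assumes the common convention that the ID is the first word after ">".

```shell
workdir=$(mktemp -d)
# Two toy records in one file; descriptions and sequences are invented.
printf '>hsfau X65923 example description\nTTCCTCTTTC\nTCGACTCCAT\n>seq2 another record\nACGT\n' > "$workdir/anything.tfa"

# Header lines start with ">"; the ID is the first word after it.
ids=$(grep '^>' "$workdir/anything.tfa" | sed 's/^>//' | awk '{print $1}')
id_count=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
first_id=$(printf '%s\n' "$ids" | head -n 1)

echo "$ids"
rm -r "$workdir"
```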
   USA (Uniform Sequence Address)
         • "format::file"
         • "format::file:entry"
         • "dbname:entry" (we don't have this
         • "@listfile" (a file of file-names; ls *.seq > mylist)
   Format is not required when reading in a sequence,
    EMBOSS will guess the sequence format by trying
    all known formats until one succeeds
   When writing out a sequence, EMBOSS will use fasta
    format by default. You can specify another format to
Programming Process
    When asked to develop something, look around
     before you reinvent the wheel
    Requirement Analysis: What input,
     output, formats, source for data,
     frequency of update etc.
    Design Phase (how and what to use)
    Flow charts for (logic and data) UML,
     use cases
    Pseudocode
     get filename
        open file and read sequences
              For each sequence
                   If length is greater than
                   print error msg #
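One way the pseudocode could look as a runnable command-line sketch. The slide does not say what the length is compared against, so the limit of 10 residues below is a hypothetical value, as are the toy records; the wording of the error message is also invented.

```shell
workdir=$(mktemp -d)
# Toy input: "short" has 4 residues, "long" has 14 spread over two lines.
printf '>short\nACGT\n>long\nACGTACGT\nACGTAC\n' > "$workdir/in.tfa"

# Hypothetical limit; the original pseudocode leaves this value unspecified.
max_len=10

report=$(awk -v max="$max_len" '
    # Report the record just finished if it is longer than the limit.
    function check() {
        if (name != "" && len > max)
            print "ERROR: " name " has length " len
    }
    /^>/ { check(); name = substr($1, 2); len = 0; next }   # new record starts
         { len += length($0) }                              # sequence line
    END  { check() }                                        # last record
' "$workdir/in.tfa")

echo "$report"
rm -r "$workdir"
```

Sequence length is summed across lines because a FASTA record may wrap over several lines.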
Programming …
   Now start coding
   Always comment your code
   Use version control
    filename.1 etc for small project
   Code has to be human readable but
    machine parseable !
   Test and debug code using different
    scenarios for input
   Don't feel shy about using paper and pencil; it's
    easier at times
Introduction to PERL
   Invoking PERL Basic Input/Output
    STDIN, STDOUT, print and writing to files, sockets
   Variables: Scalar Data
    Numbers 12, 12e5, -12.534
    Strings "who likes Austin Powers?"
    Operators +, -, <, > =
   Flow Control
    if, while, for, foreach
   Arrays
Invoking Perl
    First line of a perl program:
     #! /usr/local/bin/perl
    # by itself marks a comment, i.e. the line is not
     executed
    It is important to comment your code
 # Program by Baha Men (Nov10,2000)
 print "Who let the dogs out?\n";
 # The above line outputs to screen
 # the (only) famous song by the group
    A variable stores values while your
     program is running
    You can set initial values of variables
     and modify these values as the
     program executes.
    No need to predefine
    Automatically get global scope *
    You can store numbers and text in
     variables (note the " " for text)
     $a = 1;
     $z = $a + 3.1412653505;
     $b = "I put the cat out";
     $gene_name = "C127899.1";
Arithmetic Operators
    $a = 1 + 2;     # Add 1 and 2 and store in $a
    $a = 3 - 4; # Subtract 4 from 3 and store in $a
    $a = 5 * 6;     # Multiply 5 and 6
    $a = 7 / 8;     # Divide 7 by 8 to give 0.875
    $a = 9 ** 10;# Nine to the power of 10
    $a = 5 % 2;     # Remainder of 5 divided by 2
    $a++;           # Increment $a by 1
    $a--;           # Decrement $a by 1
    if ($a <= 2)    # Less than or equal
String Operator
   $b = "Hello"; $c = "World";
   $a = $b . $c;   # Concatenate $b and $c
       print $a;    # This is HelloWorld
   We can do the same using
    print "$b $c from me\n";
     # This will print Hello World from me
    # Followed by a newline
   Difference between ' and " covered later
   \n is the newline character; it ends the current line
   \t is the tab character, i.e. Hello   World
Testing Values
    $a == $b # Is $a numerically equal to $b ?
             # Beware: Don't use the = operator.

    $a != $b # Is $a numerically not equal to $b?
    $a eq $b # Is $a string-equal to $b?
    $a ne $b # Is $a string not equal to $b?
    Use == for numbers and eq for strings
Flow Control

     for (initialize; test; increment)
            {
                first_action;
                second_action;
                etc
            }
     for ($count = 0; $count < 10; $count++)
            { print " Count is = $count \n"; }

     while (condition)
            {
                etc
            }
     while ($president ne "Nader")
            {
                print "Try again\n";     # Ask again
                $president = <STDIN>;    # Get input
                chomp $president;        # Chop off newline
            }

     if (condition)
            {
                etc
            }
    An array variable is a list of scalars (i.e. numbers
    and/or strings).
   Same format as scalar except that they have
    a @, i.e. @names is an array while $names is
    a scalar
   @names = ("Al","George","Ralph");
    @party =
   Array data can be referenced by using the
    index number, which starts from 0
    $names[0] is Al and $party[1] is Republican
   You can set values using $names[3]="Pat";
I get the picture ..just get on with it !
   Your first program
   Create directory prog1 save files
   Print hello world (That's too easy)
   Ask the user for a name
   Greet the user
   Ask the user for password
   If it matches the password yahoo
    then greet else boot
   You can type perl
    or chmod u+x and run it
    ./ (remember to cd prog1)
Your program
 #! /usr/local/bin/perl
 # Customary first line (above)
 print "Please enter your name: ";
 # Prompts the user to type and hit enter
 $name = <STDIN>;
 chomp $name;
 # Read from keyboard and remove newline
 print "Hello $name please give me secret password: ";
 $password = <STDIN>;
 chomp $password;
 # Now compare it to hidden password
 if ($password eq "yahoo") {
     print "Welcome buddy\n";
 }
 else { print "Bite Me: $password is invalid\n"; }
Next Class
    Bring ideas for your final project
    NCBI Databases

    NCBI e-utils

    More Perl and BioPERL
