


									Laboratory of Bioinformatics tools
Irit Gat-viks
School of Computer Science
Tel-Aviv University
March 21, 2005.
                Laboratory of Bioinformatics tools: Assignment No. 3
                                  Due on May 9th

Pairwise alignment
You are given two human transcription factors: SP1 and EGR1.
   1. Search for human SP1 and EGR1 in Swiss-Prot. What are the known features in the sequences?
       (Provide their locations.)
   2. Run the following local pairwise alignment programs (with default parameters). Discuss
       the differences between the outputs, and compare them to your prior knowledge about the
       positions of the known features.
           a. LALIGN (local alignment) at [5].
           b. PLALIGN (dot plot) at [5].
           c. PRSS / PRFX (significance by Monte Carlo) at [5].
           d. Bl2seq at NCBI.

Sequence-based search – BLAST hands-on
Following reference [1], explore the human ATM protein.
a. Read p. 217-222 in [1] and BLAST your protein sequence. Explore the results in terms of paralogs
and orthologs (hint: there are many human hits; what is ATR?).
b. Change some of the BLAST parameters according to p. 225-228. Compare your results to the
BLAST result from (a).
c. PSI-BLAST your sequence according to p. 240-244. Compare your results to the BLAST result
from (a) (e.g., compare the S. cerevisiae homolog protein hits).

BLAST all against all
In this exercise we will explore the relationship between ‘comparative genomics’ and ‘functional
genomics’. Given orthologous protein pairs between yeast and worm, we try to justify the
commonly used paradigm that similar sequences are homologs and thus share the same structure
and function.
* Make sure that all created files are Tab delimited.
1. Our first step is to identify orthologous pairs. A naïve definition would simply maintain that,
given a protein from one genome, the protein from the other genome with the highest sequence
similarity is its ortholog. This technique is problematic. What is the problem? The COG database
uses a more elaborate technique. Read the introduction to the COG database [3]. How does
COG identify orthology?
2. Following the COG reasoning, we define “COG pairs” as pairs of proteins that are each other's
highest-similarity hit (reciprocal best hits). We analyze S. cerevisiae (yeast) and C. elegans (worm).
For an optional but informative and interesting reference, see [2]. Instructions:
a. Use the FASTA files of each organism's proteome, which reside in
"/scratch/blast/BT05/worm.fasta and yeast.fasta".
* To run BLAST against all of an organism's proteins you need to create a BLAST database.
The database is created from the command line, e.g.:
        >./formatdb -i yeast.fasta -o T -n yeast
For this exercise, don't build the database. Use the existing ones /scratch/blast/BT05/worm,
b. Run BLAST for all the yeast proteins against all the worm proteins, using the existing
worm database. Simply run the following command line from /scratch/blast/:

>/scratch/blast/blastall -p blastp -m 8 -d /scratch/blast/worm -i /scratch/BT05/yeast.fasta -o

* Your output file should be written to your own scratch directory (not the blast directory). It is about 15 MB.
* This run takes about 2 hours, so don't run it on nova during working hours. Alternatively, you can
use one of the plab computers, such as plab-10. Just type ">ssh plab-10".
c. Run BLAST again in the other direction, i.e., all the worm proteins against all the yeast proteins.
d. Describe what each flag in the command line stands for.
e. Finally, write the script that creates the COG pairs list. This script receives
the two BLAST output files and the output file name:
        >perl o1o2_blast_fn o2o1_blast_fn out_fn
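The reciprocal-best-hit idea in (2e) can be sketched as follows. This is a minimal illustration in Python rather than the script the assignment asks for, and it rests on two assumptions: the BLAST files are in the 12-column tabular format produced by -m 8 (bit score in the last column), and "highest sequence similarity" is taken to mean the highest bit score. The function names and the toy rows are hypothetical.

```python
def best_hits(blast_lines):
    """Map each query to its top-scoring subject in BLAST -m 8 tabular output.

    Assumes 12 tab-separated columns with the bit score in column 12.
    """
    best = {}  # query -> (bit_score, subject)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or line.startswith("#"):
            continue  # skip blank, comment, or malformed lines
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        if query not in best or bit_score > best[query][0]:
            best[query] = (bit_score, subject)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_pairs(hits_12, hits_21):
    """Keep (a, b) only when b is a's best hit and a is b's best hit."""
    return sorted((a, b) for a, b in hits_12.items() if hits_21.get(b) == a)


def toy_row(query, subject, bit_score):
    """Build one invented -m 8 row; only the ids and the score matter here."""
    filler = ["50.0", "100", "0", "0", "1", "100", "1", "100", "1e-20"]
    return "\t".join([query, subject] + filler + [str(bit_score)])


# Toy data: Y2's best hit is W1, but W1's best hit is Y1, so (Y2, W1) is dropped.
yeast_vs_worm = [toy_row("Y1", "W1", 200), toy_row("Y1", "W2", 50),
                 toy_row("Y2", "W1", 120), toy_row("Y3", "W3", 80)]
worm_vs_yeast = [toy_row("W1", "Y1", 210), toy_row("W2", "Y1", 55),
                 toy_row("W3", "Y3", 75)]

pairs = reciprocal_pairs(best_hits(yeast_vs_worm), best_hits(worm_vs_yeast))
for a, b in pairs:
    print(f"{a}\t{b}")
```

Ranking by e-value instead of bit score is an equally reasonable choice; whichever criterion is used should be applied in both directions.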
3. Analyze the functional annotations of the genes in the COG pairs. Instructions:
a. Download from [4], for each organism, the gene GO id annotation file ([6]). This file includes
GO functional annotations (GO ids) for each protein.
b. Our aim is to get a broad overview of the ontology content without the detail of the specific
fine-grained GO terms. We use GO slims, which are cut-down versions of the GO ontologies containing
a subset of the terms in the whole GO. We therefore created for you a conversion table (in
/scratch/blast/BT05/slim.table) that converts between original GO ids and GO slim ids. Its format
is: "go_id      slim_id1        slim_id2         … " (Tab delimited).
Use slim.table and the GO annotation file from (3a) to create GO Slim annotation files
(slim.annot), i.e., for each gene, the list of all its GO Slim ids. For that, write the script
> slim.table in_go_annot_fn slim.annot
c. Use the COG pairs list from (2e) and the slim.annot files from (3b) to create the ‘cogslim_table’,
giving for each COG pair the list of common GO Slim ids. For that, write the script
        > cog2_fn o1_slim_table_fn o2_slim_table_fn cogslim_table
d. Repeat the same analysis, but now with COG pairs defined naively as the best BLAST hit in one
direction (and not reciprocal). To do so, create another version of the script from (2e) that creates
one-directional COG pairs files (one for BLAST from yeast to worm, and one from worm to yeast).
Applying (3a)-(3c), we get two additional cogslim tables. In summary, we have 3 cogslim tables: one
cogslim table for reciprocal COG pairs, and 2 cogslim tables for one-directional COG pairs.
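Steps (3b) and (3c) amount to two dictionary lookups followed by a set intersection. The Python sketch below illustrates this under stated assumptions: slim.table has the "go_id slim_id1 slim_id2 …" layout given above, while the GO annotation file is assumed here to have been reduced to two tab-separated columns (gene id, GO id), which may not match the layout of the file downloaded from [4]. All ids in the toy data are invented placeholders, and the function names are hypothetical.

```python
def load_slim_table(lines):
    """Parse 'go_id<TAB>slim_id1<TAB>slim_id2...' into go_id -> set of slim ids."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            table[fields[0]] = {f for f in fields[1:] if f}
    return table


def build_slim_annot(annot_lines, slim_table):
    """Map each gene to the union of the slim ids of its GO ids.

    Assumes two tab-separated columns (gene, go_id); lines starting with '!'
    are treated as comments.
    """
    annot = {}
    for line in annot_lines:
        if line.startswith("!") or not line.strip():
            continue
        gene, go_id = line.rstrip("\n").split("\t")[:2]
        annot.setdefault(gene, set()).update(slim_table.get(go_id, set()))
    return annot


def cogslim_table(pairs, annot1, annot2):
    """For each pair, list the slim ids shared by both members (possibly none)."""
    return [(a, b, sorted(annot1.get(a, set()) & annot2.get(b, set())))
            for a, b in pairs]


# Toy input with placeholder ids (not real GO terms).
slim = load_slim_table(["GO:toy1\tSLIM_A",
                        "GO:toy2\tSLIM_A\tSLIM_B",
                        "GO:toy3\tSLIM_C"])
yeast_slim = build_slim_annot(["Y1\tGO:toy1", "Y1\tGO:toy3", "Y3\tGO:toy2"], slim)
worm_slim = build_slim_annot(["W1\tGO:toy2", "W3\tGO:toy3"], slim)

table = cogslim_table([("Y1", "W1"), ("Y3", "W3")], yeast_slim, worm_slim)
for a, b, common in table:
    print("\t".join([a, b] + common))
```

Genes with no annotation are treated as having an empty slim set, so their pairs appear in the table with no common ids; whether such pairs should be kept or dropped is a choice to state explicitly in the analysis.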

4. We now wish to analyze the percentage of COG pairs with identical functional annotation.
Write the script that receives a cogslim table and returns the % identity.
Generate a 3-bar histogram (one bar per cogslim table). Discuss your results in terms of the main
'similar sequence -> similar function' paradigm (what is the random expected frequency? You can
add it as a fourth bar!), and the advantages/disadvantages of reciprocal hits vs. one-directional hits.
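One way to compute the percentage in step 4, together with a rough random baseline, is sketched below in Python. It is an illustration under assumptions, not a prescribed method: a pair is counted as "identical" when it shares at least one GO Slim id (a stricter reading would require the two slim sets to be equal), and the random expectation is estimated by shuffling the partners among the pairs. The function names, the toy rows, and the default of 100 shuffles are hypothetical.

```python
import random


def percent_shared(cogslim_rows):
    """Percentage of rows (gene1, gene2, common_ids) with a non-empty common list."""
    if not cogslim_rows:
        return 0.0
    shared = sum(1 for _, _, common in cogslim_rows if common)
    return 100.0 * shared / len(cogslim_rows)


def random_expectation(pairs, annot1, annot2, trials=100, seed=0):
    """Estimate the chance level by re-pairing genes at random.

    Partners are shuffled among the existing pairs, so each gene keeps its own
    annotation and only the pairing is randomized.
    """
    rng = random.Random(seed)
    partners = [b for _, b in pairs]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(partners)
        rows = [(a, b, annot1.get(a, set()) & annot2.get(b, set()))
                for (a, _), b in zip(pairs, partners)]
        total += percent_shared(rows)
    return total / trials


# Toy cogslim rows: two of the four pairs share at least one slim id.
rows = [("Y1", "W1", ["SLIM_A"]), ("Y3", "W3", []),
        ("Y4", "W4", ["SLIM_B", "SLIM_C"]), ("Y5", "W5", [])]
print(percent_shared(rows))

# Toy annotations for the shuffled baseline.
toy_pairs = [("Y1", "W1"), ("Y3", "W3")]
toy_yeast = {"Y1": {"SLIM_A"}, "Y3": {"SLIM_B"}}
toy_worm = {"W1": {"SLIM_A"}, "W3": {"SLIM_B"}}
baseline = random_expectation(toy_pairs, toy_yeast, toy_worm)
print(baseline)
```

Shuffling preserves the annotation frequencies of both organisms, which makes the baseline comparable to the three observed bars.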

5. Explore the functional distribution of GO Slim annotations. Consider 5 distributions:
the distribution of GO Slim annotations in the original proteins of yeast and of worm (2 cases), and
the distribution of GO Slim annotations in the 3 cogslim tables (3 cases). Thus, for each GO Slim
annotation, you have its frequency in 5 distributions. Generate a file that summarizes the results
(tab-delimited). Show the results as a histogram. For each GO Slim class, present 5 joined bars.
(For GO id descriptions, download the generic GO id file from [4].)
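The summary file in step 5 can be assembled along the lines of the Python sketch below. It assumes that the frequency of a GO Slim id in a distribution is the fraction of entries (genes, or COG pairs) that carry it; other normalizations are possible. Only three of the five distributions are shown to keep the toy example short, and all names and values are invented for illustration.

```python
from collections import Counter


def slim_frequency(slim_sets):
    """Fraction of entries (genes or pairs) that carry each slim id."""
    entries = [set(ids) for ids in slim_sets]
    if not entries:
        return {}
    counts = Counter(slim_id for ids in entries for slim_id in ids)
    return {slim_id: n / len(entries) for slim_id, n in counts.items()}


def summary_table(named_distributions):
    """Build tab-delimited lines: one row per slim id, one column per distribution."""
    names = list(named_distributions)
    slim_ids = sorted(set().union(*(d.keys() for d in named_distributions.values())))
    lines = ["\t".join(["slim_id"] + names)]
    for slim_id in slim_ids:
        values = [f"{named_distributions[n].get(slim_id, 0.0):.3f}" for n in names]
        lines.append("\t".join([slim_id] + values))
    return lines


# Toy input: per-gene slim sets for each organism, and per-pair common ids.
distributions = {
    "yeast": slim_frequency([{"SLIM_A", "SLIM_C"}, {"SLIM_A", "SLIM_B"}]),
    "worm": slim_frequency([{"SLIM_A", "SLIM_B"}, {"SLIM_C"}]),
    "reciprocal": slim_frequency([["SLIM_A"], []]),
}
lines = summary_table(distributions)
print("\n".join(lines))
```

Each row of the resulting table maps directly onto one group of joined bars in the histogram.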

[1] “Bioinformatics for Dummies”, Chapter 7. Jean-Michel Claverie & Cedric Notredame, 2003.
[2] "Comparison of the complete protein sets of worm and yeast: Orthology and divergence."
Chervitz, S.A. et al. (1998) Science, 282, 2022-2028.
[3] "A genomic perspective on protein families". Tatusov, R.L., Koonin, E.V. and Lipman, D.J. (1997) Science, 278, 631-637.
