Laboratory of Bioinformatics Tools
Irit Gat-Viks
School of Computer Science, Tel-Aviv University
March 21, 2005

Laboratory of Bioinformatics Tools: Assignment No. 3
Due on May 9th


Pairwise alignment

You are given two human transcription factors: SP1 and EGR1.

1. Search for human SP1 and EGR1 in Swiss-Prot. What are the known features in the sequences? (Provide their locations.)

2. Run the following local pairwise alignment programs (with default parameters). Discuss the differences between the outputs, and compare them to your prior knowledge about the positions of the known features.
   a. LAlign (local alignment) at [5].
   b. PLalign (dot plot) at [5].
   c. PRSS/PRFX (significance by Monte Carlo) at [5].
   d. Bl2seq at NCBI.


Sequence-based search: BLAST hands-on

Following reference [1], explore the human ATM protein.

   a. Read p. 217-222 in [1] and BLAST your protein sequence. Explore the results in terms of paralogs and orthologs (hint: there are many human hits; what is ATR?).
   b. Change some of the BLAST parameters according to p. 225-228. Compare your results to the BLAST result from (a).
   c. PSI-BLAST your sequence according to p. 240-244. Compare your results to the BLAST result from (a) (e.g., compare the S. cerevisiae homolog hits).


BLAST all against all

In this exercise we will explore the relationship between 'comparative genomics' and 'functional genomics'. Given orthologous protein pairs between yeast and worm, we try to justify the commonly used paradigm that similar sequences are homologs and thus share the same structure and function.

* Make sure that all created files are tab-delimited.

1. Our first step is to identify orthologous pairs. A naive definition would simply maintain that, given a protein from one genome, the gene from the other genome with the highest sequence similarity is its ortholog. This technique is problematic. What is the problem? The COG database uses more sophisticated techniques. Read the introduction to the COG database [3]. How does COG identify orthology?

2. Following the COG reasoning, we define "COG pairs" as pairs of proteins that are each other's highest-similarity hit (reciprocal best hits). We analyze S. cerevisiae (yeast) and C. elegans (worm). For an optional but informative and interesting reference, see [2].

   Instructions:
   a. Use the FASTA files of each organism's proteome, which reside in /scratch/blast/BT05/ (worm.fasta and yeast.fasta).
      * To run BLAST against all of an organism's proteins, you need a BLAST database. A database is created from the command line, e.g.:
        >./formatdb -i yeast.fasta -o T -n yeast
        For this exercise, do not build the databases. Use the existing ones: /scratch/blast/BT05/worm and /scratch/blast/BT05/yeast.
   b. Run BLAST for all the yeast proteins against all the worm proteins, using the existing worm database. Simply run the following command line from /scratch/blast/:
        >/scratch/blast/blastall -p blastp -m 8 -d /scratch/blast/BT05/worm -i /scratch/blast/BT05/yeast.fasta -o /scratch/my_name/output_fn.results
      * Write your output file to your own scratch directory (not the blast directory). It is about 15 MB.
      * This run takes about 2 hours, so do not run it on nova during working hours. Alternatively, you can use one of the plab computers, such as plab-10. Just type ">ssh plab-10".
   c. Run BLAST again in the other direction, i.e., all the worm proteins against all the yeast proteins.
   d. Describe what each flag in the command line stands for.
   e. Finally, write the script create2cog.pl, which creates the COG pairs list. The script receives the two BLAST output files and the output file name:
        >perl create2cog.pl o1o2_blast_fn o2o1_blast_fn out_fn

3. Analyze the functional annotations of the genes in the COG pairs.

   Instructions:
   a. Download from [4], for each organism, the gene GO id annotation file ([6]). This file includes the GO functional annotations (GO ids) for each protein.
   b. Our aim is to get a broad overview of the ontology content without the detail of the specific fine-grained GO terms. We use GO slims, which are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. We therefore created for you a conversion table (in /scratch/blast/BT05/slim.table) that converts between original GO ids and GO slim ids. Its format is "go_id slim_id1 slim_id2 ..." (tab-delimited). Use slim.table and the GO annotation file from (3a) to create a GO slim annotation file (slim.annot), i.e., for each gene, the list of all its GO slim ids. For that, write the script create_slimgo_annot.pl:
        >create_slimgo_annot.pl slim.table in_go_annot_fn slim.annot
   c. Use the COG pairs list from (2e) and the slim.annot files from (3b) to create the 'cogslim_table', which gives, for each COG pair, the list of GO slim ids common to both members. For that, write the script create_cogslim_table.pl:
        >create_cogslim_table.pl cog2_fn o1_slim_table_fn o2_slim_table_fn cogslim_table
   d. Repeat the same analysis, but now with COG pairs defined naively as the best BLAST hit in one direction (and not reciprocal). To do so, create another version of create2cog.pl that creates one-directional COG pairs files (one for BLAST from yeast to worm, and one from worm to yeast). Applying steps 3a-3c to these files, we get two additional cogslim tables. In summary, we have 3 cogslim tables: one for reciprocal COG pairs, and two for one-directional COG pairs.

4. We now wish to analyze the percentage of COG pairs with identical functional annotation. Write the script generate_func_stat.pl, which receives a cogslim table and returns the % identity. Generate a 3-bar histogram (one bar per cogslim table). Discuss your results in terms of the main "similar sequence implies similar function" paradigm (what is the frequency expected at random? You can add it as a fourth bar), and the advantages and disadvantages of reciprocal hits vs. one-directional hits.

5. Explore the functional distribution of GO slim annotations. Consider 5 distributions: the distribution of GO slim annotations in the original proteins of yeast and of worm (2 cases), and the distribution of GO slim annotations in the 3 cogslim tables (3 cases). Thus, for each GO slim annotation, you have its frequency in 5 distributions. Generate a tab-delimited file that summarizes the results. Show the results as a histogram: for each GO slim class, present 5 adjacent bars. (For GO id descriptions, download the generic GO id file from [4].)


References:

[1] Jean-Michel Claverie & Cedric Notredame (2003). "Bioinformatics for Dummies", Chapter 7.
[2] "Comparison of the complete protein sets of worm and yeast: Orthology and divergence." Science 282: 2022-2028.
[3] Koonin, E.V. et al. (1997). "A genomic perspective on protein families." Science 278: 631-637.
[4] http://www.geneontology.org
[5] http://fasta.bioch.virginia.edu/fasta_www/home.html
[6] http://www.geneontology.org/GO.current.annotations.shtml
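
Illustration: reciprocal best hits

The "COG pairs" definition used in the all-against-all exercise can be sketched as follows. This is a minimal illustration in Python, not the Perl script the assignment asks for, and it rests on stated assumptions: the input is BLAST tabular output (-m 8), where columns 1, 2 and 12 hold the query id, subject id and bit score; "highest sequence similarity" is taken to mean the highest bit score; ties keep the first hit listed. The function names, the row() helper and the protein ids Y1, Y2, W1, W2 are hypothetical.

```python
import csv


def best_hits(lines):
    """Map each query id to the subject id of its highest bit-score hit.

    `lines` is any iterable of BLAST -m 8 (tabular) lines:
    12 tab-separated columns, query id first, subject id second,
    bit score last. Lines with fewer columns are skipped.
    """
    best = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) < 12:
            continue
        query, subject, bit_score = fields[0], fields[1], float(fields[11])
        # Strict '>' keeps the first hit listed when scores tie.
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_pairs(o1_to_o2, o2_to_o1):
    """Return (o1_id, o2_id) pairs that are each other's best hit."""
    return sorted(
        (a, b) for a, b in o1_to_o2.items() if o2_to_o1.get(b) == a
    )


def row(query, subject, bit_score):
    """Hypothetical helper: build one toy -m 8 line (middle columns are dummies)."""
    dummy = ["90.0", "100", "10", "0", "1", "100", "1", "100", "1e-20"]
    return "\t".join([query, subject] + dummy + [str(bit_score)])


# Toy data: two yeast proteins (Y1, Y2) and two worm proteins (W1, W2).
yeast_vs_worm = [row("Y1", "W1", 200), row("Y1", "W2", 50), row("Y2", "W1", 80)]
worm_vs_yeast = [row("W1", "Y1", 190), row("W2", "Y1", 60)]

yeast_best = best_hits(yeast_vs_worm)
worm_best = best_hits(worm_vs_yeast)
pairs = reciprocal_pairs(yeast_best, worm_best)

for yeast_id, worm_id in pairs:
    print(yeast_id, worm_id, sep="\t")
```

In the toy data, Y2's best worm hit is W1, but W1's best yeast hit is Y1, so Y2 and W1 form a one-directional pair only. The dictionary returned by best_hits() is the one-directional pairing; reciprocal_pairs() keeps only the pairs confirmed from both sides.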