Automated annotation of proteins
User’s guide
Version 1.00
2006-09-20
Weizhong Li
firstname.lastname@example.org

This documentation describes a high-throughput annotation system for proteins. Given a list of protein sequences, the system applies homology search programs, prediction programs, modeling programs, and other tools to obtain various annotations of the proteins. The annotations are stored in a MySQL database and presented through a web server. The system is highly automated and covers a comprehensive range of annotations. Its live update feature keeps the annotation data up to date. The web server offers an enhanced graphical interface for users to browse, retrieve, query, and search the annotations and proteins.

Contents:
1. Annotation system
   1.1. Annotation framework
   1.2. Features
2. Annotation resources and databases
   2.1. Databases
   2.2. Scripts that update databases
3. Annotation applications
   3.1. Annotation applications
   3.2. Annotation contents
   3.3. Annotations to be added
   3.4. Annotation summary
   3.5. Dependence and relations of databases and applications
   3.6. Scripts that run applications and load annotations to MySQL DBs
4. MySQL database
   4.1. Summary
   4.2. Common database: protdb1
   4.3. Annotation database
5. Web server
   5.1. Design and summary
   5.2. User’s guide
      5.2.1. Pull-down menu
      5.2.2. Cluster listing page
      5.2.3. Cluster querying page
      5.2.4. Cluster reporting page
      5.2.5. Protein listing page
      5.2.6. Protein querying page
      5.2.7. Protein reporting page
      5.2.8. Direct access to MySQL database
6. Developer’s guide

1. Annotation system

1.1. Annotation framework

The framework of the annotation system is outlined in Figure 1.1. It is composed of four major components: local databases, annotation applications, data server, and web server.

• Local databases are public databases such as the NCBI databases, PDB, Swissprot, Pfam, and so on. These databases are downloaded, processed, and stored locally on the file system and within the MySQL server. The annotation system needs these databases to perform annotation tasks such as running homology searches, retrieving literature information, and so on.
• Annotation applications are programs, either publicly available or developed locally, that perform various calculations and predictions. These applications cover a wide spectrum of bioinformatics tasks, and the annotation system is open to adding more software.
• Data server holds the annotation information. The majority of the annotation results are stored in a MySQL relational database. Some other data, such as 3D models, are stored as flat files on the file system. The MySQL server has two separate databases. Database protdb1 holds information about common local databases and resources, such as protein sequences in NR, which are needed by the annotation system and the web server. The second database keeps the actual annotation results of the input proteins.
• Web server is the primary interface for users to access the annotations. It is built with the Apache web server and Perl/CGI, and it talks to the MySQL server through DBD/DBI. The web server offers a graphical interface for users to browse, retrieve, query, and search the annotations and proteins. It may also allow users to perform calculations, such as running BLAST, or to provide hand annotation of proteins.

Figure 1.1. Annotation framework

1.2. Features

The annotation system has the following features.

• High degree of automation: most annotation processes, including local database maintenance, application launching, result parsing, and data loading into the databases, are automated by a set of scripts. Only a few processes, such as some configuration steps, need to be performed manually.
• Open environment: the system is designed so that new applications can be added easily. Starting from the existing templates, it is straightforward to add new programs to the annotation pipeline and to the web server.
• Live update: the system can be configured to update local databases on a regular basis, for example, a weekly update of PDB. The system can automatically re-run applications against newly downloaded databases, so that the annotations are up to date.
• Graphic interface: the web server provides an interface to present the annotations with several unique features.

2. Annotation resources and databases

2.1. Databases

Local databases include public databases and databases that are curated or compiled locally. The downloading and processing of databases are mostly automated. The resources are listed below:

• NR: the NCBI non-redundant protein database. NR is downloaded from the BLAST db folder of NCBI’s FTP site as a FASTA file. It is stored locally as a flat file, but it is not used in that form. The GI numbers, text descriptions, and sequences are stored in the MySQL database.
• NR90: NR clustered at 90% identity. This is one of the databases that BLAST and PDB-BLAST search against for homologs. Clustering from NR to NR90 is performed by cd-hit; the clustering information is stored in the MySQL database. NR90 is formatted into BLAST-ready format by formatdb.
• NR+: the database for sequence profile building. NR+ is NR90 plus some external sequences that are not included in NR, such as ORFs from unfinished genomes, sequences from environmental samples, and sequences from various metagenomics projects. This database is clustered by cd-hit at 90% identity and filtered by seg to mask low-complexity sequences. It is also formatted by formatdb into a BLAST-ready database.
• Swissprot sequences: the FASTA sequences from Swissprot. They are downloaded from the BLAST db folder of NCBI’s FTP site as a single FASTA file, which is stored locally as a single flat file but is not used in that form. The accession numbers, GI numbers, text descriptions, and sequences are stored in the MySQL database.
• Swissprot90: Swissprot clustered at 90% identity. This is one of the databases that BLAST and PDB-BLAST search against for homologs. Clustering from Swissprot to Swissprot90 is performed by cd-hit; the clustering information is stored in the MySQL database. Swissprot90 is formatted into BLAST-ready format by formatdb.
• Swissprot annotation: the annotation data of Swissprot proteins. It is downloaded from the Swissprot FTP site. It is parsed, and information such as PubMed IDs is stored in the MySQL database.
• PDB sequences: the FASTA sequences from PDB. They are downloaded from the BLAST db folder of NCBI’s FTP site as a single FASTA file, which is stored locally as a single flat file but is not used in that form. The PDB IDs, GI numbers, text descriptions, and sequences are stored in the MySQL database.
• PDBREF: processed PDB sequences. In this file, identical sequences are merged only if they are in the same PDB file. This is one of the databases that BLAST and PDB-BLAST search against for homologs. PDBREF is formatted into BLAST-ready format by formatdb.
• PDB info: other information about PDB structures. These are index files downloaded from the RCSB site, including authors, resolutions, descriptions, techniques, and so on. They are stored locally as flat files and also in the MySQL database.
• KEGG: sequences from the KEGG database. It is an optional database for homology search. It is downloaded from KEGG as a FASTA file and formatted by formatdb. If KEGG is used, the system can retrieve information about pathways.
• PubMed records: literature information linked to PDB structures and Swissprot sequences. In many cases, a PDB structure or a Swissprot record has one or more PubMed IDs. The corresponding PubMed records are all downloaded from NCBI PubMed.
• Pfam: the HMM models of the Pfam database. They are used by the program Hmmer to search for Pfam domains. They are downloaded from the Pfam site.
• FFAS profiles: pre-calculated sequence profiles for FFAS search. The FFAS team maintains and updates several FFAS profile databases, including PDB, Pfam, and COG.

2.2. Scripts that update databases

The local databases in the section above are downloaded and processed by a set of scripts. Most of these scripts can be run periodically by cron so that the databases are up to date. The details of these scripts are documented in the Developer’s guide, Appendix A.

3. Annotation applications

3.1. Annotation applications

The following programs, either publicly available or developed locally, are used in the annotation system.

• BLAST: the NCBI BLAST package. This includes BLASTP, PSI-BLAST, FORMATDB, and so on. BLAST programs are used for homology search.
• PDB-BLAST: a modified PSI-BLAST method. It is more efficient and sensitive than regular PSI-BLAST, because it builds sequence profiles from a filtered and clustered database, NR+. It consists of two PSI-BLAST calls. In the first run, against NR+, the sequence profile in the form of a PSSM is saved with option -C. This PSSM is then used to search against the target database in a second PSI-BLAST run with option -R.
• FFAS: profile-profile alignment program. It is more sensitive than PDB-BLAST, but slower. URL: http://ffas.burnham.org.
• Hmmer: hidden Markov model search. URL: http://hmmer.wustl.edu.
• CD-HIT: clustering program. It is used to create non-redundant datasets for NR, Swissprot, and other protein sequence databases. URL: http://cd-hit.org/.
• TMHMM: trans-membrane helices prediction. URL: http://www.cbs.dtu.dk/services/TMHMM.
• SignalP: signal peptide prediction. URL: http://www.cbs.dtu.dk/services/SignalP.
• Jnet: secondary structure prediction. URL: http://www.compbio.dundee.ac.uk/~www-jpred/jnet.
• Coils: coiled-coil prediction. URL: http://www.ch.embnet.org/software/COILS_form.html.
• Seg: low complexity segment identification. URL: ftp://ftp.ncbi.nlm.nih.gov/pub/seg/.
• ClustalW: multiple sequence alignment. URL: http://www.ebi.ac.uk/clustalw/.
• Modeller: 3D structure prediction by homology. URL: http://salilab.org/modeller/.
• SSpro: a package of several programs, including prediction of protein surface and others. URL: http://www.igb.uci.edu/tools/scratch/.

3.2. Annotation contents

The annotation contents are as follows:

• Coiled-coil regions are predicted by COILS.
• Low complexity regions of sequences are identified by SEG.
• Signal peptides are predicted by SignalP. SignalP has different models for eukaryotic, Gram-positive, and Gram-negative organisms. Because the source organism of an input protein may be unknown, the annotation system runs all three models for each sequence and produces three separate results.
• Trans-membrane helices are predicted by TMHMM.
• Protein secondary structures are predicted by Jnet. Jnet has several modes. The simplest and fastest is sequence-only mode. Jnet can also run on a sequence profile or an HMM model; this mode is more accurate, but it may be 100 times slower. The system can be configured to use either mode.
• Protein domains are obtained by running Hmmer against the Pfam library. The parameters for running Hmmer can be configured.
• Homologs are collected by three programs: BLASTP, PDB-BLAST, and FFAS. The parameters for running BLAST, such as e-values, can be configured. The combinations of databases and programs are:
   o PDBREF by BLASTP
   o Swissprot by BLASTP
   o NR by BLASTP
   o PDBREF by PDB-BLAST
   o Swissprot by PDB-BLAST
   o NR by PDB-BLAST
   o PDB by FFAS
   o Pfam by FFAS
   o COG by FFAS
   o More can be added
• Multiple sequence alignments are generated for an input protein and its homologs by ClustalW with the following steps:
   o Collect the top 30 hits from Swissprot by BLASTP, only the matched fragments, not full sequences.
   o Collect the top 20 hits from PDBREF by BLASTP, only the matched fragments.
   o Run ClustalW on the input sequence and the matched fragments.
• 3D structures are identified for input proteins if their sequences match sequences in PDB. The default cutoff for a match is 100% identity over 60 amino acids.
• 3D models are built for proteins that have homologs in PDBREF by BLASTP. Homologs need to meet these criteria: alignment >= 60 amino acids, e-value < 1e-6, sequence identity >= 20%, longest gap <= 12 amino acids, and total gaps <= 40 amino acids. These limits are set to prevent the model building processes from crashing. Models are built with the following steps:
   o Collect the homologs from PDBREF by BLASTP.
   o If a sequence has PDB structures, no model is built for that region.
   o If multiple PDB templates match the input protein at the same position, only one model is built, with the top match as the template.
   o Download the template PDB files from RCSB.
   o Feed the alignment from BLASTP to the Modeller program to build the actual 3D models.
• Literature is collected through homologs. For example, when a protein hits a Swissprot protein, and this Swissprot protein or its equivalents (the same protein with different IDs) have linked PubMed records, these PubMed IDs are collected. They are stored in two categories. Literature from hits that are 100% identical over 60 amino acids forms one group. Other literature is stored in another group, called homolog’s literature. Genome papers may be linked to many proteins but have little value for any specific protein, so they are removed in the pipeline.

3.3. Annotations to be added

The annotation system is designed to be open, so new applications can be plugged in easily. If new programs that provide useful annotations become available, they can be added to the annotation pipeline with minimal effort, such as adding the corresponding MySQL tables and implementing wrapper scripts. Here is a list of applications that may be added in the future.

• Prediction of disordered / flexible regions of proteins.
• Prediction of surface / buried regions of proteins.
• Prediction of metal binding sites.
• Prediction of protein localization.
• Matches to the Prosite database.
• Phylogenetic information.

3.4. Annotation summary

An annotation summary is a short text description of the input protein that combines all the annotations.
It can be generated on the fly on the web server.

3.5. Dependence and relations of databases and applications

Annotations in this system can be divided into three groups.

The first group of annotations is predicted or calculated directly from sequences alone. The corresponding applications include COILS, SEG, SignalP, TMHMM, and Jnet (sequence-only mode). These annotations need to be calculated just once.

The second group is homolog or domain information identified by the programs Hmmer, BLASTP, PDB-BLAST, and FFAS. These annotations rely on the target databases, so they need to be recalculated once the target databases are updated.

The third group of annotations is based on the homolog information and includes PDB structures, 3D models, multiple alignments, and literature. They depend on the homologs, so they need to be reprocessed once the homologs have been changed or updated.

3.6. Scripts that run applications and load annotations to MySQL DBs

The annotation system includes wrapper scripts to run each program and load the results into the MySQL server. The scripts can be run on a workstation or a computer cluster. When the scripts run on a cluster, all the nodes of the cluster must be able to access a common file system, for example through an NFS mount. The actual jobs can be submitted to computing nodes by ssh calls or by any queuing system such as SGE.

Some scripts just run the applications, some just load results into the MySQL server, and some do both running and loading. This mostly depends on the application. For example, the program SEG runs very fast and may take just minutes for millions of proteins, so there is no need for distributed computing. In that case, the script runs the application on a workstation or on a single node and then loads the data into the MySQL server. Time-consuming applications, such as finding homologs, need to be run in parallel mode.
In that case, running and loading are separate processes, so there are two separate sets of scripts, one for running and one for loading. The detailed usage of these scripts is documented in the Developer’s guide.

4. MySQL database

4.1. Summary

The MySQL server has two separate databases. Database protdb1 holds information about local databases and resources. The second database keeps the actual annotation results of the input proteins. The two are kept separate because the contents of protdb1 are independent of the actual annotation data. The separation of protdb1 from the annotation database also makes it easy to support multiple annotation databases. This relies on MySQL’s ability to query across multiple databases.

4.2. Common database: protdb1

Database protdb1 holds information about local databases and resources. Its tables are updated by a set of scripts. For example, when NR is updated, the information about IDs, descriptions, and clusters is also updated in the corresponding tables in protdb1. Protdb1 has the following tables:

• db: databases and resources in annotation system
• db_history: database update history
• nr: basic information about NR
• nr_clstr: clusters of NR90
• nr_des: descriptions of NR proteins
• nr_seq: sequences of NR proteins
• pdb_info: information about PDB sequences
• pdb_pubmed: pubmed records of PDB sequences
• pdbref_clstr: clusters of PDB
• pdbref_des: descriptions of PDB proteins
• pdbref_seq: sequences of PDB proteins
• pubmed: pubmed records
• pubmed_sw: pubmed records of swissprot sequences
• sw: information of swissprot
• sw_acc2id: mapping table from accession number to ID
• swissprot_clstr: clusters of swissprot90
• swissprot_des: descriptions of swissprot proteins
• swissprot_seq: sequences of swissprot proteins

The detailed schema of protdb1 is documented in the Developer’s guide, Appendix C.

4.3. Annotation database

Most tables in the annotation database have prot_id as their primary key. It serves as a common key across tables, although foreign keys are not used in this system. If the input proteins form clusters or families, table cluster holds the clustering information. The tables are listed below:

• cluster: cluster or family information.
• prot: protein information.
• prot_cluster: mapping table from prot_id to cluster_id.
• prot_seq: sequences.
• prot_xxx (xxx can be coil, jnet, seg, signalp_euk, signalp_gram0, signalp_gram1, tmhmm): tables that hold sequence features.
• prot_xxx_yyy (xxx can be blastp, pdbblast, ffas, hmmer; yyy can be nr, swissprot, pdbref, pfam, COG, etc.): tables that hold homologs or hmmer hits.
• prot_xxx_yyy_aln: tables that hold alignments to homologs.
• prot_stru_pdb: 3D structures in PDB.
• prot_stru_model: 3D models built.
• prot_ref: literature.
• prot_homo_ref: literature of homolog proteins.
• prot_homo_save: homologs that provide literature.
• prot_homo_save_aln: alignments for the above table.
• prot_tbl_summary: summary of all tables.

The detailed schema of the annotation database is documented in the Developer’s guide, Appendix D.

5. Web server

5.1. Design and summary

The web server is the primary interface for users to access the annotations. It displays the annotation results through a graphical interface within a user’s browser. A user can browse, retrieve, query, and search all the annotations and proteins. The web server is built with the Apache web server plus Perl/CGI, and it talks to the MySQL server through DBD/DBI.

Most users can access the annotation data through the pull-down menu and the hyperlinks on the pages. The server also gives users direct access to the MySQL server by allowing SQL queries. In that case, a user needs to know the database schema (see Developer’s guide, Appendix C and D).

The server has several unique features for presenting the annotation data in a graphic interface.
For example, it can align several features together, such as secondary structure, trans-membrane helices, and so on.

5.2. User’s guide

The homepage of an annotation server is shown in Figure 5.1. The pull-down menu at the top links to the various functions of the server. Usually, the right-hand column of the text area also holds some shortcuts to annotation data.

Figure 5.1. Homepage of an annotation web server

5.2.1. Pull-down menu

The pull-down menu is a multi-level menu. It has direct links to all the functions of the server. Users can also perform a simple search within the pull-down menu by entering, for example, a protein ID. The pull-down menu can differ between annotation datasets.

• Home: go to the homepage.
• Protein:
   o List: list all the proteins in the annotation database.
   o Report: report a single protein.
   o Graphic: report a single protein with a graphic viewer.
   o Query: query proteins.
   o Structures:
        PDB: list all the 3D structures.
        Model: list all the 3D models.
   o Reference: list all the references.
• Families:
   o List: list all clusters / families.
   o Report: report a single cluster.
   o Query: query clusters.
• Search:
   o BLAST: run a BLAST search.
   o SQL query: direct query of the MySQL server with SQL commands.
   o From my protein: search the annotation database with the IDs of the user’s proteins.
• Document: various documentation.
• Download: links to the download area.

5.2.2. Cluster listing page

The cluster listing page lists all the clusters or families in the database (Figure 5.2). Users can go to each cluster reporting page by clicking the cluster ID. On this page, users can choose different output formats or change the contents of the display. Depending on the content, the output may be an HTML document, a text document, a tab-delimited file for MS Excel, or files for download. For example, if a user chooses MS Excel, the output is sent automatically to the Excel program on the user’s desktop.
With the customize button, a user can show new columns, hide existing columns, or sort the data.

Figure 5.2. Listing clusters

5.2.3. Cluster querying page

The cluster querying page provides a simple interface for a user to search for clusters or families by entering one or more conditions (Figure 5.3). Multiple conditions are combined with a logical “AND”. The clusters that meet the conditions are displayed as on the cluster listing page.

Figure 5.3. Querying clusters

5.2.4. Cluster reporting page

The cluster reporting page displays information about a cluster (Figure 5.4). A user can go to the protein reporting page by following the protein ID of the representative sequence. A summary of the functions of the cluster will be added to this page in the near future.

Figure 5.4. Reporting a cluster

5.2.5. Protein listing page

Similar to the cluster listing page, the protein listing page lists all the proteins (Figure 5.5). Users can go to each protein reporting page by clicking the protein ID. Users can configure the output formats and contents on this page in the same way as on the cluster listing page.

Figure 5.5. Listing proteins

5.2.6. Protein querying page

Similar to the cluster querying page, this page lets a user search for proteins by entering one or more conditions.

5.2.7. Protein reporting page

The protein reporting page is the central place that presents all the annotation features of a single protein (Figure 5.6). There is a menu bar on the left side. A user can get each annotation item by clicking the corresponding link. The number of records, such as the number of homologs from PDB, is also shown in the left menu.

Figure 5.6. Protein reporting page

Below is a list of screenshots of protein reporting pages. Please see the figure captions for details.

Figure 5.7. Graphical display of protein features
This is a graphic viewer of a protein. Features can be turned on and off. The protein can be zoomed in and out. The image can be adjusted to different sizes. Links are available on the map.

Figure 5.8. Display of a 3D model
This page displays a 3D model. It shows the alignment, the 3D model, and the PDB template. The graphics of 3D structures are powered by Jmol.

Figure 5.9. Display of Pfam domains

Figure 5.10. Display of multiple features
Multiple features are aligned so that users can view them together and combine different sources of information.

Figure 5.11. Display of multiple sequence alignment
This page displays the multiple alignments. Several options on this page allow users to configure the alignments, such as changing the coloring scheme or adjusting the wrapping length.

Figure 5.12. Homolog listing page
This page lists the homologs of the protein. The alignment regions are shown.

Figure 5.13. Alignment page
This page displays a colored alignment. The ruler showing the actual residue numbers is a unique feature.

Figure 5.14. Literature listing page
This page lists the literature of the protein. It offers several functions, such as searching and highlighting literature.

5.2.8. Direct access to MySQL database

For users with relational database experience, the web server provides an interface that accepts MySQL commands directly. The database schema is documented in Appendix C and D of the Developer’s guide. This feature can be reached from menu bar -> Search -> SQL query. Users can use show, describe, and select commands there. A list of sample SQL commands is also given on that page.

6. Developer’s guide

6.1. Annotation

The Developer’s guide is in another document.
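
As a small illustration for developers, the template criteria for 3D model building listed in section 3.2 can be expressed as a simple filter. The following is a minimal Python sketch and not the code used by the annotation system: the Hit class, its field names, the function names, and the example IDs are all hypothetical and introduced only for illustration. Only the thresholds are taken from section 3.2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One BLASTP hit against PDBREF (hypothetical structure, for illustration only)."""
    template_id: str     # made-up identifier of the PDB template
    aln_length: int      # alignment length in amino acids
    evalue: float        # BLASTP e-value
    identity_pct: float  # sequence identity in percent
    longest_gap: int     # longest single gap in amino acids
    total_gaps: int      # total gaps in amino acids


def passes_model_criteria(hit: Hit) -> bool:
    """Return True if the hit meets all thresholds listed in section 3.2."""
    return (
        hit.aln_length >= 60
        and hit.evalue < 1e-6
        and hit.identity_pct >= 20.0
        and hit.longest_gap <= 12
        and hit.total_gaps <= 40
    )


def select_templates(hits: List[Hit]) -> List[Hit]:
    """Keep only the hits that qualify as modeling templates, preserving order."""
    return [h for h in hits if passes_model_criteria(h)]


if __name__ == "__main__":
    hits = [
        Hit("tmplA", 120, 1e-30, 45.0, 5, 10),  # meets every criterion
        Hit("tmplB", 80, 1e-10, 35.0, 20, 25),  # longest gap exceeds 12
    ]
    print([h.template_id for h in select_templates(hits)])  # prints ['tmplA']
```

Keeping the thresholds in one function makes it easy to see which criterion rejects a hit and to adjust a cutoff without touching the rest of a pipeline.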