Embed
Email

A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH

Document Sample

Shared by: yurtgc548
Categories
Tags
Stats
views:
4
posted:
12/1/2011
language:
English
pages:
92
A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH



FOR THE PREDICTION OF PROTEIN FUNCTIONS









A Thesis



Presented to



The Graduate Faculty of The University of Akron









In Partial Fulfillment



of the Requirements for the Degree



Master of Science









Saket Kharsikar



August 2007

A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH



FOR THE PREDICTION OF PROTEIN FUNCTIONS









Saket Kharsikar







Thesis







Approved: Accepted:



____________________ ____________________

Advisor Department Chair

Dr. Dale H. Mugler Dr. Daniel B. Sheffer





____________________ ____________________

Co-Advisor Dean of the College

Dr. Zhong-Hui Duan Dr. George K. Haritos





____________________ ____________________

Committee Member Dean of the Graduate School

Dr. Francisco Moore Dr. George R. Newcome





____________________

Date







ii

ABSTRACT





Numerous genome projects have produced a large and ever increasing amount of



genomic sequence data. However, the biological functions of many proteins encoded by



the sequences remain unknown. Protein function annotation and prediction become an



essential and challenging task of post-genomic research. In this research, we present an



automated protein function prediction system based on a set of proteins of known



biological functions. The functions of the proteins are characterized with Gene Ontology



(GO) annotations. The prediction system uses a novel measure to calculate the pair-wise



overall similarity between protein sequences. The protein function prediction is



performed based on the GO annotations of similar sequences using a weighted k-nearest



neighbor method. We show the prediction accuracies obtained using the model organism



yeast (Sacchyromyces cerevisiae). The results indicate that the weighted k-nearest



neighbor method significantly outperforms the regular k-nearest neighbor method for



protein biological function prediction.









iii

ACKNOWLEDGEMENTS





First and foremost, I would like to thank my advisor, Dr Dale Mugler for his



constant support and co-operation all through my Master’s Thesis.



I would also like to thank my co-advisor, Dr Zhong-Hui Duan for giving me the



opportunity to work on this research project and her invaluable inputs in the entire course



of the project. This thesis would not have been possible without her guidance and



persistent help.



A special thanks to my committee- Dr Daniel Sheffer and Dr Francisco Moore for



their time and effort and especially for their invaluable suggestions in biostatistics.



I would like to take a chance to thank my project group Gayatri Paknikar,



Santhosh Tadavai, Neeraj Mogla and Satish Sangem for their help in one of the modules



of my thesis.



Lastly, I would like to express my gratitude towards my parents and all my



friends for their faith and who were always there for me all through the progress of my



thesis and eventually my degree.









iv

TABLE OF CONTENTS

Page



LIST OF TABLES........................................................................................................... viii



LIST OF FIGURES ........................................................................................................... ix



CHAPTER



I. INTRODUCTION ......................................................................................................1



1.1 Proteomics............................................................................................................1



1.2 New Biomedical Opportunities............................................................................3



1.2.1 Drug Development.................................................................................4



1.2.2 Infectious Diseases.................................................................................4



1.2.3 Heart Disease .........................................................................................5



1.2.4 Cancer ....................................................................................................6



1.2.5 Neurological Disorders ..........................................................................6



1.3 Objectives of the project ......................................................................................6



1.4 Hypothesis............................................................................................................7



II. LITERATURE REVIEW ...........................................................................................8



2.1 Protein function prediction ..................................................................................8



2.1.1 Existing approaches for function prediction ..........................................9



2.2 Sequence Alignment and Similarity ..................................................................12



2.2.1 Global Alignment.................................................................................13





v

2.2.2 Local Alignment ..................................................................................13



2.2.3 Semi-global Alignment........................................................................13



2.2.4 Methods for alignment.........................................................................14



2.3 Gene Ontology ...................................................................................................17



2.3.1 Formation of GO..................................................................................17



2.3.2 Three Categories of Gene Ontology ....................................................18



2.3.3 Standardized format .............................................................................22



2.3.4 GO is the way to go .............................................................................22



2.4 Data Classification .............................................................................................23



2.5 Microarrays and Gene Expression Data ............................................................24



2.5.1 Microarray technology.........................................................................24



2.5.2 Gene Expression Data..........................................................................28



2.6 Data Clustering ..................................................................................................31



2.6.1 Hierarchical clustering: agglomerative ................................................32



2.6.2 Hierarchical clustering: divisive ..........................................................33



2.6.3 Non-hierarchical clustering..................................................................34



2.6.4 Predicting protein interactions and protein functions ..........................36



III. MATERIALS AND METHODS..............................................................................38



3.1 Data set...............................................................................................................38



3.1.1 Complete yeast proteome.....................................................................38



3.1.2 GO Annotations ...................................................................................44



3.1.3 Yeast Gene Expression data.................................................................46



3.2 Data pre-processing ...........................................................................................48



vi

3.3 Sequence similarity scoring ...............................................................................49



3.3.1 BLAST Algorithm ...............................................................................49



3.4 Classification......................................................................................................51



3.5 Gene Expression Correlation and Clustering.....................................................54



3.6 A Novel Approach .............................................................................................56



IV. RESULTS AND DISCUSSIONS.............................................................................58



V. CONCLUSIONS.......................................................................................................71



REFERENCES .........................................................................................................73



APPENDIX IMPLEMENTATION OF WEIGHTED KNN ....................................77

CLASSIFICATION ALGORITHM









vii

LIST OF TABLES



Table Page



2.1 Representation of gene expression data ...................................................................29



3.1 Line types and line codes for Swiss-Prot sequence entry .........................................42



3.2 Yeast gene expression data .......................................................................................47



4.1 P-values for test of percentage accuracies ................................................................64



4.2 Cluster details for 4 X 3 clustering scheme ..............................................................67



4.3 Cluster details for 2 X 3 clustering scheme ..............................................................68



4.4 Cluster details for 2 X 5 clustering scheme ..............................................................69









viii

LIST OF FIGURES



Figure Page



1.1 Growth of GenBank database over the period 1982 – 2005.......................................2



2.1 Examples showing global and local alignment between two sequences ..................12



2.2 Cellular Component GO nodes.................................................................................20



2.3 Biological Process GO nodes....................................................................................20



2.4 Molecular Function GO nodes..................................................................................21



2.5 A schematic representation of a microarray chip .....................................................26



2.6 Schematic for the experimental setup for any microarray gene ..............................27

expression analysis



2.7 Various clustering methods.......................................................................................32



2.8 Self-Organizing Maps...............................................................................................35



3.1 BLAST Algorithm ....................................................................................................49



3.2 The graphical representation of a kNN Classifier algorithm with k=5.....................53



3.3 Schematic for the novel approach.............................................................................56



4.1 Numbers of GO groups at different levels of the gene ontologies ...........................59



4.2 Average size of GO groups at different levels of the ontologies..............................59



4.3 GO Tree for molecular function ontology ................................................................61



4.4 Percentage Accuracy at first level.............................................................................63



4.5 Percentage Accuracy at second level........................................................................63





ix

4.6 Percentage Accuracy at third level ...........................................................................64



4.7 GeneCluster – Web tool for clustering based on SOM.............................................65



4.8 12 clusters from a 4 X 3 clustering scheme ..............................................................66



4.9 Percentage accuracy for 4 X 3 clustering scheme ....................................................67



4.10 Percentage accuracy for 2 X 3 clustering scheme ....................................................68



4.11 Percentage accuracy for 2 X 5 clustering scheme ....................................................69









x

CHAPTER I



INTRODUCTION





Biomedical research over the last couple of years has made tremendous progress



in our understanding of biology and medicine. The recent genomic sequencing of human,



mouse and other organisms and other high-throughput studies based on microarray



technology have been generating massive amounts of sequence data. Faced with the



avalanche of genomic sequences and gene expression data, biological scientists are



confronting a challenging prospect: piles of information but only flakes of knowledge.







1.1 Proteomics



Proteins are the major components of living organisms and constitute more than



25% by dry weight of a typical cell. These proteins perform a variety of functions like



catalysis, immune recognition, cell adhesion, signal transduction, sensory capabilities,



transport, movement and cell organization. Chemically, proteins are linearly oriented



hetero-polymers of amino acids; determined by the sequence of nucleotides in the



corresponding gene. Proteins are the main catalysts, structural elements, signaling



messengers and molecular machines of the biological tissues. Functional proteomics



targets learning the functions of protein molecules. All the primary functional knowledge



about the proteins comes from some kind of biochemical, genetic or structural laboratory





1

experiment on an individual protein. But with the explosion of protein sequence



information and an outburst of the amount of new proteins being sequenced and isolated,



it is not feasible to perform experiments on each and every protein individually. Hence it



is necessary to develop an expansive system for the functional characterization of the



proteins. Scientists are looking at other computational approaches to tackle the problem.









Fig. 1.1 Growth of GenBank database over the period 1982 – 2005

Courtesy GenBank Statistics at National Center for Biotechnology Information (NCBI)





There have been some computational approaches already suggested to predict



functions using the ever increasing abundance of sequence data produced by high-



throughput means like mass spectrometry and microarray gene expression profiling. The



major challenge for biologists is to use this genetic information available from the



genome sequencing projects - not just to decode the amino acid sequences but also to find



2

out their functions. Proteomics based approaches initially use computer-based similarity



searches against proteins of known function and the results may allow some broad



inferences to be made about the possible functions, which can then be explored



experimentally.



Once a function of an individual protein is found and has been assigned to it, one



can search for other proteins whose amino acid sequences are similar to that of the



original protein whose function is already known. This approach is known as the



homology method and is widely used to extend the knowledge of protein function from



one protein to its cousin proteins, which are presumed to be descended from the same



common ancestral protein [1]. In sequence similarity based approaches, the underlying



assumption is that sequence similarity implies functional similarity. This assumption is



true in many cases. This makes it a popular and widely used technique although it has



been noted that matching sequences do not always infer similar functions and also that



sometimes proteins with dissimilar sequences do possess similar functions.







1.2 New Biomedical Opportunities



The greatest influence of proteomics-based approaches on biomedical research



has not yet been encountered; however, new progress is being achieved. Such approaches



offer a potential in resolving the complex biological intricacies such as the nature of



particular molecular complexes or pathways in disease pathogenesis [2].









3

1.2.1 Drug Development



Drug development is generally associated with the desire to enhance or diminish a



specific activity involved in disease pathogenesis or treatment-involved side-effects.



Most of the drugs employ their effects on the proteins. The strategy of working forward



from the gene is used wherein a specific genetic lesion is identified and the consequent



modifications in protein structure, function and expression are clarified, so that a drug



can be rationally designed to compensate or fix such aberrations. The challenge in



proteomics-based approaches still remains in identifying the target molecules. Cell-



mapping proteomics and expression proteomics are some of the tools used for target



analysis. In addition to target selection, target validation and toxicity studies are also



important.







1.2.2 Infectious diseases



Identification of proteins produced by microorganisms is aided by the small



number of genes and the completion of genome sequencing for many microorganisms.



The aim for these studies has been the search for new diagnostic markers, candidate



antigens for vaccines and determination of virulence. Myobacterium tuberculosis is



probably the most studied microorganism; causing tuberculosis and leading to more than



2 million deaths each year [3]. So an international priority is to improve diagnosis and



develop a more effective vaccine for the same. The findings from the experiments



involving examination of genetic differences between virulent M tuberculosis, virulent M



bovis and the avirulent M bovis BCG are not only useful in identifying potential







4

regulators of virulence but may also provide the basis for discriminating BCG



immunization and presence of virulent tuberculosis [3].







1.2.3 Heart disease



Heart disease, resulting from systemic disease or specific heart-muscle disease, is



one of the leading causes of morbidity and mortality in more developed countries. The



pathogenesis of cardiac dysfunction is still considerably unknown. But a proteomics-



based approach in identifying overall changes in protein expression in heart disease and



heart failure may provide new insights into the cellular mechanisms involved in cardiac



dysfunction. These approaches will further constitute new diagnostic markers and



therapeutic opportunities. The proteomic approach so far has been on dilated



cardiomyopathy, which certainly has multiple causes. The combined results of all these



studies [4-6] have shown that the expression of some 100 cardiac proteins is significantly



different from normal in dilated cardiomyopathy and most of the proteins are less



abundant in the diseased than the normal heart. Most of the identified proteins can be



grouped into three broad functional categories: cytoskeletal and myofibrillar proteins,



proteins associated with mitochondria and energy production, and proteins associated



with stress responses. In addition to the global proteomic approach, cardiac antigens that



elicit specific antibody responses in vivo are being identified. Several of these antigens



may be involved in processes of acute and chronic rejection after cardiac transplantation



and are currently being investigated as potential non-invasive markers [7].









5

1.2.4 Cancer



Although proteomics-based studies are chiefly concerned with the identification



of novel antigens or markers for prognostic, diagnostic or therapeutic use, molecules and



processes implicated in carcinogenesis are also being investigated increasingly. Most



tumor markers in current use were identified using protein-based approaches dating back



to the 1800s. Proteomics-based studies of many tumor types are now underway and they



are likely to benefit from the use of complimentary techniques such as laser capture



micro-dissection to isolate malignant cells for electrophoretic analysis, thus facilitating



the discovery of tumor-specific proteins [8].







1.2.5 Neurological disorders



Alzheimer’s disease is the most common single cause of dementia. Amino acid



sequencing and identification of common protein fragments form the basis for cloning of



the Amyloid Precursor Protein (APP) gene- a component of the plaque found in the



brains of Alzheimer’s disease patients [9, 10]. Many studies of APP have used two-



dimensional electrophoretic and mass spectrometric techniques and is likely to remain a



principal approach in the future.







1.3 Objectives of the project



1. To study the correlation between protein sequence similarity and function



similarity for the yeast proteome in the context of molecular function ontology.



2. To study the accuracy of function prediction in case of the kNN and the weighted



kNN classifier algorithms.



6

3. To study the relation between gene expression patterns and the functions of the



proteins using microarray data.



4. To design a novel approach by combining the sequence similarity score and the



gene expression correlation coefficient to predict the protein function.



5. To study the effect of clustering of gene expression data on the performance of



the weighted kNN classifier.







1.4 Hypothesis



Null Hypothesis:



The means of the percentage prediction accuracies for the classical kNN and the means



for the weighted kNN classification algorithm calculated for different parameter settings



are not significantly different, i.e. μ1 = μ2



The performance of the classical kNN classification algorithm based on the prediction



accuracy is not significantly different from that of the weighted kNN classification



algorithm.







Alternate Hypothesis:



The means of the percentage prediction accuracies for the classical kNN and the means



for the weighted kNN classification algorithm calculated for different parameter settings



are significantly different, i.e. μ1 ≠ μ2



The performance of the classical kNN classification algorithm based on the prediction



accuracy is significantly different from that of the weighted kNN classification algorithm.







7

CHAPTER II



LITERATURE REVIEW





2.1 Protein Function Prediction



The early approaches to predicting protein function were experimental and



usually focused on a specific target gene or protein, or a small set of proteins forming



natural groups such as protein complexes. However, irrespective of the details, these



approaches are low-throughput because of the huge experimental and human effort



required in analyzing a single gene or protein. This has resulted in a continually



expanding sequence-function gap for the discovered proteins. In an attempt to close this



gap, numerous high-throughput experimental procedures have been invented to



investigate the mechanisms leading to the accomplishment of a protein’s function [11].



These procedures have generated a wide variety of useful data that ranges from simple



protein sequences to complex high-throughput data. These data offer different types of



insights into a protein’s function and related concepts. Furthermore, recent years have



seen the recording of this data in very standardized and professionally maintained



databases; some of them listed under.



1. GenBank at the National Center of Biotechnology Information, National Library



of Medicine, Washington D.C. accessible from



http://www.ncbi.nlm.nih.gov/Entrez





8

2. European Molecular Biology Laboratory (EMBL) Outstation at Hixon, England



http://www.ebi.ac.uk/embl/index.html



3. DNA Databank of Japan (DDBJ) at Mishima, Japan



http://www.ddbj.nig.ac.jp/



4. Protein International Resource (PIR) database at the National Biomedical



Research Foundation in Washington D.C., an annotated database



http://www-nbrf.georgetown.edu/pirwww/



5. The SwissProt protein sequence database at ISREC, Swiss Institute for



Experimental Cancer Research in Epalinges/Lausanne, an annotated database



http://www.expasy.ch/cgi-bin/sprot-search-de



6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute



http://srs6.ebi.ac.uk







2.1.1 Existing approaches for function prediction



In the domain of automated function prediction, sequences have been heavily



utilized, in both direct homology-based and indirect subsequence and feature-based



approaches. Specifically, techniques that predict protein function from sequence can be



categorized into three classes, namely, sequence homology based approaches,



subsequence based approaches and feature based approaches.



Sequence homology-based approaches:



In homology-based approaches, proteins are said to have a common ancestry if they have



a high similarity score for their amino acid sequences. Sequence alignment is performed



to get a similarity score and a prediction model is built using these entire sequences.



9

Subsequence-based approaches:



Many times not the entire protein sequences, but only some segments of it are



important for determining the function of a given protein. Consequently, the approaches



in this category treat these segments or subsequences as features of a protein sequence



and construct models for the mapping of these features to protein function. These models



are then used to predict the function of a query protein.







Feature-based approaches:



The feature-based approaches attempt to exploit the perspective that the amino



acid sequence is a unique characterization of a protein, and determine several of its



physical and functional features. These features are used to construct a predictive model



which can map the feature-value vector of a query protein to its function.







There are some computational approaches that have been developed to use the



sequence data for protein function prediction. These methods were tested on the



Saccharomyces Genome Database and also incorporated the Gene Ontology (GO)



annotations for prediction.



Cluster analysis of the gene-expression profiles is a common approach for



predicting functions based on the assumption that genes with similar functions are likely



to be co-expressed. Pavlidis et al. [12] considered the problem of inferring gene



functional classifications from heterogeneous datasets consisting of DNA microarray



expression measurements and phylogenetic profiles from the whole genome sequence



comparisons. They demonstrated the application of the support vector machines (SVM)



10

learning algorithm to this functional inference task. Palvidis et al. used a collection of



DNA microarray hybridization experiment data for 2465 yeast genes along with a



phylogenetic profile for each gene and demonstrated the use of SVM learning algorithm



for heterogeneous data..



Using protein-protein interaction data to assign function for novel proteins is



another approach. Proteins often interact with one another in an interaction network to



achieve a common objective. It is therefore possible to infer the functions of proteins



based on the functions of their interaction partners, also known as “guilt by association”.



Schwikowski et al. [13] applied a neighbor-counting method in predicting the function.



They assigned function to an unknown protein based on the frequencies of its neighbors



having certain functions.



Hishigaki et al. [14] enhanced the performance of the neighbor counting method



based on the protein interaction networks by using χ2 statistics values instead of the raw



number of the neighboring members. Both these approaches give equal significance to



all the functions contributed by the protein neighbors in the interaction network.



Other function prediction methods using high-throughput data include machine-learning



approach and Markov random fields. Clare and King [15] used the C4.5 machine learning



algorithm which was adapted to the analysis of phenotype. The output of the algorithm is



a decision tree or equivalently a set of symbolic rules which allow the output to be



interpreted and compared with existing biological data.



Letovsky and Kasif [16] developed a method of assigning functions based on a



probabilistic analysis of graph neighborhoods in a protein-protein interaction network.



The method exploits the fact that graph neighbors are more likely to share functions than



11

nodes which are not neighbors and a binomial model of local neighbor function labeling



probability is combined with a Markov random field propagation algorithm to assign



function probabilities for proteins in the network.



MAGIC (Multisource Association of Genes by Integration of Clusters) approach



to combine heterogeneous data for function assignment has been applied in yeast by



Troyanskaya et al. [17]. MAGIC is a general framework that uses formal Bayesian



reasoning to integrate heterogeneous types of high-throughput biological data for



accurate gene function prediction.







2.2 Sequence Alignment and Similarity



Sequence alignment is the procedure of comparing two (pair-wise alignment) or



more (multiple sequence alignment) DNA or protein sequences by searching for a series



of individual characters or character patterns that are in the same order in the sequences.



In an optimal alignment, non-identical characters and gaps are placed to bring as many



identical or similar characters as possible into the vertically matching registers.



Sequences that can be readily aligned in this manner are said to be similar.



There are two main types of pair-wise sequence alignments- global and local.









Fig. 2.1 Examples showing global and local alignment between two sequences

Courtesy NCBI BLAST [40].



12

2.2.1 Global Alignment



The global alignment is stretched over the entire sequence length to include as



many amino acids as possible up to and including the sequence ends. A global alignment



is made possible by including gaps either within the middle of the alignment or at either



end of one or both sequences. A global alignment might misalign some areas so that more



amino acids along the entire sequence length can be matched. Sequences that are quite



similar and approximately the same length are suitable candidates for global alignment.



The Needleman-Wunsch algorithm is used to produce global alignments between pairs of



DNA or protein sequences.







2.2.2 Local Alignment



In a local alignment, the alignment stops at the ends of regions of strong similarity



and a much higher priority is given to finding these local regions than to extending the



alignment to include more neighboring amino acid pairs. This type of alignment favors



finding conserved nucleotide patterns in DNA sequences or amino acid domains in



protein sequences. Local alignments are more suitable for aligning sequences that are



similar along some of their lengths but dissimilar in others, sequences that differ in length



or sequences that share a conserved region or domain. The Smith-Waterman algorithm is



used to produce local alignments between pairs of DNA or protein sequences.







2.2.3 Semi-global Alignment



Hybrid methods, known as semi-global or "glocal" methods, attempt to find the



best possible alignment that includes the start and end of one or the other sequence. This



13

can be especially useful when the downstream part of one sequence overlaps with the



upstream part of the other sequence. In this case, neither global nor local alignment is



entirely appropriate: a global alignment would attempt to force the alignment to extend



beyond the region of overlap, while a local alignment might not fully cover the region of



overlap.







Sequence alignment is useful for discovering functional, structural and



evolutionary information in DNA or protein sequences. It is important to obtain the best



possible, so-called “optimal” alignment to discover this information. Sequences that are



very much alike, probably have the same function, be it a regulatory role in the case of



similar DNA molecules or a similar biochemical function and three-dimensional structure



in the case of proteins. Also, if two sequences from different organisms are similar, there



may have been a common ancestor sequence; the sequences are then defined as being



homologous. Their alignment indicates the changes that could have occurred between the



homologous sequences and a common ancestor sequence during evolution. With the



advent of genome analysis and large scale sequence comparisons it becomes important to



recognize that sequence similarity may be an indicator of several possible types of



ancestor relationships; or there may not exist any ancestor relationship at all.







2.2.4 Methods for alignment



Alignment of two sequences is performed using the following methods [18].









14

Dot Matrix analysis



A dot matrix analysis is primarily a method for comparing two sequences to look



for possible alignment of characters between the sequences. Since this method displays



graphically any possible sequence alignments as diagonals on the matrix, it can readily



reveal the presence of insertions/deletions. The method is also used for direct or inverted



repeats in protein and DNA sequences and for predicting regions in RNA that are self-



complimentary.



The major advantage of this method is that all possible matches of residues



between two sequences are found, leaving the investigator the choice of identifying the



most significant ones by examination of the dot matrix for long runs of matches, which



appear as diagonals. The major limitation of this method is that most of the dot matrix



computer programs do not show an actual alignment of the sequences.







Dynamic Programming approach



The dynamic programming approach; first used for global alignment of protein



sequences by Needleman and Wunsch (1970) and for local alignment by Smith and



Waterman (1981); provides one or more alignments of the sequences. The algorithm



generates an alignment by starting at the ends of two sequences and attempting to match



all possible pairs of characters between the sequences and by following a scoring scheme



for matches, mismatches and gaps. This procedure generates a matrix of numbers that



represents all possible alignments between the sequences, where the highest set of



sequential scores in the matrix defines an optimal alignment.







15

For proteins, an amino acid substitution matrix, such as the Dayhoff Percent



Accepted Mutation Matrix 250 (PAM250) or Blosum Substitution Matrix (BLOSUM62)



is used to score matches and mismatches. Similar matrices are available for aligning



DNA sequences.



The dynamic programming algorithm is guaranteed in a mathematical sense to



provide the optimal alignment for a given set of user-defined scoring matrix and gap



penalties. However the algorithm can be slow due to the very large number of



computational steps required, which increase approximately as the square or cube of the



sequence lengths.







Heuristic Methods



The word or k-tuple methods are used by the FASTA and BLAST algorithms,



which align two sequences very quickly by first searching for identical short stretches of



sequences (called words or k-tuples) and then joining these words into an alignment by



the dynamic programming method. These methods are fast enough to be suitable for



searching an entire database for the sequence that aligns best with an input query



sequence. The FASTA and BLAST methods are heuristic; that is an empirical method of



computer programming in which rules of thumb are used to find solutions and feedback



is used to improve performance. These methods are reliable in a statistical sense and



usually provide the best scoring alignment that is possible.









16

2.3 Gene Ontology



2.3.1 Formation of GO



The exponential growth in the availability of molecular sequences, particularly



entire genomic sequences, has transformed both the theory and practice of experimental



biology and medicine. There was a time when the biologists and other research scientists



in the medical field characterized proteins by their diverse activities and abundances and



geneticists classified genes based on the phenotypes of their mutations. But today all



biologists and the other related research community acknowledge that it is likely that



there is just a single limited universe of genes and proteins, most of which are conserved



in most of the living cells. This acknowledgement has fueled a grand unification of



biology. Information about the biological role of such a shared protein in one organism



can certainly shed some light and also at times provide concrete inference of its role in



some other organism.



This progress in the approach that the scientific community describes and



conceptualizes shared biological elements has not been able to keep up with high



throughput sequencing. As a result, the current systems of nomenclature for genes and



their products remain divergent even with the highly appreciated underlying similarities.



This has led to a lack of progress in the interoperability of genomic databases. This was



the basis of the formation of the Gene Ontology (GO) Consortium [19].



The GO project is a collaborative effort to address the need for consistent



descriptions of gene products in different databases. The project began in 1998 as



collaboration between three model organism databases –



1) FlyBase (Drosophila) - database for the fruitfly Drosophila melanogaster



17

2) Saccharomyces Genome Database (SGD) - database for the budding yeast



Saccharomyces cerevisiae



3) Mouse Genome Database (MGD) - databases for the mouse Mus musculus







Since then, the GO Consortium has grown to include many databases, including



several of the world's major repositories for plant, animal and microbial genomes. The



GO project has developed three structured controlled vocabularies (ontologies) that



describe gene products in terms of their associated biological processes, cellular



components and molecular functions in a species-independent manner. There are three



separate aspects to this effort: first, the development and maintenance of the ontologies



themselves; second, the annotation of gene products, which entails making associations



between the ontologies and the genes and gene products in the collaborating databases;



and third, development of tools that facilitate the creation, maintenance and use of



ontologies [20].







2.3.2 Three Categories of Gene Ontology



The GO annotations are based on the following three principles.



1. Cellular component



Cellular component refers to the place in the cell where a gene product is active.



Its ontology describes locations, at the levels of sub cellular structures and



macromolecular complexes. Cellular component includes such terms as



‘ribosome’ or ‘proteasome’ specifying where multiple gene products would be



found. It also includes terms such as ‘nuclear membrane’ or ‘Golgi apparatus’.



18

2. Biological process



Biological process refers to a biological objective to which to which the gene or



gene product contributes. A process is accomplished via one or more ordered



assemblies of molecular functions. They often involve a physical or chemical



transformation, meaning that something goes into a process and some other thing



comes out of it. ‘Cell growth and maintenance’ or ‘signal transduction’ are some



examples of high-level or broad biological process terms; whereas ‘translation’ or



‘pyrimidine metabolism’ are a couple of low-level or more specific biological



process terms.







3. Molecular function



Molecular function refers to the biochemical activity of a gene product and also



the capability that a gene product carries as a potential. These may include



transporting things around, binding to things, holding things together and



changing one thing into another. This is different from the biological processes



the gene product is involved in, which involve more than one activity. Examples



of broad functional terms are ‘enzyme’, ‘transporter’ or ‘ligand’. Examples of



narrower functional terms are ‘adenylate cyclase’ or ‘toll receptor ligand’.







Cellular components, biological processes and molecular functions are all



attributes of genes or gene products and each of them may be assigned independently. A



gene product might be associated with or located in one or more cellular components, it is



19

Fig. 2.2 Cellular Component GO nodes

Courtesy Xie et al. [24]









Fig. 2.3 Biological Process GO nodes

Courtesy Xie et al. [24]



20

Fig. 2.4 Molecular Function GO nodes

Courtesy Xie et al. [24]







active in one or more biological processes, during which it performs one or more



molecular functions.



The relationships between a gene product to cellular component, biological



process and molecular function are one-to-many, reflecting the biological reality that a



particular protein may function in several processes, contain domains that carry out



diverse molecular functions and participate in alternative interactions with other proteins,



organelles or locations in the cells.



For example, the gene product ‘cytochrome c’ can be described by the molecular



function term ‘oxidoreductase activity’, the biological process terms ‘oxidative



phosphorylation’ and ‘induction of cell death’, and the cellular component terms



‘mitochondrial matrix’ and ‘mitochondrial inner membrane’.



21

2.3.3 Standardized format



The ontologies constituting GO are modeled as a generic category of graphs



known as directed acyclic graphs (DAGs), which have numerous applications in



computer science, such as Bayesian networks and parse trees created by compilers. Each



node in these graphs represents a specific functional label and is assigned a unique GO id



of the form GO: XXXXXXX, and each edge represents either an is:a or a part:of



relationship. This well-defined structure makes GO easily usable by both humans and



computers [20].







2.3.4 GO is the way to go



Since the establishment of GO, many ontology-based sequence annotation



approaches have been developed including several web-based automated GO annotation



software tools. These attempts typically involve a search of homologous proteins in GO-



mapped databases including Genbank and Swiss-Prot.



Hennig et al’s GOblet [21] is a fully integrated local analysis system comprising



of a GO parser which maps the GO hierarchy onto a set of linearized trees. Every query



sequence generates a complete set of relevant GO Ids obtained from BLAST and this set



is used to build a summary tree. The total counts per GO Id are also given to identify the



significant GO terms.



Zehetner’s OntoBlast [22] presents a list of homologues together with their GO



terms. It uses genome data from nine species preprocessed for this tool. In this tool



available online, the use of BLAST is prominent. E-values signifying the similarity of the







22

gene names in the list are used along with weighting numbers to sort the term list. They



give an indication of how strong the evidence for a term is relative to the other terms.



Martin et al’s GOtcha searches a set of seven model genomes and returns scored



matches [23].



Xie et al’s [24] GO Engine combines sequence homology search based on GO



annotations with text information analysis and cellular localization prediction to increase



the annotation accuracy.



Schug et al [25] developed a rule-based function prediction method based on the



intersection of GO terms that contain protein domain at different similarity levels.



Abascal et al [26] presented an automatic annotation method based on protein



family identification.



Jensen et al [27] used neural networks for the prediction while Vinayagam et al.



[28] used support vector machines to increase the performance of the prediction tool.



The appeal of these approaches is that they can directly assign a biological



meaning to an uncharacterized protein sequence.







2.4 Data Classification



Classification is defined as a procedure in which individual items are placed into



groups based on quantitative information on one or more characteristics inherent in the



items and based on a training set of previously labeled items.



The goal of classification is to build a set of models that can correctly predict the class of



the different objects. The input to these methods is a set of objects (training data), the



classes which these objects belong to (dependent variables) and a set of variables



23

describing different characteristics of the objects (independent variables). Once such a



predictive model is built, it can be used to predict the class of the objects for which class



information is not known a priori. The key advantage of supervised learning methods



over unsupervised methods like clustering is that by having an explicit knowledge of the



classes the different objects belong to, these algorithms can perform an effective feature



selection if that leads to better prediction accuracy.







Some of the major classification models are listed below.



1. Classification by decision tree induction



2. Bayesian Classification



3. Neural Networks



4. Support Vector Machines (SVM)



5. Classification Based on Associations



6. kNN Classification







2.5 Microarrays and Gene Expression



2.5.1 Microarray technology



The traditional approach to genomic research was focused on the collection and



local examination of data on single genes. Today, microarray technologies have made it



possible to monitor the expression levels for tens of thousands of genes in parallel. A



microarray works by exploiting the ability of a given mRNA molecule to bind



specifically to, or hybridize to, the DNA template from which it originated. By using an



array containing many DNA samples, scientists can determine, in a single experiment,



24

the expression levels of hundreds or thousands of genes within a cell by measuring the



amount of mRNA bound to each site on the array. With the aid of a computer, the amount



of mRNA bound to the spots on the microarray is precisely measured, generating a



profile of gene expression in the cell.



A microarray is typically a glass slide on to which DNA molecules are fixed in an



orderly manner at specific locations called spots (or features). A microarray may contain



thousands of spots and each spot may contain a few million copies of identical DNA



molecules that uniquely correspond to a gene as shown in Fig. 2.5. The DNA in a spot



may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond



to a gene. The spots are printed on to the glass slide by a robot or are synthesized by the



process of photolithography. Microarrays may be used to measure gene expression in



many ways, but one of the most popular applications is to compare expression of a set of



genes from a cell maintained in a particular condition (condition A) to the same set of



genes from a reference cell maintained under normal conditions (condition B).



Fig. 2.6 gives a general picture of the experimental steps involved. First, RNA is



extracted from the cells. Next, RNA molecules in the extract are reverse transcribed into



cDNA by using an enzyme reverse transcriptase and nucleotides labeled with different



fluorescent dyes. For example, cDNA from cells grown in condition A may be labeled



with a red dye and from cells grown in condition B with a green dye. Once the samples



have been differentially labeled, they are allowed to hybridize onto the same glass slide.



At this point, any cDNA sequence in the sample will hybridize to specific spots on the



glass slide containing its complementary sequence. The amount of cDNA bound to a spot







25

Fig. 2.5 A schematic representation of a microarray chip [29]







will be directly proportional to the initial number of RNA molecules present for that gene



in both samples. Following the hybridization step, the spots in the hybridized microarray



are excited by a laser and scanned at suitable wavelengths to detect the red and green



dyes. The amount of fluorescence emitted upon excitation corresponds to the amount of



bound nucleic acid. For instance, if cDNA from condition A for a particular gene was in



greater abundance than that from condition B, one would find the spot to be red. If it was



the other way, the spot would be green. If the gene was expressed to the same extent in



both conditions, one would find the spot to be yellow, and if the gene was not expressed



26

in both conditions, the spot would be black. Thus, what is seen at the end of the



experimental stage is an image of the microarray, in which each spot that corresponds to



a gene has an associated fluorescence value representing the relative expression level of



that gene.









Fig. 2.6 Schematic for the experimental setup for any microarray gene expression

analysis [29]





The microarray is scanned following hybridization and a TIFF image file is



normally generated. Once image generation is completed, the image is analyzed to



identify spots. We saw that the relative expression level for a gene can be measured as



27

the amount of red or green light emitted after excitation. The most common metric used



to relate this information is called expression ratio. It is denoted here as Tk and defined as:



Rk

Tk =

Gk



for each gene k on the array, where Rk represents the spot intensity metric for the test



sample and Gk represents the spot intensity metric for the reference sample. This



expression ratio undergoes transformation and normalization to bring out the differential



gene expression [29].







2.5.2 Gene Expression Data



The processed data, after the normalization procedure, can then be represented in



the form of a matrix, often called gene expression matrix; the different forms of which



are shown in Table 2.1. Each row in the matrix corresponds to a particular gene and each



column could either correspond to an experimental condition or a specific time point at



which expression of the genes has been measured. The expression levels for a gene



across different experimental conditions are cumulatively called the gene expression



profile, and the expression levels for all genes under an experimental condition are



cumulatively called the sample expression profile. Once we have obtained the gene



expression matrix, additional levels of annotation can be added either to the gene or to the



sample. For example, the function of the genes can be provided, or the additional details



on the biology of the sample may be provided, such as disease state or normal state.









28

Table 2.1 Representation of gene expression data [29]







To make any meaningful comparison or biological analysis, one should know what the



data in the gene expression matrix represents. Expression data can be represented in five



different ways, which are described below:







1. Absolute measurement



In the case of an absolute measurement, each cell in the matrix will represent the



expression level of the gene in abstract units.









29

2. Relative measurement or expression ratio



In the case of a relative measurement or representations involving expression



ratio, the expression level of a gene in abstract units is normalized with respect to



its expression in a reference condition. This gives the expression ratio of the gene



in relative units. Note that in such cases, a ratio of 4000/100 will lead to the same



result as 40/10.







3. log2(expression ratio)



In the case of tables representing the log2(expression ratio) values, information on



upregulation and down-regulation is captured and is mapped in a symmetric



manner. For example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold



down-regulation maps to log2(1/4) = -2.







4. Discrete values



Another way of representing information is to convert to discrete numbers the



values in the tables mentioned above. In the case of converting the absolute



measurement to discrete numbers, a binary expression matrix of 1 and 0 can be



used, where 1 means that the gene is expressed above a user defined threshold,



and 0 means that the gene is expressed below this threshold. In the case of making



the relative expression tables or log2 (expression ratio) tables discrete, values can



be divided into 3 classes, +1, 0 and –1, where +1 represents a gene that is



positively regulated, 0 represents a gene that is not differentially regulated and –1



represents a gene that is repressed.



30

5. Representation of expression profiles as vectors



So far we have seen how individual cells in the gene expression matrix can be



represented. Similarly, an expression profile (of a gene or a sample) can be



thought of as a vector and can be represented in vector space. For example, an



expression profile of a gene can be considered as a vector in n dimensional space



(where n is the number of conditions), and an expression profile of a sample with



m genes can be considered as a vector in m dimensional space (where m is the



number of genes).







2.6 Data Clustering



One of the goals of microarray data analysis is to cluster genes or samples with



similar expression profiles together to make meaningful biological inference about the set



of genes or samples. Clustering is one of the unsupervised approaches to classify data



into groups of genes or samples with similar patterns that are characteristic to the group.



Clustering methods can be hierarchical, that is grouping objects into clusters and



specifying relationships among objects in a cluster, resembling a phylogenetic tree or



non-hierarchical, that is grouping into clusters without specifying relationships between



objects in a cluster as schematically represented in Fig. 2.6.









31

Fig. 2.7 Various clustering methods







2.6.1 Hierarchical clustering: agglomerative



In the case of a hierarchical agglomerative clustering, the objects are successively



fused until all the objects are included. For a hierarchical agglomerative clustering



procedure, each object is considered as a cluster. The first step is the calculation of



pairwise distance measures for the objects to be clustered. Based on the pairwise



distances between them, objects that are similar to each other are grouped into clusters.



After this is done, pairwise distances between the clusters are re-calculated, and clusters



that are similar are grouped together in an iterative manner until all the objects are



included into a single cluster. This information can be represented as a dendrogram,



where the distance from the branch point is indicative of the distance between the two



clusters or objects. Comparison of clusters with another cluster or an object can be



carried out using four approaches



32

1. Single linkage clustering (Minimum distance)



In single linkage clustering, distance between two clusters is calculated as the



minimum distance between all possible pairs of objects, one from each cluster. This



method has an advantage that it is insensitive to outliers. This method is also known



as the nearest neighbor linkage.







2. Complete linkage clustering (Maximum distance)



In complete linkage clustering, distance between two clusters is calculated as the



maximum distance between all possible pairs of objects, one from each cluster. The



disadvantage of this method is that it is sensitive to outliers. This method is also



known as the farthest neighbor linkage.







3. Average linkage clustering



In average linkage clustering, distance between two clusters is calculated as the



average of distances between all possible pairs of objects in the two clusters.







4. Centroid linkage clustering



In centroid linkage clustering, an average expression profile (called a centroid) is



calculated.









33

2.6.2 Hierarchical clustering: divisive



Hierarchical divisive clustering is the opposite of the agglomerative method,



where the entire set of objects is considered as a single cluster and is broken down into



two or more clusters that have similar expression profiles. After this is done, each cluster



is considered separately and the divisive process is repeated iteratively until all objects



have been separated into single objects.







2.6.3 Non-hierarchical clustering



Non-hierarchical clustering requires predetermination of the number of clusters.



Non-hierarchical clustering then groups existing objects into these predefined clusters



rather than organizing them into a hierarchical structure.







1. Non-hierarchical clustering: K-means



K-means is a popular non-hierarchical clustering method. In K-means clustering,



the first step is to arbitrarily group objects into a predetermined number of clusters.



The number of clusters can be chosen randomly or estimated by first performing a



hierarchical clustering of the data. Following this step, an average expression



profile (centroid) is calculated for each cluster, this is called initialization. Next,



individual objects are reattributed from one cluster to the other depending on which



centroid is closer to the gene. This procedure of calculating the centroid for each



cluster and re-grouping objects closer to available centroids is performed in an



iterative manner for a fixed number of times, or until convergence.







34

2. Non-hierarchical clustering: Self Organizing Maps









Fig. 2.8 Self-Organizing Maps







Self Organizing Maps (SOMs) work in a manner similar to K-means clustering



(Fig. 2.8). In K-means clustering, one chooses the number of clusters to fit the data,



whereas with SOM the first step is to choose the number and orientation of the



clusters with respect to each other. For example, a two-dimensional grid of nodes



could be the starting point. The grid is projected onto the expression space and each



object is assigned a node that is nearest to it – this is called initialization. In the next



step, a random object is chosen and the node (called a reference vector) which is in



the neighborhood of the object is moved closer to it. The other nodes are moved to a



small extent depending on how close they are to the object chosen. In successive



iterations, with randomly chosen objects, the positions of the nodes are refined and



the radius of neighborhood becomes confined. In this way, the grid of nodes



(initially a two-dimensional grid) is deformed to fit the data. The advantage of this



method, unlike K-means, is that SOM does not force the number of clusters to be



equal to the number of starting nodes in the chosen grid. This is because some



nodes may have no objects associated with them when the map is complete. Other

35

advantages of SOM include providing information on the similarity between the



nodes, and the ability of SOM to produce reliable results even with noisy data.







2.6.4 Predicting protein interactions and protein functions



Integrating expression data with other external information, for example



evolutionary conservation of proteins, have been used to predict interacting proteins,



protein complexes, and protein function. Work by Ge et al. [30] and Jansen and Gerstein



[31] have shown that genes with similar expression profiles are more likely to encode



proteins that interact. When this information is combined with evolutionary conservation



of proteins, meaningful predictions can be made. In a recent work by Teichmann and



Madan Babu [32], it was shown that proteins that are evolutionarily conserved in yeast



and worm and that have similar expression profiles in both organisms tend to be a part of



the same stable complex or interact physically. Noort et al [33] have also shown that the



encoded proteins of conserved, co-expressed gene pairs are highly likely to be part of the



same pathway. Such studies enable us to predict specific gene functions.



Clustering techniques have proven to be helpful to understand gene function, gene



regulation, cellular processes, and subtypes of cells. Genes with similar expression



patterns (co-expressed genes) can be clustered together with similar cellular functions.



This approach may further help in understanding the functions of many genes for which



information has not been previously available. Furthermore, co-expressed genes in the



same cluster are likely to be involved in the same cellular processes, and a strong



correlation of expression patterns between those genes indicates co-regulation. Searching



for common DNA sequences at the promoter regions of genes within the same cluster



36

allows regulatory motifs specific to each gene cluster to be identified and cis-regulatory



elements to be proposed. The inference of regulation through the clustering of gene



expression data also gives rise to hypotheses regarding the mechanism of the



transcriptional regulatory network. Finally, clustering different samples on the basis of



corresponding expression profiles may reveal sub-cell types which are hard to identify by



traditional morphology-based approaches.









37

CHAPTER III



MATERIALS AND METHODS





3.1 Data Set



The crucial step in developing a good prediction system is always to obtain a



good data set. In this study, we used the complete yeast (Sacchyromyces cerevisiae)



proteome.



Before we started designing the algorithm, we downloaded and studied three major



data files



1. Yeast.dat, which contains the sequence information for the entire proteome



2. GOGroups.dat, which contains the annotation details for the Gene Ontology



3. YeastMicroarray.dat, which contains the gene expression data for a complete



yeast cell-cycle







3.1.1 Complete yeast proteome



The July 2005 version of the complete proteome data consisting of 6247 proteins



was downloaded from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) – Protein



Knowledgebase using the ExPASy (Expert Protein Analysis System) proteomics server



of the Swiss Institute of Bioinformatics (SIB) dedicated to the analysis of protein



sequences and structures [34].





38

Swiss-Prot is an annotated protein sequence database, which was established in 1986 and



maintained collaboratively, since 1987, by the group of Amos Bairoch first at the



Department of Medical Biochemistry of the University of Geneva and now at the Swiss



Institute of Bioinformatics (SIB) and the EMBL Data Library (now the EMBL



Outstation- the European Bioinformatics Institute (EBI)).



The Swiss-Prot Protein Knowledgebase consists of sequence entries. Each entry



corresponds to a single contiguous sequence as contributed to the bank or reported in the



literature. The entries are structured so as to be usable by human readers as well as



computer programs. The explanations, descriptions, classifications and other comments



are in English. Wherever possible, symbols familiar to biochemists, protein chemists and



molecular biologists are used. Each sequence entry is composed of lines. Different types



of lines, each with their own format, are used to record the various data that make up the



entry. Sequence entries are composed of different line types, each with its own format



[35].



For each sequence entry the core data consists of:



• The sequence data;



• The citation information (bibliographical references);



• The taxonomic data (description of the biological source of the protein).







A sample sequence entry is shown below.





ID 2A5D_YEAST STANDARD; PRT; 757 AA.

AC P38903;

DT 01-FEB-1995 (Rel. 31, Created)

DT 01-OCT-1996 (Rel. 34, Last sequence update)

DT 10-MAY-2005 (Rel. 47, Last annotation update)

DE Serine/threonine protein phosphatase 2A, 56 kDa regulatory subunit,

DE delta isoform (PP2A, B subunit, B' delta isoform) (RTS1 protein) (SCS1



39

DE protein).

GN Name=RTS1; Synonyms=SCS1; OrderedLocusNames=YOR014W; ORFNames=OR26.04;

OS Saccharomyces cerevisiae (Baker's yeast).

OC Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;

OC Saccharomycetales; Saccharomycetaceae; Saccharomyces.

OX NCBI_TaxID=4932;

RN [1]

RP NUCLEOTIDE SEQUENCE.

RX MEDLINE=96271526; PubMed=8846889;

RA Evangelista C.C. Jr., Rodriguez Torres A.M., Limbach M.P.,

RA Zitomer R.S.;

RT "Rox3 and Rts1 function in the global stress response pathway in

RT baker's yeast.";

RL Genetics 142:1083-1093(1996).

RN [2]

RP NUCLEOTIDE SEQUENCE, AND SUBCELLULAR LOCATION.

RX MEDLINE=96009589; PubMed=7565713;

RA Shu Y., Hallberg R.L.;

RT "SCS1, a multicopy suppressor of hsp60-ts mutant alleles, does not

RT encode a mitochondrially targeted protein.";

RL Mol. Cell. Biol. 15:5618-5626(1995).

RN [3]

RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].

RC STRAIN=S288c / FY1679;

RX MEDLINE=97313270; PubMed=9169874;

RA Dujon B., Albermann K., Aldea M., Alexandraki D., Ansorge W.,

RA Arino J., Benes V., Bohn C., Bolotin-Fukuhara M., Bordonne R.,

RA Boyer J., Camasses A., Casamayor A., Casas C., Cheret G.,

RA Cziepluch C., Daignan-Fornier B., Dang D.V., de Haan M., Delius H.,

RA Durand P., Fairhead C., Feldmann H., Gaillon L., Galisson F.,

RA Gamo F.-J., Gancedo C., Goffeau A., Goulding S.E., Grivell L.A.,

RA Habbig B., Hand N.J., Hani J., Hattenhorst U., Hebling U.,

RA Hernando Y., Herrero E., Heumann K., Hiesel R., Hilger F., Hofmann B.,

RA Hollenberg C.P., Hughes B., Jauniaux J.-C., Kalogeropoulos A.,

RA Katsoulou C., Kordes E., Lafuente M.J., Landt O., Louis E.J.,

RA Maarse A.C., Madania A., Mannhaupt G., Marck C., Martin R.P.,

RA Mewes H.-W., Michaux G., Paces V., Parle-McDermott A.G., Pearson B.M.,

RA Perrin A., Pettersson B., Poch O., Pohl T.M., Poirey R.,

RA Portetelle D., Pujol A., Purnelle B., Ramezani Rad M., Rechmann S.,

RA Schwager C., Schweizer M., Sor F., Sterky F., Tarassov I.A.,

RA Teodoru C., Tettelin H., Thierry A., Tobiasch E., Tzermia M.,

RA Uhlen M., Unseld M., Valens M., Vandenbol M., Vetter I., Vlcek C.,

RA Voet M., Volckaert G., Voss H., Wambutt R., Wedler H., Wiemann S.,

RA Winsor B., Wolfe K.H., Zollner A., Zumstein E., Kleine K.;

RT "The nucleotide sequence of Saccharomyces cerevisiae chromosome XV.";

RL Nature 387:98-102(1997).

RN [4]

RP FUNCTION, SUBUNIT, AND PHOSPHORYLATION.

RX MEDLINE=97299674; PubMed=9154823;

RA Shu Y., Yang H., Hallberg E., Hallberg R.;

RT "Molecular genetic analysis of Rts1p, a B' regulatory subunit of

RT Saccharomyces cerevisiae protein phosphatase 2A.";

RL Mol. Cell. Biol. 17:3242-3253(1997).

CC -!- FUNCTION: The B regulatory subunit might modulate substrate

CC selectivity and catalytic activity, and also might direct the

CC localization of the catalytic enzyme to a particular subcellular

CC compartment.

CC -!- FUNCTION: Multicopy suppressor of ROX3 and HSP60.

CC -!- SUBUNIT: PP2A consists of a common heterodimeric core enzyme,

CC composed of a 36 kDa catalytic subunit (subunit C) and a 65 kDa

CC constant regulatory subunit (PR65 or subunit A), that associates

CC with a variety of regulatory subunits. Proteins that associate

CC with the core dimer include three families of regulatory subunits

CC B (the R2/B/PR55/B55, R3/B''/PR72/PR130/PR59 and R5/B'/B56

CC families), the 48 kDa variable regulatory subunit, viral proteins,

CC and cell signaling molecules.

CC -!- INTERACTION:

CC P47135:JSN1; NbExp=1; IntAct=EBI-1931, EBI-9422;

CC P53958:YNL042W; NbExp=1; IntAct=EBI-1931, EBI-28694;



40

CC -!- SUBCELLULAR LOCATION: Cytoplasmic and nuclear.

CC -!- PTM: Phosphorylated.

CC -!- SIMILARITY: Belongs to the phosphatase 2A regulatory subunit B

CC family.

CC --------------------------------------------------------------------------

CC This SWISS-PROT entry is copyright. It is produced through a collaboration

CC between the Swiss Institute of Bioinformatics and the EMBL outstation -

CC the European Bioinformatics Institute. There are no restrictions on its

CC use by non-profit institutions as long as its content is in no way

CC modified and this statement is not removed. Usage by and for commercial

CC entities requires a license agreement (See http://www.isb-sib.ch/announce/

CC or send an email to license@isb-sib.ch).

CC --------------------------------------------------------------------------

DR EMBL; U06630; AAB38372.1; -; Genomic_DNA.

DR EMBL; S79635; AAB35312.1; -; Genomic_DNA.

DR EMBL; X87331; CAA60763.1; -; Genomic_DNA.

DR EMBL; Z74922; CAA99203.1; -; Genomic_DNA.

DR PIR; S54620; S54620.

DR IntAct; P38903; -.

DR GermOnline; 143602; -.

DR Ensembl; YOR014W; Saccharomyces cerevisiae.

DR SGD; S000005540; RTS1.

DR GO; GO:0005935; C:bud neck; IDA.

DR GO; GO:0005737; C:cytoplasm; IDA.

DR GO; GO:0005634; C:nucleus; IDA.

DR GO; GO:0000159; C:protein phosphatase type 2A complex; TAS.

DR GO; GO:0005816; C:spindle pole body; IDA.

DR GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.

DR GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.

DR InterPro; IPR008938; ARM.

DR InterPro; IPR002554; B56.

DR Pfam; PF01603; B56; 1.

KW Complete proteome; Hydrolase; Nuclear protein; Phosphorylation;

KW Protein phosphatase.

FT COMPBIAS 16 22 Poly-Ser.

FT COMPBIAS 46 51 Poly-Ser.

FT COMPBIAS 98 110 Poly-Ser.

FT COMPBIAS 143 147 Poly-Ser.

FT COMPBIAS 202 213 Poly-Asn.

FT CONFLICT 95 95 T -> S (in Ref. 2).

FT CONFLICT 529 529 L -> F (in Ref. 1).

FT CONFLICT 581 581 A -> R (in Ref. 1).

SQ SEQUENCE 757 AA; 85335 MW; 5A7476C30140331C CRC64;

MMRGFKQRLI KKTTGSSSSS SSKKKDKEKE KEKSSTTSST SKKPASASSS SHGTTHSSAS

STGSKSTTEK GKQSGSVPSQ GKHHSSSTSK TKTATTPSSS SSSSRSSSVS RSGSSSTKKT

SSRKGQEQSK QSQQPSQSQK QGSSSSSAAI MNPTPVLTVT KDDKSTSGED HAHPTLLGAV

SAVPSSPISN ASGTAVSSDV ENGNSNNNNM NINTSNTQDA NHASSQSIDI PRSSHSFERL

PTPTKLNPDT DLELIKTPQR HSSSRFEPSR YTPLTKLPNF NEVSPEERIP LFIAKVDQCN

TMFDFNDPSF DIQGKEIKRS TLDELIEFLV TNRFTYTNEM YAHVVNMFKI NLFRPIPPPV

NPVGDIYDPD EDEPVNELAW PHMQAVYEFF LRFVESPDFN HQIAKQYIDQ DFILKLLELF

DSEDIRERDC LKTTLHRIYG KFLSLRSFIR RSMNNIFLQF IYETEKFNGV AELLEILGSI

INGFALPLKE EHKVFLVRIL IPLHKVRCLS LYHPQLAYCI VQFLEKDPLL TEEVVMGLLR

YWPKINSTKE IMFLNEIEDI FEVIEPLEFI KVEVPLFVQL AKCISSPHFQ VAEKVLSYWN

NEYFLNLCIE NAEVILPIIF PALYELTSQL ELDTANGEDS ISDPYMLVEQ AINSGSWNRA

IHAMAFKALK IFLETNPVLY ENCNALYLSS VKETQQRKVQ REENWSKLEE YVKNLRINND

KDQYTIKNPE LRNSFNTASE NNTLNEENEN DCDSEIQ

//







Each line begins with a two-character line code, which indicates the type of data



contained in the line. The current line types and line codes and the order in which they



appear in an entry, are shown in the Table 3.1.





41

Line code Content Occurrence in an entry

ID Identification Once; starts the entry

AC Accession number(s) Once or more

DT Date Three times

DE Description Once or more

GN Gene name(s) Optional

OS Organism species Once

OG Organelle Optional

OC Organism classification Once or more

OX Taxonomy cross-reference Once

RN Reference number Once or more

RP Reference position Once or more

RC Reference comment(s) Optional

RX Reference cross-reference(s) Optional

RG Reference group Once or more (Optional if RA line)

RA Reference authors Once or more (Optional if RG line)

RT Reference title Optional

RL Reference location Once or more

CC Comments or notes Optional

DR Database cross-references Optional

KW Keywords Optional

FT Feature table data Optional

SQ Sequence header Once

(blanks) Sequence data Once or more

// Termination line Once; ends the entry



Table 3.1 Line types and line codes for Swiss-Prot sequence entry [35]







The two-character line-type code that begins each line is always followed by three



blanks, so that the actual information begins with the sixth character. In general,



information is not extended beyond character position 75, there are however a few



exceptions where lines may be longer. Since the entire database follows the same



standardized format, it is computer readable and can be programmed to access this



database information and process it further for the analysis.



Out of this information about one protein sequence, only the information on





42

1. AC – accession number,



2. DR – database cross-reference and



3. SQ – sequence



is actually used for the prediction algorithm.







The AC (Acession number) line lists the accession number(s) associated with an entry.



The format of the AC line is:

AC AC_number_1;[ AC_number_2;]...[ AC_number_N;]



An example of an accession number line is shown below:

AC P00321;



Semicolons separate the accession numbers and a semicolon terminates the list. If the



information is unable to fit in a single line, more than one AC lines can be used.



The accession number is used by the prediction algorithm to uniquely identify each



protein and get its annotation from the Gene Ontology (GO).







The DR (Database cross-Reference) lines are used as pointers to information related to



entries and found in data collections other than Swiss-Prot.



The format of the DR line is:

DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[;



TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].



The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name



of the data collection to which reference is made. The second item on the DR line, the



'PRIMARY_IDENTIFIER', is an unambiguous pointer to the information entry in the



database to which the reference is being made. The third item on the DR line, the





43

'SECONDARY_IDENTIFIER', is generally used to complement the information given



by the first identifier. A number of DR lines possess a fourth item, the



'TERTIARY_IDENTIFIER', which is generally used to give more detailed information.



Since we are using the GO annotation, we deal with the DR lines with the



'DATABASE_IDENTIFIER' as GO. For GO, the 'PRIMARY_IDENTIFIER' gives us the



primary GO id which gives is actually referenced in the GO database. The



'SECONDARY_IDENTIFIER' is either one of the characters C, F or P for the 3 principal



categories of the GO Database- C: cellular component; F: molecular function; and P:



biological process.

DR GO; GO:0005816; C:spindle pole body; IDA.

DR GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.

DR GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.



Only molecular function (F) ontology is considered for the prediction model.







The SQ (sequence data) line has a line code consisting of two blanks rather than the two-



letter codes used until now. The sequence counts 60 amino acids per line, in groups of 10



amino acids, beginning in position 6 of the line.



An example of sequence data preceded by the corresponding SQ line is shown here:

SQ SEQUENCE 97 AA; 9110 MW; E3C20C259858B830 CRC64;

MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV

NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI









The // (terminator) line contains no data or comments and designates the end of an entry.







3.1.2 GO Annotations



A GO definition file was obtained from the Gene Ontology consortium web site



[36]. In the version of July 2005, there are 19094 GO terms including

44

• 9856 biological process terms,



• 7559 molecular function terms, and



• 1679 cellular component terms.



The GO annotation file is available in the OBO file format and is maintained in



ASCII doc format in the Gene Ontology CVS repository. The OBO file format itself



attempts to achieve the goals of human readability, ease of parsing, extensibility and



minimal redundancy.



The OBO file format consists of a and s. The header is an



unlabeled section at the beginning of the document containing tag-value pairs. A stanza is



a labeled section of the document, indicating that an object of a particular type is being



described. Stanzas consist of a stanza name in square brackets, and a series of tag-value



pairs. Tag-value pairs consist of a tag name, a colon, the tag value, and a newline.

: {} !



The tag name is always a string. The value is always a string, but the value string may



require special parsing depending on the tag with which it is associated. In general, tag-



value pairs occur on a single line.



A sample stanza entry is shown.

[Term]

id: GO:0000006

name: high affinity zinc uptake transporter activity

namespace: molecular_function

def: "Catalysis of the reaction: Zn2+(out) = Zn2+(in), probably powered by proton motive

force." [TC:2.A.5.1.1]

xref_analog: TC:2.A.5.1.1

is_a: GO:0005385 ! zinc ion transporter activity





At present, every OBO stanza always begins with an id tag. The value of the id tag



announces the object to which the rest of the tags in the stanza refer. Normal, non-



anonymous ids have global scope. An object has the same id in every file, and in every



45

namespace. In our parsing system for using the GO annotation, we just deal with some of



the tags mentioned below.





id The unique id of the current term

name The term name. Any term may have only one name defined

namespace The ontology to which the term belongs

alt_id Defines an alternate id for this term

def The definition of the current term

xref A dbxref that describes an analagous term in another vocabulary

is_a This tag describes a subclassing relationship between one term and

another. The value is the id of the term of which this term is a subclass. A

term may have any number of is_a relationships.

relationship This tag describes a typed relationship between this term and another term.

is_obsolete Whether or not this term is obsolete.







3.1.3 Yeast Gene Expression Data



The second kind of yeast data used was the gene expression data for a complete



yeast cell cycle. The availability of complete sequence for the Saccharomyces cerevisiae



genome has made it possible to measure mRNA transcript levels for virtually every yeast



gene. In the study by Cho et al. [37], commercially available high-density oligonucleotide



arrays were used to estimate mRNA transcript levels in synchronized yeast cells at



regular intervals during the cell cycle. DNA oligonucleotide probes are directly



synthesized on these arrays without individual manipulation or PCR amplification,



minimizing the potential for cross-hybridization or clone error. The data was generated



by the Yeast Cell Cycle Analysis Project by Spellman et al. [38], which is available at the



Genomics server of Stanford University Medical Center [39].









46

The dataset includes 24 gene expression profiles measuring the mRNA expression



levels of the proteins during the growth of yeast cells. Each profile includes 6178 related



proteins. A small part of the micro array data is shown in Table 3.2. The rows represent



different genes while the columns represent the gene expression levels at different time



instants.







cdc15_10 cdc15_30 cdc15_50 cdc15_70 cdc15_80 cdc15_90 cdc15_100

YBR217W -0.26 -0.48 -0.07 -0.34 0.23 -0.48 0.35

YBR218C 0.08 0.13 -0.35 -0.25 -0.25 -0.23 -0.16

YBR219C 0.26 0.17 0.18 0.15 0 0.04 -0.42

YBR220C 0.25 -0.09 0.09 -0.14 0.33 -0.14 0.4

YBR221C -0.56 -0.73 -0.5 -0.18 0.05 0.01 -0.46

YBR222C -0.84 -0.79 -0.02 0.08 0.48 0.26 0.25

YBR223C 0.13 0.18 0.18 -0.25 -0.21 -0.04 -0.32

YBR224W -0.2 0.24 0.02 -0.62 0.17 -0.44 0.27

YBR225W 0.2 0.34 0.2 -0.1 -0.16 0.07 -0.07

YBR226C 0.23 0.42 0.1 -0.1 0.11 -0.05 0.25

YBR227C 0.29 0.62 0.68 -0.01 -0.61 0.27 -0.29

YBR228W 0.11 0.25 -0.04 -0.32 0.21 -0.22 0.07

YBR229C 0.02 -0.16 -0.16 -0.06 -0.09 0.15 -0.42

YBR230C -0.3 -0.69 -0.56 -0.88 0.68 -0.2 0.63

YBR231C 0 0.37 -0.11 -0.5 -0.1 -0.36 0.03

YBR232C -0.04 0.29 0 -0.55 -0.06 -0.61 -0.09

YBR233W -0.06 0.22 0.19 -0.26 -0.1 -0.52 0.14

YBR234C -0.53 -0.8 -0.76 -0.29 0.49 -0.24 0.17

YAL020C -0.38 -0.42 -0.28 -0.15 0.16 -0.12 0.26

YAL021C 0.17 0.22 -0.09 0.19 -0.17 0.16







Table 3.2 Yeast gene expression data [39]







The data was recorded over a complete yeast cell cycle. The columns represent different



time slots varying from 10 minutes to 270 minutes. The rows represent the different



genes and the data is the gene expression levels in the form of logarithmic ratios.









47

3.2 Data pre-processing



The files downloaded from the various servers are not used in their entirety. So



they need to be pre-processed to extract the actual information that is necessary and also



to get them in the appropriate format to be processed easily and in a time-efficient



manner.



From the Yeast Proteome file (Yeast.dat) we extract the AC line, the DR line with



the GO database reference to the molecular function (F) ontology and the SQ line with



the complete sequence. The GO ids for the annotated molecular functions are stored



along with the protein accession id in a separate file. The protein accession id along with



the sequence is stored in another file. The first file forms a lookup table to look up for the



GO annotation for any yeast protein. The sequences in the latter file are used in the



BLAST program to obtain the similarity scores among all the sequences.



Similarly, from the Gene Ontology file, only the required information like the GO



id and the “is_a” or “part_of” information are extracted for the molecular function



ontology to form a file from which a tree structure of the entire GO annotation can be



built. The obsolete terms are not considered and the alternate ids (alt_id) are also



recorded if they exist.



Perl codes were written to perform all these pre-processing operations.



In case of the gene expression data, it was found out that there were a lot of genes



with missing data. So the genes with 50% or more of missing data were not considered.



As a result 516 genes were eliminated leaving a data of 5662 genes.



The pre-processing of the gene expression data was performed using MathWorks



MATLAB® R2006a and Microsoft® Excel 2003.



48

3.3 Sequence similarity scoring



Sequence similarities can be measured through pair-wise global alignments or



local alignments. Homologous protein sequences are usually similar over active domains



and thus share common folds and functions. Therefore, local alignment is a more



appropriate method for comparing protein sequences for their functional similarity. There



are several local alignment schemes for comparing protein sequences, including BLAST



that can be used together with different scoring systems such as BLOSUM62 and



BLOSUM80.







3.3.1 BLAST Algorithm









Fig. 3.1 BLAST Algorithm [40]







49

The BLAST algorithm is a heuristic search method that seeks words of length W (default



= 3 in blastp) that score at least T when aligned with the query and scored with a



substitution matrix. Words in the database that score T or greater are extended in both



directions in an attempt to find a locally optimal ungapped alignment or HSP (high



scoring pair) with a score of at least S or an E value lower than the specified threshold.



HSPs that meet these criteria will be reported by BLAST, provided they do not exceed



the cutoff value specified for number of descriptions and/or alignments to report [40].







The program returns a list of local alignments of certain statistical significance.



However, how to measure the overall similarity of two protein sequences is not obvious.



For example, proteins with two similar domains with certain similarity scores could be



considered to be much more similar than proteins with only one domain with a higher



similarity score. In this study, a novel measure of overall similarity of two protein



sequences is introduced. We utilize the alignment tool for BLASTing two sequences to



obtain the list of optimal local alignments. Let {S1, …,Sn} be scores of a list of best local



alignments with certain statistical significance. Instead of using the highest score



(max{S1, …,Sn}) in the list to measure the overall similarity of the two protein sequences,



we use the following score S to measure the overall similarity of two sequences [41].



S = −∑ ln pi , (1)

i







where pi = 1 − e − E stands for the probability of finding a high-scoring segment pair (HSP)

i









with a local alignment score of at least Si, and Ei is the expected number of HSPs of score







50

at least Si and can be obtained directly from the alignment tool. Assuming the HSPs are



independent of each other, the p-value:



p = ∏ pi = e − S (2)

i







measures the probability of finding a pair of protein sequences with a list of scores at



least {S1, …, Sn}. The corresponding E-value for the overall similarity (the expected



number of the lists that have scores at least {S1, …, Sn}) therefore can be written as:



E = − ln(1 − e − S ) . (3)





We use the p-values and the E-values to measure the overall similarity of a pair of protein



sequences. Since when | x |;

close(INFILE);



open(INFILE,";

close(INFILE);



open(INFILE,";

close(INFILE);



open(INFILE,";

close(INFILE);

77

open(OUTFILE,">$outReportFile") || die("Could not create

$outReportFile"); # Output report

%results=("correct",0,"incorrect",0,"failed",0);

$total=0;

foreach $queryprot(@prot)

{

$total++;

chomp $queryprot;

print "Query # $total\tProcessing query $queryprot\n";

$result=prediction($queryprot);

$results{$result}++;

}

print(OUTFILE "\n\nTotal Predictions : \t$total\nCorrect Predictions :

\t$results{'correct'}\nIncorrect Predictions :

$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");

print("\n\nTotal Predictions : \t$total\nCorrect Predictions :

\t$results{'correct'}\nIncorrect Predictions :

$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");

close(OUTFILE);





#======================================================================

============================

# Subroutine to predict the function of the protein

#======================================================================

============================



sub prediction

{

my($query_protein) = @_;

my($K) = 10;

my(%ClassifierCount)=();

my(%MatchProtein)=();

my($hit)=0;

my($prev_protein)=0;

my($prev_GO)=0;



my(@Matches) = getMatch($query_protein);

foreach $eachMatch(@Matches)

{

if($eachMatch eq "0")

{

print(OUTFILE "$query_protein : PREDICTION FAILED - Invalid

query protein\n");

return "failed";

}

my ($MProtein,$MScore)=split(/\t/,$eachMatch);

$MatchProtein{$MProtein} = "$MScore";

}

my(@keys) = sort {$MatchProtein{$a} $MatchProtein{$b}} (keys

%MatchProtein);

if(scalar @keys $ClassifierCount{$a}} (keys

%ClassifierCount);



@Q_node=getGOnode($query_protein);

foreach $eachQ_node(@Q_node)

{

if($eachQ_node eq "0")

{

print(OUTFILE "$query_protein : PREDICTION FAILED - Query

protein GO node not existing\n");

return "failed";

}

@Q_path=getGOpath($eachQ_node);

foreach $eachQ_path(@Q_path)

{

if($eachQ_path eq "0")

{ last; }

@Q_GOPath = split(/\s/,$eachQ_path);

@Q_GOPath=reverse(@Q_GOPath);

if($keys[0] eq $Q_GOPath[1])

{ $hit=1; }



79

}

}

if($hit==1)

{ print(OUTFILE "$query_protein : PREDICTION CORRECT\n"); return

"correct"; }

else

{ print(OUTFILE "$query_protein : PREDICTION INCORRECT\n"); return

"incorrect"; }

}





#======================================================================

============================

# Subroutine to obtain significantly similar proteins using Score file

#======================================================================

============================



sub getMatch

{

my($query)=@_;

my($check)=0;

my(@Match);



foreach $score_line(@score)

{

chomp $score_line;

if($score_line =~ /$query/)

{

$check =1;

$match_line=$score_line;

$match_line=~s/$query//gi;

push(@Match,"$match_line");

}

}

if($check==0)

{ push(@Match,"0"); }

return @Match;

}





#======================================================================

============================

# Subroutine to obtain GO node terms of the protein from Yeast data

file

#======================================================================

============================



sub getGOnode

{

my($protein)=@_;

my($det)=0;



foreach $data_line(@data)

{

chomp $data_line;



80

if($data_line=~/^AC.*$protein/ && $det==0)

{ $det=1; }

elsif($det==1 && $data_line=~/^GO/)

{

$GO_line=$data_line;

$GO_line=~s/GO:/ GO:/gi;

my(@GOlist)=split(/\s/,$GO_line);

shift(@GOlist);

return @GOlist;

}

else

{ $det=0; }

}

push(@GOlist,"0");

return @GOlist;

}



#======================================================================

============================

# Subroutine to obtain the various GO paths for a GO node term

#======================================================================

============================



sub getGOpath

{

my ($GOnode)=@_;

my (@GOTree)=(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);

my ($m)=1;

my ($det)=0;

my ($flag)=0;

my ($currPath)=$GOnode;

my ($currGO)=$GOnode;

$j=0;

$GOTree[0]=$GOnode;



for($i=0;$i<$m;$i++)

{

$currGO=lastterm($GOTree[$i]);

while($currGO ne "GO:0003674")

{

$currPath=$GOTree[$i];

$currGO=lastterm($currPath);

foreach $group_line(@group)

{

chomp $group_line;

if($group_line=~/^GO_id: $currGO/)

{ $flag=1; $det=1; }

elsif($flag==1 && $group_line=~/^is_a/)

{

$flag=0;

@is_a_list=split(/\s/,$group_line);

$num_is_a=scalar(@is_a_list)-2;

$k=$num_is_a;

$m=$m+$k-1;

for($j=0;$j<$k;$j++)



81

{

if($GOTree[$j] eq "0" || $GOTree[$j] eq "$currPath")

{ $GOTree[$j]="$currPath $is_a_list[($num_is_a+2)-($k-

$j)]"; }

else

{ $k++; next; }

}

$currGO=lastterm($GOTree[$i]);

last;

}

}

if($det==0)

{ unshift(@GOTree,"0"); last; }

}

}

for($i=$j;$j<15;$j++)

{ pop @GOTree; }

return @GOTree;

}



#======================================================================

============================

# Subroutine to obtain the last term in any string

#======================================================================

============================



sub lastterm

{

my($str)=@_;

my(@substr)=split(/\s/,$str);

my $numstr=scalar(@substr);

return $substr[$numstr-1];

}









82



Related docs
Other docs by yurtgc548
项目概述
Views: 0  |  Downloads: 0
雅比斯的禱告The Prayer of Jabez
Views: 0  |  Downloads: 0
無投影片標題
Views: 0  |  Downloads: 0
温故校园
Views: 0  |  Downloads: 0
没有幻灯片标题
Views: 0  |  Downloads: 0
氫能源
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!