Docstoc

A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH

Document Sample
A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH Powered By Docstoc
					A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH

    FOR THE PREDICTION OF PROTEIN FUNCTIONS




                        A Thesis

                      Presented to

     The Graduate Faculty of The University of Akron




                  In Partial Fulfillment

           of the Requirements for the Degree

                   Master of Science




                    Saket Kharsikar

                      August 2007
            A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH

                FOR THE PREDICTION OF PROTEIN FUNCTIONS




                             Saket Kharsikar



                                 Thesis



Approved:                            Accepted:

____________________                ____________________
Advisor                             Department Chair
Dr. Dale H. Mugler                  Dr. Daniel B. Sheffer


____________________                ____________________
Co-Advisor                          Dean of the College
Dr. Zhong-Hui Duan                  Dr. George K. Haritos


____________________                ____________________
Committee Member                    Dean of the Graduate School
Dr. Francisco Moore                 Dr. George R. Newcome


                                    ____________________
                                    Date



                                   ii
                                          ABSTRACT


Numerous genome projects have produced a large and ever increasing amount of

genomic sequence data. However, the biological functions of many proteins encoded by

the sequences remain unknown. Protein function annotation and prediction become an

essential and challenging task of post-genomic research. In this research, we present an

automated protein function prediction system based on a set of proteins of known

biological functions. The functions of the proteins are characterized with Gene Ontology

(GO) annotations. The prediction system uses a novel measure to calculate the pair-wise

overall similarity between protein sequences. The protein function prediction is

performed based on the GO annotations of similar sequences using a weighted k-nearest

neighbor method. We show the prediction accuracies obtained using the model organism

yeast (Sacchyromyces cerevisiae). The results indicate that the weighted k-nearest

neighbor method significantly outperforms the regular k-nearest neighbor method for

protein biological function prediction.




                                             iii
                                ACKNOWLEDGEMENTS


       First and foremost, I would like to thank my advisor, Dr Dale Mugler for his

constant support and co-operation all through my Master’s Thesis.

       I would also like to thank my co-advisor, Dr Zhong-Hui Duan for giving me the

opportunity to work on this research project and her invaluable inputs in the entire course

of the project. This thesis would not have been possible without her guidance and

persistent help.

       A special thanks to my committee- Dr Daniel Sheffer and Dr Francisco Moore for

their time and effort and especially for their invaluable suggestions in biostatistics.

       I would like to take a chance to thank my project group Gayatri Paknikar,

Santhosh Tadavai, Neeraj Mogla and Satish Sangem for their help in one of the modules

of my thesis.

       Lastly, I would like to express my gratitude towards my parents and all my

friends for their faith and who were always there for me all through the progress of my

thesis and eventually my degree.




                                              iv
                                              TABLE OF CONTENTS
                                                                                                                              Page

LIST OF TABLES........................................................................................................... viii

LIST OF FIGURES ........................................................................................................... ix

CHAPTER

I.     INTRODUCTION ......................................................................................................1

       1.1 Proteomics............................................................................................................1

       1.2 New Biomedical Opportunities............................................................................3

               1.2.1      Drug Development.................................................................................4

               1.2.2      Infectious Diseases.................................................................................4

               1.2.3      Heart Disease .........................................................................................5

               1.2.4      Cancer ....................................................................................................6

               1.2.5      Neurological Disorders ..........................................................................6

       1.3 Objectives of the project ......................................................................................6

       1.4 Hypothesis............................................................................................................7

II.    LITERATURE REVIEW ...........................................................................................8

       2.1 Protein function prediction ..................................................................................8

               2.1.1      Existing approaches for function prediction ..........................................9

       2.2 Sequence Alignment and Similarity ..................................................................12

               2.2.1      Global Alignment.................................................................................13


                                                                v
              2.2.2      Local Alignment ..................................................................................13

              2.2.3      Semi-global Alignment........................................................................13

              2.2.4      Methods for alignment.........................................................................14

      2.3 Gene Ontology ...................................................................................................17

              2.3.1      Formation of GO..................................................................................17

              2.3.2      Three Categories of Gene Ontology ....................................................18

              2.3.3      Standardized format .............................................................................22

              2.3.4      GO is the way to go .............................................................................22

      2.4 Data Classification .............................................................................................23

      2.5 Microarrays and Gene Expression Data ............................................................24

              2.5.1      Microarray technology.........................................................................24

              2.5.2      Gene Expression Data..........................................................................28

      2.6 Data Clustering ..................................................................................................31

              2.6.1      Hierarchical clustering: agglomerative ................................................32

              2.6.2      Hierarchical clustering: divisive ..........................................................33

              2.6.3      Non-hierarchical clustering..................................................................34

              2.6.4      Predicting protein interactions and protein functions ..........................36

III. MATERIALS AND METHODS..............................................................................38

      3.1 Data set...............................................................................................................38

              3.1.1      Complete yeast proteome.....................................................................38

              3.1.2      GO Annotations ...................................................................................44

              3.1.3      Yeast Gene Expression data.................................................................46

      3.2 Data pre-processing ...........................................................................................48

                                                              vi
      3.3 Sequence similarity scoring ...............................................................................49

              3.3.1     BLAST Algorithm ...............................................................................49

      3.4 Classification......................................................................................................51

      3.5 Gene Expression Correlation and Clustering.....................................................54

      3.6 A Novel Approach .............................................................................................56

IV. RESULTS AND DISCUSSIONS.............................................................................58

V.    CONCLUSIONS.......................................................................................................71

      REFERENCES .........................................................................................................73

      APPENDIX IMPLEMENTATION OF WEIGHTED KNN ....................................77
      CLASSIFICATION ALGORITHM




                                                             vii
                                              LIST OF TABLES

Table                                                                                                              Page

2.1   Representation of gene expression data ...................................................................29

3.1   Line types and line codes for Swiss-Prot sequence entry .........................................42

3.2   Yeast gene expression data .......................................................................................47

4.1   P-values for test of percentage accuracies ................................................................64

4.2   Cluster details for 4 X 3 clustering scheme ..............................................................67

4.3   Cluster details for 2 X 3 clustering scheme ..............................................................68

4.4   Cluster details for 2 X 5 clustering scheme ..............................................................69




                                                         viii
                                              LIST OF FIGURES

Figure                                                                                                              Page

1.1   Growth of GenBank database over the period 1982 – 2005.......................................2

2.1   Examples showing global and local alignment between two sequences ..................12

2.2   Cellular Component GO nodes.................................................................................20

2.3   Biological Process GO nodes....................................................................................20

2.4   Molecular Function GO nodes..................................................................................21

2.5   A schematic representation of a microarray chip .....................................................26

2.6   Schematic for the experimental setup for any microarray gene ..............................27
      expression analysis

2.7   Various clustering methods.......................................................................................32

2.8   Self-Organizing Maps...............................................................................................35

3.1   BLAST Algorithm ....................................................................................................49

3.2   The graphical representation of a kNN Classifier algorithm with k=5.....................53

3.3   Schematic for the novel approach.............................................................................56

4.1   Numbers of GO groups at different levels of the gene ontologies ...........................59

4.2   Average size of GO groups at different levels of the ontologies..............................59

4.3   GO Tree for molecular function ontology ................................................................61

4.4   Percentage Accuracy at first level.............................................................................63

4.5   Percentage Accuracy at second level........................................................................63


                                                           ix
4.6   Percentage Accuracy at third level ...........................................................................64

4.7 GeneCluster – Web tool for clustering based on SOM.............................................65

4.8   12 clusters from a 4 X 3 clustering scheme ..............................................................66

4.9   Percentage accuracy for 4 X 3 clustering scheme ....................................................67

4.10 Percentage accuracy for 2 X 3 clustering scheme ....................................................68

4.11 Percentage accuracy for 2 X 5 clustering scheme ....................................................69




                                                         x
                                      CHAPTER I

                                   INTRODUCTION


       Biomedical research over the last couple of years has made tremendous progress

in our understanding of biology and medicine. The recent genomic sequencing of human,

mouse and other organisms and other high-throughput studies based on microarray

technology have been generating massive amounts of sequence data. Faced with the

avalanche of genomic sequences and gene expression data, biological scientists are

confronting a challenging prospect: piles of information but only flakes of knowledge.



1.1 Proteomics

       Proteins are the major components of living organisms and constitute more than

25% by dry weight of a typical cell. These proteins perform a variety of functions like

catalysis, immune recognition, cell adhesion, signal transduction, sensory capabilities,

transport, movement and cell organization. Chemically, proteins are linearly oriented

hetero-polymers of amino acids; determined by the sequence of nucleotides in the

corresponding gene. Proteins are the main catalysts, structural elements, signaling

messengers and molecular machines of the biological tissues. Functional proteomics

targets learning the functions of protein molecules. All the primary functional knowledge

about the proteins comes from some kind of biochemical, genetic or structural laboratory


                                            1
experiment on an individual protein. But with the explosion of protein sequence

information and an outburst of the amount of new proteins being sequenced and isolated,

it is not feasible to perform experiments on each and every protein individually. Hence it

is necessary to develop an expansive system for the functional characterization of the

proteins. Scientists are looking at other computational approaches to tackle the problem.




          Fig. 1.1 Growth of GenBank database over the period 1982 – 2005
 Courtesy GenBank Statistics at National Center for Biotechnology Information (NCBI)


       There have been some computational approaches already suggested to predict

functions using the ever increasing abundance of sequence data produced by high-

throughput means like mass spectrometry and microarray gene expression profiling. The

major challenge for biologists is to use this genetic information available from the

genome sequencing projects - not just to decode the amino acid sequences but also to find

                                            2
out their functions. Proteomics based approaches initially use computer-based similarity

searches against proteins of known function and the results may allow some broad

inferences to be made about the possible functions, which can then be explored

experimentally.

       Once a function of an individual protein is found and has been assigned to it, one

can search for other proteins whose amino acid sequences are similar to that of the

original protein whose function is already known. This approach is known as the

homology method and is widely used to extend the knowledge of protein function from

one protein to its cousin proteins, which are presumed to be descended from the same

common ancestral protein [1]. In sequence similarity based approaches, the underlying

assumption is that sequence similarity implies functional similarity. This assumption is

true in many cases. This makes it a popular and widely used technique although it has

been noted that matching sequences do not always infer similar functions and also that

sometimes proteins with dissimilar sequences do possess similar functions.



1.2 New Biomedical Opportunities

       The greatest influence of proteomics-based approaches on biomedical research

has not yet been encountered; however, new progress is being achieved. Such approaches

offer a potential in resolving the complex biological intricacies such as the nature of

particular molecular complexes or pathways in disease pathogenesis [2].




                                           3
1.2.1 Drug Development

       Drug development is generally associated with the desire to enhance or diminish a

specific activity involved in disease pathogenesis or treatment-involved side-effects.

Most of the drugs employ their effects on the proteins. The strategy of working forward

from the gene is used wherein a specific genetic lesion is identified and the consequent

modifications in protein structure, function and expression are clarified, so that a drug

can be rationally designed to compensate or fix such aberrations. The challenge in

proteomics-based approaches still remains in identifying the target molecules. Cell-

mapping proteomics and expression proteomics are some of the tools used for target

analysis. In addition to target selection, target validation and toxicity studies are also

important.



1.2.2 Infectious diseases

       Identification of proteins produced by microorganisms is aided by the small

number of genes and the completion of genome sequencing for many microorganisms.

The aim for these studies has been the search for new diagnostic markers, candidate

antigens for vaccines and determination of virulence. Myobacterium tuberculosis is

probably the most studied microorganism; causing tuberculosis and leading to more than

2 million deaths each year [3]. So an international priority is to improve diagnosis and

develop a more effective vaccine for the same. The findings from the experiments

involving examination of genetic differences between virulent M tuberculosis, virulent M

bovis and the avirulent M bovis BCG are not only useful in identifying potential



                                            4
regulators of virulence but may also provide the basis for discriminating BCG

immunization and presence of virulent tuberculosis [3].



1.2.3 Heart disease

       Heart disease, resulting from systemic disease or specific heart-muscle disease, is

one of the leading causes of morbidity and mortality in more developed countries. The

pathogenesis of cardiac dysfunction is still considerably unknown. But a proteomics-

based approach in identifying overall changes in protein expression in heart disease and

heart failure may provide new insights into the cellular mechanisms involved in cardiac

dysfunction. These approaches will further constitute new diagnostic markers and

therapeutic opportunities. The proteomic approach so far has been on dilated

cardiomyopathy, which certainly has multiple causes. The combined results of all these

studies [4-6] have shown that the expression of some 100 cardiac proteins is significantly

different from normal in dilated cardiomyopathy and most of the proteins are less

abundant in the diseased than the normal heart. Most of the identified proteins can be

grouped into three broad functional categories: cytoskeletal and myofibrillar proteins,

proteins associated with mitochondria and energy production, and proteins associated

with stress responses. In addition to the global proteomic approach, cardiac antigens that

elicit specific antibody responses in vivo are being identified. Several of these antigens

may be involved in processes of acute and chronic rejection after cardiac transplantation

and are currently being investigated as potential non-invasive markers [7].




                                            5
1.2.4 Cancer

       Although proteomics-based studies are chiefly concerned with the identification

of novel antigens or markers for prognostic, diagnostic or therapeutic use, molecules and

processes implicated in carcinogenesis are also being investigated increasingly. Most

tumor markers in current use were identified using protein-based approaches dating back

to the 1800s. Proteomics-based studies of many tumor types are now underway and they

are likely to benefit from the use of complimentary techniques such as laser capture

micro-dissection to isolate malignant cells for electrophoretic analysis, thus facilitating

the discovery of tumor-specific proteins [8].



1.2.5 Neurological disorders

       Alzheimer’s disease is the most common single cause of dementia. Amino acid

sequencing and identification of common protein fragments form the basis for cloning of

the Amyloid Precursor Protein (APP) gene- a component of the plaque found in the

brains of Alzheimer’s disease patients [9, 10]. Many studies of APP have used two-

dimensional electrophoretic and mass spectrometric techniques and is likely to remain a

principal approach in the future.



1.3 Objectives of the project

   1. To study the correlation between protein sequence similarity and function

       similarity for the yeast proteome in the context of molecular function ontology.

   2. To study the accuracy of function prediction in case of the kNN and the weighted

       kNN classifier algorithms.

                                                6
    3. To study the relation between gene expression patterns and the functions of the

        proteins using microarray data.

    4. To design a novel approach by combining the sequence similarity score and the

        gene expression correlation coefficient to predict the protein function.

    5. To study the effect of clustering of gene expression data on the performance of

        the weighted kNN classifier.



1.4 Hypothesis

Null Hypothesis:

The means of the percentage prediction accuracies for the classical kNN and the means

for the weighted kNN classification algorithm calculated for different parameter settings

are not significantly different, i.e. μ1 = μ2

The performance of the classical kNN classification algorithm based on the prediction

accuracy is not significantly different from that of the weighted kNN classification

algorithm.



Alternate Hypothesis:

The means of the percentage prediction accuracies for the classical kNN and the means

for the weighted kNN classification algorithm calculated for different parameter settings

are significantly different, i.e. μ1 ≠ μ2

The performance of the classical kNN classification algorithm based on the prediction

accuracy is significantly different from that of the weighted kNN classification algorithm.



                                                7
                                        CHAPTER II

                                  LITERATURE REVIEW


2.1 Protein Function Prediction

       The early approaches to predicting protein function were experimental and

usually focused on a specific target gene or protein, or a small set of proteins forming

natural groups such as protein complexes. However, irrespective of the details, these

approaches are low-throughput because of the huge experimental and human effort

required in analyzing a single gene or protein. This has resulted in a continually

expanding sequence-function gap for the discovered proteins. In an attempt to close this

gap, numerous high-throughput experimental procedures have been invented to

investigate the mechanisms leading to the accomplishment of a protein’s function [11].

These procedures have generated a wide variety of useful data that ranges from simple

protein sequences to complex high-throughput data. These data offer different types of

insights into a protein’s function and related concepts. Furthermore, recent years have

seen the recording of this data in very standardized and professionally maintained

databases; some of them listed under.

   1. GenBank at the National Center of Biotechnology Information, National Library

       of Medicine, Washington D.C. accessible from

                    http://www.ncbi.nlm.nih.gov/Entrez


                                            8
   2. European Molecular Biology Laboratory (EMBL) Outstation at Hixon, England

                     http://www.ebi.ac.uk/embl/index.html

   3. DNA Databank of Japan (DDBJ) at Mishima, Japan

                     http://www.ddbj.nig.ac.jp/

   4. Protein International Resource (PIR) database at the National Biomedical

       Research Foundation in Washington D.C., an annotated database

                     http://www-nbrf.georgetown.edu/pirwww/

   5. The SwissProt protein sequence database at ISREC, Swiss Institute for

       Experimental Cancer Research in Epalinges/Lausanne, an annotated database

                     http://www.expasy.ch/cgi-bin/sprot-search-de

   6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute

                       http://srs6.ebi.ac.uk



2.1.1 Existing approaches for function prediction

       In the domain of automated function prediction, sequences have been heavily

utilized, in both direct homology-based and indirect subsequence and feature-based

approaches. Specifically, techniques that predict protein function from sequence can be

categorized into three classes, namely, sequence homology based approaches,

subsequence based approaches and feature based approaches.

Sequence homology-based approaches:

In homology-based approaches, proteins are said to have a common ancestry if they have

a high similarity score for their amino acid sequences. Sequence alignment is performed

to get a similarity score and a prediction model is built using these entire sequences.

                                               9
Subsequence-based approaches:

       Many times not the entire protein sequences, but only some segments of it are

important for determining the function of a given protein. Consequently, the approaches

in this category treat these segments or subsequences as features of a protein sequence

and construct models for the mapping of these features to protein function. These models

are then used to predict the function of a query protein.



Feature-based approaches:

       The feature-based approaches attempt to exploit the perspective that the amino

acid sequence is a unique characterization of a protein, and determine several of its

physical and functional features. These features are used to construct a predictive model

which can map the feature-value vector of a query protein to its function.



       There are some computational approaches that have been developed to use the

sequence data for protein function prediction. These methods were tested on the

Saccharomyces Genome Database and also incorporated the Gene Ontology (GO)

annotations for prediction.

       Cluster analysis of the gene-expression profiles is a common approach for

predicting functions based on the assumption that genes with similar functions are likely

to be co-expressed. Pavlidis et al. [12] considered the problem of inferring gene

functional classifications from heterogeneous datasets consisting of DNA microarray

expression measurements and phylogenetic profiles from the whole genome sequence

comparisons. They demonstrated the application of the support vector machines (SVM)

                                             10
learning algorithm to this functional inference task. Palvidis et al. used a collection of

DNA microarray hybridization experiment data for 2465 yeast genes along with a

phylogenetic profile for each gene and demonstrated the use of SVM learning algorithm

for heterogeneous data..

       Using protein-protein interaction data to assign function for novel proteins is

another approach. Proteins often interact with one another in an interaction network to

achieve a common objective. It is therefore possible to infer the functions of proteins

based on the functions of their interaction partners, also known as “guilt by association”.

Schwikowski et al. [13] applied a neighbor-counting method in predicting the function.

They assigned function to an unknown protein based on the frequencies of its neighbors

having certain functions.

       Hishigaki et al. [14] enhanced the performance of the neighbor counting method

based on the protein interaction networks by using χ2 statistics values instead of the raw

number of the neighboring members. Both these approaches give equal significance to

all the functions contributed by the protein neighbors in the interaction network.

Other function prediction methods using high-throughput data include machine-learning

approach and Markov random fields. Clare and King [15] used the C4.5 machine learning

algorithm which was adapted to the analysis of phenotype. The output of the algorithm is

a decision tree or equivalently a set of symbolic rules which allow the output to be

interpreted and compared with existing biological data.

       Letovsky and Kasif [16] developed a method of assigning functions based on a

probabilistic analysis of graph neighborhoods in a protein-protein interaction network.

The method exploits the fact that graph neighbors are more likely to share functions than

                                            11
nodes which are not neighbors and a binomial model of local neighbor function labeling

probability is combined with a Markov random field propagation algorithm to assign

function probabilities for proteins in the network.

       MAGIC (Multisource Association of Genes by Integration of Clusters) approach

to combine heterogeneous data for function assignment has been applied in yeast by

Troyanskaya et al. [17]. MAGIC is a general framework that uses formal Bayesian

reasoning to integrate heterogeneous types of high-throughput biological data for

accurate gene function prediction.



2.2 Sequence Alignment and Similarity

       Sequence alignment is the procedure of comparing two (pair-wise alignment) or

more (multiple sequence alignment) DNA or protein sequences by searching for a series

of individual characters or character patterns that are in the same order in the sequences.

In an optimal alignment, non-identical characters and gaps are placed to bring as many

identical or similar characters as possible into the vertically matching registers.

Sequences that can be readily aligned in this manner are said to be similar.

There are two main types of pair-wise sequence alignments- global and local.




     Fig. 2.1 Examples showing global and local alignment between two sequences
                            Courtesy NCBI BLAST [40].

                                             12
2.2.1 Global Alignment

       The global alignment is stretched over the entire sequence length to include as

many amino acids as possible up to and including the sequence ends. A global alignment

is made possible by including gaps either within the middle of the alignment or at either

end of one or both sequences. A global alignment might misalign some areas so that more

amino acids along the entire sequence length can be matched. Sequences that are quite

similar and approximately the same length are suitable candidates for global alignment.

The Needleman-Wunsch algorithm is used to produce global alignments between pairs of

DNA or protein sequences.



2.2.2 Local Alignment

       In a local alignment, the alignment stops at the ends of regions of strong similarity

and a much higher priority is given to finding these local regions than to extending the

alignment to include more neighboring amino acid pairs. This type of alignment favors

finding conserved nucleotide patterns in DNA sequences or amino acid domains in

protein sequences. Local alignments are more suitable for aligning sequences that are

similar along some of their lengths but dissimilar in others, sequences that differ in length

or sequences that share a conserved region or domain. The Smith-Waterman algorithm is

used to produce local alignments between pairs of DNA or protein sequences.



2.2.3 Semi-global Alignment

       Hybrid methods, known as semi-global or "glocal" methods, attempt to find the

best possible alignment that includes the start and end of one or the other sequence. This

                                             13
can be especially useful when the downstream part of one sequence overlaps with the

upstream part of the other sequence. In this case, neither global nor local alignment is

entirely appropriate: a global alignment would attempt to force the alignment to extend

beyond the region of overlap, while a local alignment might not fully cover the region of

overlap.



       Sequence alignment is useful for discovering functional, structural and

evolutionary information in DNA or protein sequences. It is important to obtain the best

possible, so-called “optimal” alignment to discover this information. Sequences that are

very much alike, probably have the same function, be it a regulatory role in the case of

similar DNA molecules or a similar biochemical function and three-dimensional structure

in the case of proteins. Also, if two sequences from different organisms are similar, there

may have been a common ancestor sequence; the sequences are then defined as being

homologous. Their alignment indicates the changes that could have occurred between the

homologous sequences and a common ancestor sequence during evolution. With the

advent of genome analysis and large scale sequence comparisons it becomes important to

recognize that sequence similarity may be an indicator of several possible types of

ancestor relationships; or there may not exist any ancestor relationship at all.



2.2.4 Methods for alignment

Alignment of two sequences is performed using the following methods [18].




                                             14
Dot Matrix analysis

       A dot matrix analysis is primarily a method for comparing two sequences to look

for possible alignment of characters between the sequences. Since this method displays

graphically any possible sequence alignments as diagonals on the matrix, it can readily

reveal the presence of insertions/deletions. The method is also used for direct or inverted

repeats in protein and DNA sequences and for predicting regions in RNA that are self-

complimentary.

       The major advantage of this method is that all possible matches of residues

between two sequences are found, leaving the investigator the choice of identifying the

most significant ones by examination of the dot matrix for long runs of matches, which

appear as diagonals. The major limitation of this method is that most of the dot matrix

computer programs do not show an actual alignment of the sequences.



Dynamic Programming approach

       The dynamic programming approach; first used for global alignment of protein

sequences by Needleman and Wunsch (1970) and for local alignment by Smith and

Waterman (1981); provides one or more alignments of the sequences. The algorithm

generates an alignment by starting at the ends of two sequences and attempting to match

all possible pairs of characters between the sequences and by following a scoring scheme

for matches, mismatches and gaps. This procedure generates a matrix of numbers that

represents all possible alignments between the sequences, where the highest set of

sequential scores in the matrix defines an optimal alignment.



                                            15
       For proteins, an amino acid substitution matrix, such as the Dayhoff Percent

Accepted Mutation Matrix 250 (PAM250) or Blosum Substitution Matrix (BLOSUM62)

is used to score matches and mismatches. Similar matrices are available for aligning

DNA sequences.

       The dynamic programming algorithm is guaranteed in a mathematical sense to

provide the optimal alignment for a given set of user-defined scoring matrix and gap

penalties. However the algorithm can be slow due to the very large number of

computational steps required, which increase approximately as the square or cube of the

sequence lengths.



Heuristic Methods

       The word or k-tuple methods are used by the FASTA and BLAST algorithms,

which align two sequences very quickly by first searching for identical short stretches of

sequences (called words or k-tuples) and then joining these words into an alignment by

the dynamic programming method. These methods are fast enough to be suitable for

searching an entire database for the sequence that aligns best with an input query

sequence. The FASTA and BLAST methods are heuristic; that is an empirical method of

computer programming in which rules of thumb are used to find solutions and feedback

is used to improve performance. These methods are reliable in a statistical sense and

usually provide the best scoring alignment that is possible.




                                            16
2.3 Gene Ontology

2.3.1 Formation of GO

       The exponential growth in the availability of molecular sequences, particularly

entire genomic sequences, has transformed both the theory and practice of experimental

biology and medicine. There was a time when the biologists and other research scientists

in the medical field characterized proteins by their diverse activities and abundances and

geneticists classified genes based on the phenotypes of their mutations. But today all

biologists and the other related research community acknowledge that it is likely that

there is just a single limited universe of genes and proteins, most of which are conserved

in most of the living cells. This acknowledgement has fueled a grand unification of

biology. Information about the biological role of such a shared protein in one organism

can certainly shed some light and also at times provide concrete inference of its role in

some other organism.

       This progress in the approach that the scientific community describes and

conceptualizes shared biological elements has not been able to keep up with high

throughput sequencing. As a result, the current systems of nomenclature for genes and

their products remain divergent even with the highly appreciated underlying similarities.

This has led to a lack of progress in the interoperability of genomic databases. This was

the basis of the formation of the Gene Ontology (GO) Consortium [19].

       The GO project is a collaborative effort to address the need for consistent

descriptions of gene products in different databases. The project began in 1998 as

collaboration between three model organism databases –

   1) FlyBase (Drosophila) - database for the fruitfly Drosophila melanogaster

                                           17
   2) Saccharomyces Genome Database (SGD) - database for the budding yeast

       Saccharomyces cerevisiae

   3) Mouse Genome Database (MGD) - databases for the mouse Mus musculus



       Since then, the GO Consortium has grown to include many databases, including

several of the world's major repositories for plant, animal and microbial genomes. The

GO project has developed three structured controlled vocabularies (ontologies) that

describe gene products in terms of their associated biological processes, cellular

components and molecular functions in a species-independent manner. There are three

separate aspects to this effort: first, the development and maintenance of the ontologies

themselves; second, the annotation of gene products, which entails making associations

between the ontologies and the genes and gene products in the collaborating databases;

and third, development of tools that facilitate the creation, maintenance and use of

ontologies [20].



2.3.2 Three Categories of Gene Ontology

The GO annotations are based on the following three principles.

   1. Cellular component

       Cellular component refers to the place in the cell where a gene product is active.

       Its ontology describes locations, at the levels of sub cellular structures and

       macromolecular complexes. Cellular component includes such terms as

       ‘ribosome’ or ‘proteasome’ specifying where multiple gene products would be

       found. It also includes terms such as ‘nuclear membrane’ or ‘Golgi apparatus’.

                                           18
   2. Biological process

       Biological process refers to a biological objective to which to which the gene or

       gene product contributes. A process is accomplished via one or more ordered

       assemblies of molecular functions. They often involve a physical or chemical

       transformation, meaning that something goes into a process and some other thing

       comes out of it. ‘Cell growth and maintenance’ or ‘signal transduction’ are some

       examples of high-level or broad biological process terms; whereas ‘translation’ or

       ‘pyrimidine metabolism’ are a couple of low-level or more specific biological

       process terms.



   3. Molecular function

       Molecular function refers to the biochemical activity of a gene product and also

       the capability that a gene product carries as a potential. These may include

       transporting things around, binding to things, holding things together and

       changing one thing into another. This is different from the biological processes

       the gene product is involved in, which involve more than one activity. Examples

       of broad functional terms are ‘enzyme’, ‘transporter’ or ‘ligand’. Examples of

       narrower functional terms are ‘adenylate cyclase’ or ‘toll receptor ligand’.



       Cellular components, biological processes and molecular functions are all

attributes of genes or gene products and each of them may be assigned independently. A

gene product might be associated with or located in one or more cellular components, it is

                                            19
Fig. 2.2 Cellular Component GO nodes
        Courtesy Xie et al. [24]




Fig. 2.3 Biological Process GO nodes
        Courtesy Xie et al. [24]

                20
                           Fig. 2.4 Molecular Function GO nodes
                                   Courtesy Xie et al. [24]



active in one or more biological processes, during which it performs one or more

molecular functions.

       The relationships between a gene product to cellular component, biological

process and molecular function are one-to-many, reflecting the biological reality that a

particular protein may function in several processes, contain domains that carry out

diverse molecular functions and participate in alternative interactions with other proteins,

organelles or locations in the cells.

       For example, the gene product ‘cytochrome c’ can be described by the molecular

function term ‘oxidoreductase activity’, the biological process terms ‘oxidative

phosphorylation’ and ‘induction of cell death’, and the cellular component terms

‘mitochondrial matrix’ and ‘mitochondrial inner membrane’.

                                            21
2.3.3 Standardized format

       The ontologies constituting GO are modeled as a generic category of graphs

known as directed acyclic graphs (DAGs), which have numerous applications in

computer science, such as Bayesian networks and parse trees created by compilers. Each

node in these graphs represents a specific functional label and is assigned a unique GO id

of the form GO: XXXXXXX, and each edge represents either an is:a or a part:of

relationship. This well-defined structure makes GO easily usable by both humans and

computers [20].



2.3.4 GO is the way to go

       Since the establishment of GO, many ontology-based sequence annotation

approaches have been developed including several web-based automated GO annotation

software tools. These attempts typically involve a search of homologous proteins in GO-

mapped databases including Genbank and Swiss-Prot.

       Hennig et al’s GOblet [21] is a fully integrated local analysis system comprising

of a GO parser which maps the GO hierarchy onto a set of linearized trees. Every query

sequence generates a complete set of relevant GO Ids obtained from BLAST and this set

is used to build a summary tree. The total counts per GO Id are also given to identify the

significant GO terms.

       Zehetner’s OntoBlast [22] presents a list of homologues together with their GO

terms. It uses genome data from nine species preprocessed for this tool. In this tool

available online, the use of BLAST is prominent. E-values signifying the similarity of the



                                           22
gene names in the list are used along with weighting numbers to sort the term list. They

give an indication of how strong the evidence for a term is relative to the other terms.

       Martin et al’s GOtcha searches a set of seven model genomes and returns scored

matches [23].

       Xie et al’s [24] GO Engine combines sequence homology search based on GO

annotations with text information analysis and cellular localization prediction to increase

the annotation accuracy.

       Schug et al [25] developed a rule-based function prediction method based on the

intersection of GO terms that contain protein domain at different similarity levels.

       Abascal et al [26] presented an automatic annotation method based on protein

family identification.

       Jensen et al [27] used neural networks for the prediction while Vinayagam et al.

[28] used support vector machines to increase the performance of the prediction tool.

       The appeal of these approaches is that they can directly assign a biological

meaning to an uncharacterized protein sequence.



2.4 Data Classification

       Classification is defined as a procedure in which individual items are placed into

groups based on quantitative information on one or more characteristics inherent in the

items and based on a training set of previously labeled items.

The goal of classification is to build a set of models that can correctly predict the class of

the different objects. The input to these methods is a set of objects (training data), the

classes which these objects belong to (dependent variables) and a set of variables

                                             23
describing different characteristics of the objects (independent variables). Once such a

predictive model is built, it can be used to predict the class of the objects for which class

information is not known a priori. The key advantage of supervised learning methods

over unsupervised methods like clustering is that by having an explicit knowledge of the

classes the different objects belong to, these algorithms can perform an effective feature

selection if that leads to better prediction accuracy.



Some of the major classification models are listed below.

   1. Classification by decision tree induction

   2. Bayesian Classification

   3. Neural Networks

   4. Support Vector Machines (SVM)

   5. Classification Based on Associations

   6. kNN Classification



2.5 Microarrays and Gene Expression

2.5.1 Microarray technology

       The traditional approach to genomic research was focused on the collection and

local examination of data on single genes. Today, microarray technologies have made it

possible to monitor the expression levels for tens of thousands of genes in parallel. A

microarray works by exploiting the ability of a given mRNA molecule to bind

specifically to, or hybridize to, the DNA template from which it originated. By using an

array containing many DNA samples, scientists can determine, in a single experiment,

                                              24
the expression levels of hundreds or thousands of genes within a cell by measuring the

amount of mRNA bound to each site on the array. With the aid of a computer, the amount

of mRNA bound to the spots on the microarray is precisely measured, generating a

profile of gene expression in the cell.

       A microarray is typically a glass slide on to which DNA molecules are fixed in an

orderly manner at specific locations called spots (or features). A microarray may contain

thousands of spots and each spot may contain a few million copies of identical DNA

molecules that uniquely correspond to a gene as shown in Fig. 2.5. The DNA in a spot

may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond

to a gene. The spots are printed on to the glass slide by a robot or are synthesized by the

process of photolithography. Microarrays may be used to measure gene expression in

many ways, but one of the most popular applications is to compare expression of a set of

genes from a cell maintained in a particular condition (condition A) to the same set of

genes from a reference cell maintained under normal conditions (condition B).

       Fig. 2.6 gives a general picture of the experimental steps involved. First, RNA is

extracted from the cells. Next, RNA molecules in the extract are reverse transcribed into

cDNA by using an enzyme reverse transcriptase and nucleotides labeled with different

fluorescent dyes. For example, cDNA from cells grown in condition A may be labeled

with a red dye and from cells grown in condition B with a green dye. Once the samples

have been differentially labeled, they are allowed to hybridize onto the same glass slide.

At this point, any cDNA sequence in the sample will hybridize to specific spots on the

glass slide containing its complementary sequence. The amount of cDNA bound to a spot



                                            25
              Fig. 2.5 A schematic representation of a microarray chip [29]



will be directly proportional to the initial number of RNA molecules present for that gene

in both samples. Following the hybridization step, the spots in the hybridized microarray

are excited by a laser and scanned at suitable wavelengths to detect the red and green

dyes. The amount of fluorescence emitted upon excitation corresponds to the amount of

bound nucleic acid. For instance, if cDNA from condition A for a particular gene was in

greater abundance than that from condition B, one would find the spot to be red. If it was

the other way, the spot would be green. If the gene was expressed to the same extent in

both conditions, one would find the spot to be yellow, and if the gene was not expressed

                                           26
in both conditions, the spot would be black. Thus, what is seen at the end of the

experimental stage is an image of the microarray, in which each spot that corresponds to

a gene has an associated fluorescence value representing the relative expression level of

that gene.




Fig. 2.6 Schematic for the experimental setup for any microarray gene expression
analysis [29]


       The microarray is scanned following hybridization and a TIFF image file is

normally generated. Once image generation is completed, the image is analyzed to

identify spots. We saw that the relative expression level for a gene can be measured as

                                           27
the amount of red or green light emitted after excitation. The most common metric used

to relate this information is called expression ratio. It is denoted here as Tk and defined as:

                                                 Rk
                                          Tk =
                                                 Gk

for each gene k on the array, where Rk represents the spot intensity metric for the test

sample and Gk represents the spot intensity metric for the reference sample. This

expression ratio undergoes transformation and normalization to bring out the differential

gene expression [29].



2.5.2 Gene Expression Data

       The processed data, after the normalization procedure, can then be represented in

the form of a matrix, often called gene expression matrix; the different forms of which

are shown in Table 2.1. Each row in the matrix corresponds to a particular gene and each

column could either correspond to an experimental condition or a specific time point at

which expression of the genes has been measured. The expression levels for a gene

across different experimental conditions are cumulatively called the gene expression

profile, and the expression levels for all genes under an experimental condition are

cumulatively called the sample expression profile. Once we have obtained the gene

expression matrix, additional levels of annotation can be added either to the gene or to the

sample. For example, the function of the genes can be provided, or the additional details

on the biology of the sample may be provided, such as disease state or normal state.




                                              28
                 Table 2.1 Representation of gene expression data [29]



To make any meaningful comparison or biological analysis, one should know what the

data in the gene expression matrix represents. Expression data can be represented in five

different ways, which are described below:



   1. Absolute measurement

       In the case of an absolute measurement, each cell in the matrix will represent the

       expression level of the gene in abstract units.




                                             29
2. Relative measurement or expression ratio

   In the case of a relative measurement or representations involving expression

   ratio, the expression level of a gene in abstract units is normalized with respect to

   its expression in a reference condition. This gives the expression ratio of the gene

   in relative units. Note that in such cases, a ratio of 4000/100 will lead to the same

   result as 40/10.



3. log2(expression ratio)

   In the case of tables representing the log2(expression ratio) values, information on

   upregulation and down-regulation is captured and is mapped in a symmetric

   manner. For example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold

   down-regulation maps to log2(1/4) = -2.



4. Discrete values

   Another way of representing information is to convert to discrete numbers the

   values in the tables mentioned above. In the case of converting the absolute

   measurement to discrete numbers, a binary expression matrix of 1 and 0 can be

   used, where 1 means that the gene is expressed above a user defined threshold,

   and 0 means that the gene is expressed below this threshold. In the case of making

   the relative expression tables or log2 (expression ratio) tables discrete, values can

   be divided into 3 classes, +1, 0 and        –1, where +1 represents a gene that is

   positively regulated, 0 represents a gene that is not differentially regulated and –1

   represents a gene that is repressed.

                                          30
   5. Representation of expression profiles as vectors

       So far we have seen how individual cells in the gene expression matrix can be

       represented. Similarly, an expression profile (of a gene or a sample) can be

       thought of as a vector and can be represented in vector space. For example, an

       expression profile of a gene can be considered as a vector in n dimensional space

       (where n is the number of conditions), and an expression profile of a sample with

       m genes can be considered as a vector in m dimensional space (where m is the

       number of genes).



2.6 Data Clustering

       One of the goals of microarray data analysis is to cluster genes or samples with

similar expression profiles together to make meaningful biological inference about the set

of genes or samples. Clustering is one of the unsupervised approaches to classify data

into groups of genes or samples with similar patterns that are characteristic to the group.

Clustering methods can be hierarchical, that is grouping objects into clusters and

specifying relationships among objects in a cluster, resembling a phylogenetic tree or

non-hierarchical, that is grouping into clusters without specifying relationships between

objects in a cluster as schematically represented in Fig. 2.6.




                                             31
                           Fig. 2.7 Various clustering methods



2.6.1 Hierarchical clustering: agglomerative

       In the case of a hierarchical agglomerative clustering, the objects are successively

fused until all the objects are included. For a hierarchical agglomerative clustering

procedure, each object is considered as a cluster. The first step is the calculation of

pairwise distance measures for the objects to be clustered. Based on the pairwise

distances between them, objects that are similar to each other are grouped into clusters.

After this is done, pairwise distances between the clusters are re-calculated, and clusters

that are similar are grouped together in an iterative manner until all the objects are

included into a single cluster. This information can be represented as a dendrogram,

where the distance from the branch point is indicative of the distance between the two

clusters or objects. Comparison of clusters with another cluster or an object can be

carried out using four approaches

                                            32
1.   Single linkage clustering (Minimum distance)

     In single linkage clustering, distance between two clusters is calculated as the

     minimum distance between all possible pairs of objects, one from each cluster. This

     method has an advantage that it is insensitive to outliers. This method is also known

     as the nearest neighbor linkage.



2.   Complete linkage clustering (Maximum distance)

     In complete linkage clustering, distance between two clusters is calculated as the

     maximum distance between all possible pairs of objects, one from each cluster. The

     disadvantage of this method is that it is sensitive to outliers. This method is also

     known as the farthest neighbor linkage.



3.   Average linkage clustering

     In average linkage clustering, distance between two clusters is calculated as the

     average of distances between all possible pairs of objects in the two clusters.



4.   Centroid linkage clustering

     In centroid linkage clustering, an average expression profile (called a centroid) is

     calculated.




                                            33
2.6.2 Hierarchical clustering: divisive

       Hierarchical divisive clustering is the opposite of the agglomerative method,

where the entire set of objects is considered as a single cluster and is broken down into

two or more clusters that have similar expression profiles. After this is done, each cluster

is considered separately and the divisive process is repeated iteratively until all objects

have been separated into single objects.



2.6.3 Non-hierarchical clustering

     Non-hierarchical clustering requires predetermination of the number of clusters.

Non-hierarchical clustering then groups existing objects into these predefined clusters

rather than organizing them into a hierarchical structure.



1.   Non-hierarchical clustering: K-means

     K-means is a popular non-hierarchical clustering method. In K-means clustering,

     the first step is to arbitrarily group objects into a predetermined number of clusters.

     The number of clusters can be chosen randomly or estimated by first performing a

     hierarchical clustering of the data. Following this step, an average expression

     profile (centroid) is calculated for each cluster, this is called initialization. Next,

     individual objects are reattributed from one cluster to the other depending on which

     centroid is closer to the gene. This procedure of calculating the centroid for each

     cluster and re-grouping objects closer to available centroids is performed in an

     iterative manner for a fixed number of times, or until convergence.



                                            34
2.   Non-hierarchical clustering: Self Organizing Maps




                               Fig. 2.8 Self-Organizing Maps



     Self Organizing Maps (SOMs) work in a manner similar to K-means clustering

     (Fig. 2.8). In K-means clustering, one chooses the number of clusters to fit the data,

     whereas with SOM the first step is to choose the number and orientation of the

     clusters with respect to each other. For example, a two-dimensional grid of nodes

     could be the starting point. The grid is projected onto the expression space and each

     object is assigned a node that is nearest to it – this is called initialization. In the next

     step, a random object is chosen and the node (called a reference vector) which is in

     the neighborhood of the object is moved closer to it. The other nodes are moved to a

     small extent depending on how close they are to the object chosen. In successive

     iterations, with randomly chosen objects, the positions of the nodes are refined and

     the radius of neighborhood becomes confined. In this way, the grid of nodes

     (initially a two-dimensional grid) is deformed to fit the data. The advantage of this

     method, unlike K-means, is that SOM does not force the number of clusters to be

     equal to the number of starting nodes in the chosen grid. This is because some

     nodes may have no objects associated with them when the map is complete. Other
                                         35
     advantages of SOM include providing information on the similarity between the

     nodes, and the ability of SOM to produce reliable results even with noisy data.



2.6.4 Predicting protein interactions and protein functions

       Integrating expression data with other external information, for example

evolutionary conservation of proteins, have been used to predict interacting proteins,

protein complexes, and protein function. Work by Ge et al. [30] and Jansen and Gerstein

[31] have shown that genes with similar expression profiles are more likely to encode

proteins that interact. When this information is combined with evolutionary conservation

of proteins, meaningful predictions can be made. In a recent work by Teichmann and

Madan Babu [32], it was shown that proteins that are evolutionarily conserved in yeast

and worm and that have similar expression profiles in both organisms tend to be a part of

the same stable complex or interact physically. Noort et al [33] have also shown that the

encoded proteins of conserved, co-expressed gene pairs are highly likely to be part of the

same pathway. Such studies enable us to predict specific gene functions.

       Clustering techniques have proven to be helpful to understand gene function, gene

regulation, cellular processes, and subtypes of cells. Genes with similar expression

patterns (co-expressed genes) can be clustered together with similar cellular functions.

This approach may further help in understanding the functions of many genes for which

information has not been previously available. Furthermore, co-expressed genes in the

same cluster are likely to be involved in the same cellular processes, and a strong

correlation of expression patterns between those genes indicates co-regulation. Searching

for common DNA sequences at the promoter regions of genes within the same cluster

                                            36
allows regulatory motifs specific to each gene cluster to be identified and cis-regulatory

elements to be proposed. The inference of regulation through the clustering of gene

expression data also gives rise to hypotheses regarding the mechanism of the

transcriptional regulatory network. Finally, clustering different samples on the basis of

corresponding expression profiles may reveal sub-cell types which are hard to identify by

traditional morphology-based approaches.




                                           37
                                     CHAPTER III

                            MATERIALS AND METHODS


3.1 Data Set

        The crucial step in developing a good prediction system is always to obtain a

good data set. In this study, we used the complete yeast (Sacchyromyces cerevisiae)

proteome.

    Before we started designing the algorithm, we downloaded and studied three major

data files

    1. Yeast.dat, which contains the sequence information for the entire proteome

    2. GOGroups.dat, which contains the annotation details for the Gene Ontology

    3. YeastMicroarray.dat, which contains the gene expression data for a complete

        yeast cell-cycle



3.1.1 Complete yeast proteome

        The July 2005 version of the complete proteome data consisting of 6247 proteins

was downloaded from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) – Protein

Knowledgebase using the ExPASy (Expert Protein Analysis System) proteomics server

of the Swiss Institute of Bioinformatics (SIB) dedicated to the analysis of protein

sequences and structures [34].


                                           38
Swiss-Prot is an annotated protein sequence database, which was established in 1986 and

maintained collaboratively, since 1987, by the group of Amos Bairoch first at the

Department of Medical Biochemistry of the University of Geneva and now at the Swiss

Institute of Bioinformatics (SIB) and the EMBL Data Library (now the EMBL

Outstation- the European Bioinformatics Institute (EBI)).

           The Swiss-Prot Protein Knowledgebase consists of sequence entries. Each entry

corresponds to a single contiguous sequence as contributed to the bank or reported in the

literature. The entries are structured so as to be usable by human readers as well as

computer programs. The explanations, descriptions, classifications and other comments

are in English. Wherever possible, symbols familiar to biochemists, protein chemists and

molecular biologists are used. Each sequence entry is composed of lines. Different types

of lines, each with their own format, are used to record the various data that make up the

entry. Sequence entries are composed of different line types, each with its own format

[35].

For each sequence entry the core data consists of:

     •     The sequence data;

     •     The citation information (bibliographical references);

     •     The taxonomic data (description of the biological source of the protein).



A sample sequence entry is shown below.


ID       2A5D_YEAST     STANDARD;       PRT;  757 AA.
AC       P38903;
DT       01-FEB-1995 (Rel. 31, Created)
DT       01-OCT-1996 (Rel. 34, Last sequence update)
DT       10-MAY-2005 (Rel. 47, Last annotation update)
DE       Serine/threonine protein phosphatase 2A, 56 kDa regulatory subunit,
DE       delta isoform (PP2A, B subunit, B' delta isoform) (RTS1 protein) (SCS1

                                               39
DE   protein).
GN   Name=RTS1; Synonyms=SCS1; OrderedLocusNames=YOR014W; ORFNames=OR26.04;
OS   Saccharomyces cerevisiae (Baker's yeast).
OC   Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
OC   Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX   NCBI_TaxID=4932;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RX   MEDLINE=96271526; PubMed=8846889;
RA   Evangelista C.C. Jr., Rodriguez Torres A.M., Limbach M.P.,
RA   Zitomer R.S.;
RT   "Rox3 and Rts1 function in the global stress response pathway in
RT   baker's yeast.";
RL   Genetics 142:1083-1093(1996).
RN   [2]
RP   NUCLEOTIDE SEQUENCE, AND SUBCELLULAR LOCATION.
RX   MEDLINE=96009589; PubMed=7565713;
RA   Shu Y., Hallberg R.L.;
RT   "SCS1, a multicopy suppressor of hsp60-ts mutant alleles, does not
RT   encode a mitochondrially targeted protein.";
RL   Mol. Cell. Biol. 15:5618-5626(1995).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=S288c / FY1679;
RX   MEDLINE=97313270; PubMed=9169874;
RA   Dujon B., Albermann K., Aldea M., Alexandraki D., Ansorge W.,
RA   Arino J., Benes V., Bohn C., Bolotin-Fukuhara M., Bordonne R.,
RA   Boyer J., Camasses A., Casamayor A., Casas C., Cheret G.,
RA   Cziepluch C., Daignan-Fornier B., Dang D.V., de Haan M., Delius H.,
RA   Durand P., Fairhead C., Feldmann H., Gaillon L., Galisson F.,
RA   Gamo F.-J., Gancedo C., Goffeau A., Goulding S.E., Grivell L.A.,
RA   Habbig B., Hand N.J., Hani J., Hattenhorst U., Hebling U.,
RA   Hernando Y., Herrero E., Heumann K., Hiesel R., Hilger F., Hofmann B.,
RA   Hollenberg C.P., Hughes B., Jauniaux J.-C., Kalogeropoulos A.,
RA   Katsoulou C., Kordes E., Lafuente M.J., Landt O., Louis E.J.,
RA   Maarse A.C., Madania A., Mannhaupt G., Marck C., Martin R.P.,
RA   Mewes H.-W., Michaux G., Paces V., Parle-McDermott A.G., Pearson B.M.,
RA   Perrin A., Pettersson B., Poch O., Pohl T.M., Poirey R.,
RA   Portetelle D., Pujol A., Purnelle B., Ramezani Rad M., Rechmann S.,
RA   Schwager C., Schweizer M., Sor F., Sterky F., Tarassov I.A.,
RA   Teodoru C., Tettelin H., Thierry A., Tobiasch E., Tzermia M.,
RA   Uhlen M., Unseld M., Valens M., Vandenbol M., Vetter I., Vlcek C.,
RA   Voet M., Volckaert G., Voss H., Wambutt R., Wedler H., Wiemann S.,
RA   Winsor B., Wolfe K.H., Zollner A., Zumstein E., Kleine K.;
RT   "The nucleotide sequence of Saccharomyces cerevisiae chromosome XV.";
RL   Nature 387:98-102(1997).
RN   [4]
RP   FUNCTION, SUBUNIT, AND PHOSPHORYLATION.
RX   MEDLINE=97299674; PubMed=9154823;
RA   Shu Y., Yang H., Hallberg E., Hallberg R.;
RT   "Molecular genetic analysis of Rts1p, a B' regulatory subunit of
RT   Saccharomyces cerevisiae protein phosphatase 2A.";
RL   Mol. Cell. Biol. 17:3242-3253(1997).
CC   -!- FUNCTION: The B regulatory subunit might modulate substrate
CC       selectivity and catalytic activity, and also might direct the
CC       localization of the catalytic enzyme to a particular subcellular
CC       compartment.
CC   -!- FUNCTION: Multicopy suppressor of ROX3 and HSP60.
CC   -!- SUBUNIT: PP2A consists of a common heterodimeric core enzyme,
CC       composed of a 36 kDa catalytic subunit (subunit C) and a 65 kDa
CC       constant regulatory subunit (PR65 or subunit A), that associates
CC       with a variety of regulatory subunits. Proteins that associate
CC       with the core dimer include three families of regulatory subunits
CC       B (the R2/B/PR55/B55, R3/B''/PR72/PR130/PR59 and R5/B'/B56
CC       families), the 48 kDa variable regulatory subunit, viral proteins,
CC       and cell signaling molecules.
CC   -!- INTERACTION:
CC       P47135:JSN1; NbExp=1; IntAct=EBI-1931, EBI-9422;
CC       P53958:YNL042W; NbExp=1; IntAct=EBI-1931, EBI-28694;

                                           40
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic and nuclear.
CC   -!- PTM: Phosphorylated.
CC   -!- SIMILARITY: Belongs to the phosphatase 2A regulatory subunit B
CC       family.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; U06630; AAB38372.1; -; Genomic_DNA.
DR   EMBL; S79635; AAB35312.1; -; Genomic_DNA.
DR   EMBL; X87331; CAA60763.1; -; Genomic_DNA.
DR   EMBL; Z74922; CAA99203.1; -; Genomic_DNA.
DR   PIR; S54620; S54620.
DR   IntAct; P38903; -.
DR   GermOnline; 143602; -.
DR   Ensembl; YOR014W; Saccharomyces cerevisiae.
DR   SGD; S000005540; RTS1.
DR   GO; GO:0005935; C:bud neck; IDA.
DR   GO; GO:0005737; C:cytoplasm; IDA.
DR   GO; GO:0005634; C:nucleus; IDA.
DR   GO; GO:0000159; C:protein phosphatase type 2A complex; TAS.
DR   GO; GO:0005816; C:spindle pole body; IDA.
DR   GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.
DR   GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.
DR   InterPro; IPR008938; ARM.
DR   InterPro; IPR002554; B56.
DR   Pfam; PF01603; B56; 1.
KW   Complete proteome; Hydrolase; Nuclear protein; Phosphorylation;
KW   Protein phosphatase.
FT   COMPBIAS      16     22      Poly-Ser.
FT   COMPBIAS      46     51      Poly-Ser.
FT   COMPBIAS      98    110      Poly-Ser.
FT   COMPBIAS     143    147      Poly-Ser.
FT   COMPBIAS     202    213      Poly-Asn.
FT   CONFLICT      95     95      T -> S (in Ref. 2).
FT   CONFLICT     529    529      L -> F (in Ref. 1).
FT   CONFLICT     581    581      A -> R (in Ref. 1).
SQ   SEQUENCE    757 AA; 85335 MW; 5A7476C30140331C CRC64;
     MMRGFKQRLI KKTTGSSSSS SSKKKDKEKE KEKSSTTSST SKKPASASSS SHGTTHSSAS
     STGSKSTTEK GKQSGSVPSQ GKHHSSSTSK TKTATTPSSS SSSSRSSSVS RSGSSSTKKT
     SSRKGQEQSK QSQQPSQSQK QGSSSSSAAI MNPTPVLTVT KDDKSTSGED HAHPTLLGAV
     SAVPSSPISN ASGTAVSSDV ENGNSNNNNM NINTSNTQDA NHASSQSIDI PRSSHSFERL
     PTPTKLNPDT DLELIKTPQR HSSSRFEPSR YTPLTKLPNF NEVSPEERIP LFIAKVDQCN
     TMFDFNDPSF DIQGKEIKRS TLDELIEFLV TNRFTYTNEM YAHVVNMFKI NLFRPIPPPV
     NPVGDIYDPD EDEPVNELAW PHMQAVYEFF LRFVESPDFN HQIAKQYIDQ DFILKLLELF
     DSEDIRERDC LKTTLHRIYG KFLSLRSFIR RSMNNIFLQF IYETEKFNGV AELLEILGSI
     INGFALPLKE EHKVFLVRIL IPLHKVRCLS LYHPQLAYCI VQFLEKDPLL TEEVVMGLLR
     YWPKINSTKE IMFLNEIEDI FEVIEPLEFI KVEVPLFVQL AKCISSPHFQ VAEKVLSYWN
     NEYFLNLCIE NAEVILPIIF PALYELTSQL ELDTANGEDS ISDPYMLVEQ AINSGSWNRA
     IHAMAFKALK IFLETNPVLY ENCNALYLSS VKETQQRKVQ REENWSKLEE YVKNLRINND
     KDQYTIKNPE LRNSFNTASE NNTLNEENEN DCDSEIQ
//



Each line begins with a two-character line code, which indicates the type of data

contained in the line. The current line types and line codes and the order in which they

appear in an entry, are shown in the Table 3.1.


                                            41
Line code       Content                              Occurrence in an entry
ID              Identification                       Once; starts the entry
AC              Accession number(s)                  Once or more
DT              Date                                 Three times
DE              Description                          Once or more
GN              Gene name(s)                         Optional
OS              Organism species                     Once
OG              Organelle                            Optional
OC              Organism classification              Once or more
OX              Taxonomy cross-reference             Once
RN              Reference number                     Once or more
RP              Reference position                   Once or more
RC              Reference comment(s)                 Optional
RX              Reference cross-reference(s)         Optional
RG              Reference group                      Once or more (Optional if RA line)
RA              Reference authors                    Once or more (Optional if RG line)
RT              Reference title                      Optional
RL              Reference location                   Once or more
CC              Comments or notes                    Optional
DR              Database cross-references            Optional
KW              Keywords                             Optional
FT              Feature table data                   Optional
SQ              Sequence header                      Once
(blanks)        Sequence data                        Once or more
//              Termination line                     Once; ends the entry

           Table 3.1 Line types and line codes for Swiss-Prot sequence entry [35]



The two-character line-type code that begins each line is always followed by three

blanks, so that the actual information begins with the sixth character. In general,

information is not extended beyond character position 75, there are however a few

exceptions where lines may be longer. Since the entire database follows the same

standardized format, it is computer readable and can be programmed to access this

database information and process it further for the analysis.

Out of this information about one protein sequence, only the information on


                                               42
     1. AC – accession number,

     2. DR – database cross-reference and

     3. SQ – sequence

is actually used for the prediction algorithm.



The AC (Acession number) line lists the accession number(s) associated with an entry.

The format of the AC line is:
AC    AC_number_1;[ AC_number_2;]...[ AC_number_N;]

An example of an accession number line is shown below:
AC    P00321;

Semicolons separate the accession numbers and a semicolon terminates the list. If the

information is unable to fit in a single line, more than one AC lines can be used.

The accession number is used by the prediction algorithm to uniquely identify each

protein and get its annotation from the Gene Ontology (GO).



The DR (Database cross-Reference) lines are used as pointers to information related to

entries and found in data collections other than Swiss-Prot.

The format of the DR line is:
DR                 DATABASE_IDENTIFIER;      PRIMARY_IDENTIFIER;     SECONDARY_IDENTIFIER[;

TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].

The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name

of the data collection to which reference is made. The second item on the DR line, the

'PRIMARY_IDENTIFIER', is an unambiguous pointer to the information entry in the

database to which the reference is being made. The third item on the DR line, the


                                             43
'SECONDARY_IDENTIFIER', is generally used to complement the information given

by the first identifier. A number of DR lines possess a fourth item, the

'TERTIARY_IDENTIFIER', which is generally used to give more detailed information.

Since we are using the GO annotation, we deal with the DR lines with the

'DATABASE_IDENTIFIER' as GO. For GO, the 'PRIMARY_IDENTIFIER' gives us the

primary GO id which gives is actually referenced in the GO database. The

'SECONDARY_IDENTIFIER' is either one of the characters C, F or P for the 3 principal

categories of the GO Database- C: cellular component; F: molecular function; and P:

biological process.
DR   GO; GO:0005816; C:spindle pole body; IDA.
DR   GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.
DR   GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.

Only molecular function (F) ontology is considered for the prediction model.



The SQ (sequence data) line has a line code consisting of two blanks rather than the two-

letter codes used until now. The sequence counts 60 amino acids per line, in groups of 10

amino acids, beginning in position 6 of the line.

An example of sequence data preceded by the corresponding SQ line is shown here:
SQ   SEQUENCE   97 AA; 9110 MW; E3C20C259858B830 CRC64;
     MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV
     NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI




The // (terminator) line contains no data or comments and designates the end of an entry.



3.1.2 GO Annotations

       A GO definition file was obtained from the Gene Ontology consortium web site

[36]. In the version of July 2005, there are 19094 GO terms including
                                            44
       •   9856 biological process terms,

       •   7559 molecular function terms, and

       •   1679 cellular component terms.

       The GO annotation file is available in the OBO file format and is maintained in

ASCII doc format in the Gene Ontology CVS repository. The OBO file format itself

attempts to achieve the goals of human readability, ease of parsing, extensibility and

minimal redundancy.

       The OBO file format consists of a <header> and <stanza>s. The header is an

unlabeled section at the beginning of the document containing tag-value pairs. A stanza is

a labeled section of the document, indicating that an object of a particular type is being

described. Stanzas consist of a stanza name in square brackets, and a series of tag-value

pairs. Tag-value pairs consist of a tag name, a colon, the tag value, and a newline.
<tag>: <value> {<trailing modifiers>} ! <comment>

The tag name is always a string. The value is always a string, but the value string may

require special parsing depending on the tag with which it is associated. In general, tag-

value pairs occur on a single line.

A sample stanza entry is shown.
[Term]
id: GO:0000006
name: high affinity zinc uptake transporter activity
namespace: molecular_function
def: "Catalysis of the reaction: Zn2+(out) = Zn2+(in), probably powered by proton motive
force." [TC:2.A.5.1.1]
xref_analog: TC:2.A.5.1.1
is_a: GO:0005385 ! zinc ion transporter activity


At present, every OBO stanza always begins with an id tag. The value of the id tag

announces the object to which the rest of the tags in the stanza refer. Normal, non-

anonymous ids have global scope. An object has the same id in every file, and in every

                                            45
namespace. In our parsing system for using the GO annotation, we just deal with some of

the tags mentioned below.


id             The unique id of the current term
name           The term name. Any term may have only one name defined
namespace      The ontology to which the term belongs
alt_id         Defines an alternate id for this term
def            The definition of the current term
xref           A dbxref that describes an analagous term in another vocabulary
is_a           This tag describes a subclassing relationship between one term and
               another. The value is the id of the term of which this term is a subclass. A
               term may have any number of is_a relationships.
relationship   This tag describes a typed relationship between this term and another term.
is_obsolete    Whether or not this term is obsolete.



3.1.3 Yeast Gene Expression Data

       The second kind of yeast data used was the gene expression data for a complete

yeast cell cycle. The availability of complete sequence for the Saccharomyces cerevisiae

genome has made it possible to measure mRNA transcript levels for virtually every yeast

gene. In the study by Cho et al. [37], commercially available high-density oligonucleotide

arrays were used to estimate mRNA transcript levels in synchronized yeast cells at

regular intervals during the cell cycle. DNA oligonucleotide probes are directly

synthesized on these arrays without individual manipulation or PCR amplification,

minimizing the potential for cross-hybridization or clone error. The data was generated

by the Yeast Cell Cycle Analysis Project by Spellman et al. [38], which is available at the

Genomics server of Stanford University Medical Center [39].




                                            46
        The dataset includes 24 gene expression profiles measuring the mRNA expression

levels of the proteins during the growth of yeast cells. Each profile includes 6178 related

proteins. A small part of the micro array data is shown in Table 3.2. The rows represent

different genes while the columns represent the gene expression levels at different time

instants.



               cdc15_10   cdc15_30   cdc15_50    cdc15_70   cdc15_80   cdc15_90   cdc15_100
     YBR217W      -0.26      -0.48      -0.07       -0.34       0.23      -0.48        0.35
     YBR218C       0.08       0.13      -0.35       -0.25      -0.25      -0.23       -0.16
     YBR219C       0.26       0.17       0.18        0.15          0       0.04       -0.42
     YBR220C       0.25      -0.09       0.09       -0.14       0.33      -0.14         0.4
     YBR221C      -0.56      -0.73       -0.5       -0.18       0.05       0.01       -0.46
     YBR222C      -0.84      -0.79      -0.02        0.08       0.48       0.26        0.25
     YBR223C       0.13       0.18       0.18       -0.25      -0.21      -0.04       -0.32
     YBR224W       -0.2       0.24       0.02       -0.62       0.17      -0.44        0.27
     YBR225W        0.2       0.34        0.2        -0.1      -0.16       0.07       -0.07
     YBR226C       0.23       0.42        0.1        -0.1       0.11      -0.05        0.25
     YBR227C       0.29       0.62       0.68       -0.01      -0.61       0.27       -0.29
     YBR228W       0.11       0.25      -0.04       -0.32       0.21      -0.22        0.07
     YBR229C       0.02      -0.16      -0.16       -0.06      -0.09       0.15       -0.42
     YBR230C       -0.3      -0.69      -0.56       -0.88       0.68       -0.2        0.63
     YBR231C          0       0.37      -0.11        -0.5       -0.1      -0.36        0.03
     YBR232C      -0.04       0.29          0       -0.55      -0.06      -0.61       -0.09
     YBR233W      -0.06       0.22       0.19       -0.26       -0.1      -0.52        0.14
     YBR234C      -0.53       -0.8      -0.76       -0.29       0.49      -0.24        0.17
     YAL020C      -0.38      -0.42      -0.28       -0.15       0.16      -0.12        0.26
     YAL021C                  0.17       0.22       -0.09       0.19      -0.17        0.16



                          Table 3.2 Yeast gene expression data [39]



The data was recorded over a complete yeast cell cycle. The columns represent different

time slots varying from 10 minutes to 270 minutes. The rows represent the different

genes and the data is the gene expression levels in the form of logarithmic ratios.




                                                47
3.2 Data pre-processing

       The files downloaded from the various servers are not used in their entirety. So

they need to be pre-processed to extract the actual information that is necessary and also

to get them in the appropriate format to be processed easily and in a time-efficient

manner.

       From the Yeast Proteome file (Yeast.dat) we extract the AC line, the DR line with

the GO database reference to the molecular function (F) ontology and the SQ line with

the complete sequence. The GO ids for the annotated molecular functions are stored

along with the protein accession id in a separate file. The protein accession id along with

the sequence is stored in another file. The first file forms a lookup table to look up for the

GO annotation for any yeast protein. The sequences in the latter file are used in the

BLAST program to obtain the similarity scores among all the sequences.

       Similarly, from the Gene Ontology file, only the required information like the GO

id and the “is_a” or “part_of” information are extracted for the molecular function

ontology to form a file from which a tree structure of the entire GO annotation can be

built. The obsolete terms are not considered and the alternate ids (alt_id) are also

recorded if they exist.

Perl codes were written to perform all these pre-processing operations.

       In case of the gene expression data, it was found out that there were a lot of genes

with missing data. So the genes with 50% or more of missing data were not considered.

As a result 516 genes were eliminated leaving a data of 5662 genes.

The pre-processing of the gene expression data was performed using MathWorks

MATLAB® R2006a and Microsoft® Excel 2003.

                                             48
3.3 Sequence similarity scoring

       Sequence similarities can be measured through pair-wise global alignments or

local alignments. Homologous protein sequences are usually similar over active domains

and thus share common folds and functions. Therefore, local alignment is a more

appropriate method for comparing protein sequences for their functional similarity. There

are several local alignment schemes for comparing protein sequences, including BLAST

that can be used together with different scoring systems such as BLOSUM62 and

BLOSUM80.



3.3.1 BLAST Algorithm




                            Fig. 3.1 BLAST Algorithm [40]



                                           49
The BLAST algorithm is a heuristic search method that seeks words of length W (default

= 3 in blastp) that score at least T when aligned with the query and scored with a

substitution matrix. Words in the database that score T or greater are extended in both

directions in an attempt to find a locally optimal ungapped alignment or HSP (high

scoring pair) with a score of at least S or an E value lower than the specified threshold.

HSPs that meet these criteria will be reported by BLAST, provided they do not exceed

the cutoff value specified for number of descriptions and/or alignments to report [40].



       The program returns a list of local alignments of certain statistical significance.

However, how to measure the overall similarity of two protein sequences is not obvious.

For example, proteins with two similar domains with certain similarity scores could be

considered to be much more similar than proteins with only one domain with a higher

similarity score. In this study, a novel measure of overall similarity of two protein

sequences is introduced. We utilize the alignment tool for BLASTing two sequences to

obtain the list of optimal local alignments. Let {S1, …,Sn} be scores of a list of best local

alignments with certain statistical significance.     Instead of using the highest score

(max{S1, …,Sn}) in the list to measure the overall similarity of the two protein sequences,

we use the following score S to measure the overall similarity of two sequences [41].

                           S = −∑ ln pi ,                      (1)
                                  i



where pi = 1 − e − E stands for the probability of finding a high-scoring segment pair (HSP)
                i




with a local alignment score of at least Si, and Ei is the expected number of HSPs of score



                                             50
at least Si and can be obtained directly from the alignment tool. Assuming the HSPs are

independent of each other, the p-value:

                            p = ∏ pi = e − S                    (2)
                                  i



measures the probability of finding a pair of protein sequences with a list of scores at

least {S1, …, Sn}. The corresponding E-value for the overall similarity (the expected

number of the lists that have scores at least {S1, …, Sn}) therefore can be written as:

                           E = − ln(1 − e − S ) .               (3)


We use the p-values and the E-values to measure the overall similarity of a pair of protein

sequences. Since when | x |< 1 , − ln(1 − x) = x + x 2 / 2 + O( x 3 ) , the E- and p-values are

essentially the same when they are small. For example, when p = 10−5 , p − E is of

order 10−10 . For convenience, p-values are used for the further analysis. The alignment

tool used for blasting two sequences (bl2seq) was retrieved from the NCBI ftp site [42].

We used version 2.2.11 with default parameters and the substitution matrix BLOSUM62.

All processing scripts were written in Perl.




3.4 Classification

    The k-nearest neighbor method is a standard learning method. The training examples

of different groups are mapped into a feature space. An uncharacterized sample is

assigned to a specific group if the group label is the most frequent label among the k

nearest training samples. K-nearest neighbor method is one of the most popular methods


                                                    51
for classification and has been heavily investigated in the fields of statistic and pattern

recognition. However, the accuracy of the method can be severely reduced by the

presence of noisy, irrelevant features or inappropriate scaling of the features. In this

study, we develop a weighted k-nearest neighbor method. The weight of a feature is

calculated based on the protein sequence similarities measured using Equation (1).



kNN Classifier -

       Even though they are simple, kNN classifiers are one of the best performers for a

variety of classification problems. As they do not make any assumptions regarding the

underlying training data, they are successful when the decision boundaries are irregular

or when a particular class has multiple prototypes. Since the class boundaries are

inherently vague and a simple model in case of gene-function-prediction problems cannot

categorize many classes, the flexibility of the kNN classifier plays a significant role.

The prime idea of the classical kNN classifier is the following:

   1. Design a set of numerical features to describe each data point.

   2. Select a metric, e.g. Euclidean distance to measure the similarity among the data

       points based on all the designed features.

   3. For a target point, determine its k closest points in the sample set based on the

       similarity metric.

   4. Assign the target point to a particular class by a majority vote of its neighbors.




                                             52
Fig. 3.2 The graphical representation of a kNN Classifier algorithm with k=5.
Three clusters (ω1, ω2, ω3) represent different apriori groups that cluster in space.
Demensionality is reduced into 2 distance axes for simplicity. Four of five nearest
neighbors of “x” are in ω1 so x is placed in ω1


       The advantages of kNN classifier is that it is well suited for multi-modal, that is

classes consisting of objects whose independent variables have different characteristics

for different subsets, as its classification decision is based on a small neighborhood of

similar objects.

       For the weighted kNN, instead of using the similarity metric score directly, we

consider its negative logarithm to enhance the contribution of the closer neighbors.

    We implemented the weighted k-nearest neighbor method based on the overall

similarity p-values of protein pairs (Equation 1). Given an uncharacterized protein i, a list

of GO annotations of the k-nearest neighbor proteins {GOl} is retrieved. The GO terms

are then ranked based on the magnitude of the p-values:


                      rank (GOl ) =          ∑             − log( pij )   (3)
                                           1≤ j ≤ k
                                         protein j is
                                      annotated with GOl


                                                    53
where protein j is one of the k-nearest neighbors and pij is the similarity p-value of the

protein pair i and j. The value of −300 is assigned to the variable term log(pij) when pij is

zero. The GO term with highest ranking will be assigned to protein i. In the case that two

or more GO terms are of the same ranking score, the value of k is increased by 1 at a time

until the tie is broken. The main advantages of weighted k-nearest neighbor method is

that the method is much more stable with the change of k and it significantly reduces the

noise that might be introduced by proteins that are not similar to the query protein. To

compare the improvement of the weighted k-nearest neighbor method, we implemented

the regular k-nearest neighbor method and the rank for a GO group was calculated as the

number of proteins belonging to that GO group.



3.5 Gene Expression Correlation and Clustering

       In the second approach to protein function prediction, we use gene expression

data. We use the Pearson Correlation Coefficient as a distance measure for the kNN

classifier algorithm.

Suppose we have two variables X and Y, with means X and Y respectively and standard

deviations SX and SY respectively. The Pearson correlation is computed as

                                    ∑ (X              − X )(Yi − Y )
                                        n

                                 r=     i =1      i

                                               (n − 1)S X S Y
Suppose that an X value was above average, and that the associated Y value was also

above average. Then the product

                                      (X   i   − X )(Yi − Y )



                                                  54
would be the product of two positive numbers which would be positive. If the X value

and the Y value were both below average, then the product above would be of two

negative numbers, which would also be positive.

Therefore, a positive correlation is evidence of a general tendency that large values of X

are associated with large values of Y and small values of X are associated with small

values of Y.

Suppose that an X value was above average, and that the associated Y value was instead

below average. Then the product

                                      (X   i   − X )(Yi − Y )

would be the product of a positive and a negative number which would make the product

negative. If the X value was below average and the Y value was above average, then the

product above would also be negative.

Therefore, a negative correlation is evidence of a general tendency that large values of X

are associated with small values of Y and small values of X are associated with large

values of Y.

       So a correlation coefficient like the Pearson factor helps in detecting any trends, if

any, in the gene expression data over the cell cycle.

       Since we need a direct positive correlation between the genes, we sort the

correlation coefficients and use the first k coefficients for the kNN classifier. We make

use of the functions in the Statistics toolbox of MATLAB® R2006a to obtain the Pearson

coefficients. Problems were encountered when the functions were directly used on the

5662 genes in the pre-processed file as the Pearson coefficient matrix took a size of 5662

X 5662. So a program was written separately to select a protein, pair it up with the rest of
                                                  55
the proteins, get their correlation coefficient based on the 24 samples and then obtain the

top k correlated proteins with their correlation coefficients.


3.6 A Novel Approach



                                  Protein with unknown function


                                  Retrieve k most similar proteins
                                       (k-nearest neighbors)


                               Obtain GO annotations of the neighbor
                               proteins and all the ancestors GO terms


                             Calculate the ranks of related GO terms at
                                  different levels of the GO tree


                            Predict GO annotations to the uncharacterized
                               protein and report the confidence level


                         Fig. 3.3 Schematic for the novel approach


       We first performed all-to-all pair-wise protein sequence local alignments using

the alignment tool (bl2seq) for comparing two sequences. The p-values were then

calculated based on Equation (1). The weighted k-nearest neighbor method based on

sequence alignment was implemented using Equation (3). The regular k-nearest neighbor

method was implemented using Equation (4). The performances of the methods are

examined using the leave-one-out cross-validation scheme. The scheme selects one

protein from the original set of 3247 proteins as the validation data and the remaining

proteins 3246 as the training data. Prediction is performed on the selected protein. We




                                                56
repeated the process for every protein in the dataset. The accuracy was then calculated

based on the percentage of correct predictions.

       Percentage Prediction Accuracy = Number of correct predictions * 100
                                             Total predictions



       In the second approach, where we used the gene expression data, the Pearson

correlation factors were used as the distance measures for the kNN and the weighted kNN

algorithms.




                                            57
                                     CHAPTER IV

                            RESULTS AND DISCUSSIONS


       Among the 6467 protein sequences,

       •   4175 are annotated with 1084 biological process terms

       •   3317 are annotated with 1060 molecular function terms

       •   4735 are annotated with 354 cellular component terms

The biological process ontology tree consists of 11 levels and 11091 GO nodes, of which

903 are unique.

The molecular function ontology tree consists of 9 levels and 471 GO nodes, of which

362 are unique.

The cellular component tree has 7 levels and 1692 GO notes, of which 284 are unique.

       Since we are focusing on the protein function prediction, only molecular function

groups and the yeast protein sequences annotated with molecular function terms are

considered in this study.

       Fig. 4.1 shows the number of GO terms at different GO levels for the three

different ontologies. As you go deeper in the levels, the annotations become more precise

and specific.




                                           58
                     250
                                                  biological process
                                                  molecular function
                     200
                                                  cellular component
Number of groups




                     150


                     100


                              50


                                  0
                                          0       1    2      3        4    5         6      7      8    9   10   11
                                                                            GO level



                                      Fig. 4.1 Numbers of GO groups at different levels of the gene ontologies



                     10000

                                                                                    biological process
                                                                                    molecular function
                             1000                                                   cellular component
                   Average size




                                  100




                                  10




                                      1
                                              0   1     2      3       4        5      6     7      8    9   10   11
                                                                                GO level

                                      Fig. 4.2 Average size of GO groups at different levels of the ontologies


                                                                           59
       Fig. 4.2 shows the average size of the GO groups at different levels. It includes

the number of proteins annotated by GO terms at a particular level. The trend is that as

the level increases and the GO terms become more specific, the average size reduces.



       The GO terms in the molecular function ontology category were parsed and stored

in a tree structure similar to the one used in AmiGO [43] to form a GO tree. Since GO

terms are originally organized in a DAG, a GO term may have several parent terms. In

this case, the child term appears multiple times on the same or different levels of the tree.

Protein sequences were then mapped onto the trees. The groups of less than six proteins

were removed for statistically meaningful results. Fig. 4.3 shows a protein mapped

molecular function ontology tree. The number in the parentheses after each functional

group indicates the number of proteins mapped to the group. For example, 392 proteins

are annotated with the transporter activity function group.




                                             60
Fig. 4.3 GO Tree for molecular function ontology

                      61
Figs. 4.4 – 4.6 show the percentage accuracies at the first, second and third levels of the

molecular function group of the GO ontology. The accuracies for the kNN and the

weighted kNN algorithm are plotted for k values varying from 2 to 18 in intervals of 2.

The 95% confidence intervals (bootstrap randomization) are also shown for each

accuracy point.

       As we clearly see, the weighted k-nearest neighbor method increases prediction

accuracy significantly. Figure 4.4 shows a 7% improvement for the prediction of the

biological functions at the first level of GO tree, when k = 4 and the accuracy level is

70% for the regular k-nearest neighbor method. The accuracy of the weighted k-nearest

neighbor method slightly increases from 75% to 76% and is stabilized at this level. On

the other hand, when k ≥ 4, we see a steady decrease of the accuracy obtained from the

regular k-nearest neighbor method as more irrelevant proteins are included in the

prediction. The differences between the weighted and regular k-nearest neighbor methods

are much more pronounced. For GO terms on the second level of GO tree, the prediction

accuracy of the weighted k-nearest neighbor method increases steadily to about 60%.

Comparing with regular k-nearest neighbor method, the accuracy is increased by about

25% when k = 10. We do see a slight decrease in accuracy for GO terms on level 3.

However, the accuracy of the weighted k-nearest neighbor method is clearly stable as k

changes. On the other hand, the performance of the regular k-nearest neighbor method

degenerates as k increases.

       Table 4.1 shows the p-values for the pair wise comparison of the percentage

accuracies at each level and for different k values. The p-values were obtained from

randomization test for increase in accuracy of weighted kNN prediction relative to kNN

                                            62
                                   Prediction Accuracy for First Level

                                               KNN Classifier        WKNN Classifier


                       80


 Percentage Accuracy   70


                       60


                       50


                       40


                       30
                            2       4      6         8          10    12       14      16   18
                                                         K Values




                                 Fig. 4.4 Percentage Accuracy at first level




                                 Prediction Accuracy for Second Level

                                               KNN Classifier        WKNN Classifier


                       80
Percentage Accuracy




                       70


                       60


                       50


                       40


                       30
                            2       4      6         8          10    12       14      16   18
                                                         K Values




                                Fig. 4.5 Percentage Accuracy at second level

                                                         63
                                     Prediction Accuracy for Third Level

                                                  KNN Classifier        WKNN Classifier

                           80

                           70
     Percentage Accuracy




                           60

                           50

                           40

                           30

                           20
                                2      4      6         8          10     12      14      16       18
                                                            K Values




                                    Fig. 4.6 Percentage Accuracy at third level




     k                                     Level 1                      Level 2                  Level 3
     2                                     ≤ 0.001                      0.0039                   0.0002
     4                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
     6                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
     8                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
    10                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
    12                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
    14                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
    16                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
    18                                     ≤ 0.001                      ≤ 0.001                  ≤ 0.001
χ2 combined                            ≤ 9.84495E-32                ≤ 3.3583E-30               ≤ 1.9206E-31
 probability

                            Table 4.1 P-values for test of percentage accuracies


                                                            64
prediction. The combined probability for each level is calculated as χ2 probability for

-2log likelihood.




           Fig. 4.7 GeneCluster – Web tool for clustering based on SOM [44]



A web-based tool: GeneCluster [44] was used to implement the SOM based clustering.

Fig.4.7 shows the interface for the setup of the SOM parameters. In this instance, we

performed a 4 X 3 clustering to get a total of 12 clusters shown in Fig. 4.8.

                                             65
              Fig. 4.8 Resulting 12 clusters from a 4 X 3 clustering scheme



       The weighted kNN algorithm was applied to these 12 different clusters with a set

k value of k = 10. The percentage accuracies are shown in Fig. 4.9.

Table 4.2 shows the number of proteins in each cluster and the numeric value of the

percentage accuracy.


                                           66
                                        Percentage Accuracy for 4X3 Cluster

                      100


                      80
Percentage Accuracy




                      60


                      40


                      20


                       0
                            1       2      3    4     5        6   7   8      9     10   11   12
                                                           Clusters



                            Fig. 4.9 Percentage accuracy for 4 X 3 clustering scheme




                            Cluster No         Number of genes         % Accuracy
                                  1                 177                    60.45
                                  2                 358                    63.41
                                  3                 356                    62.64
                                  4                 197                    71.57
                                  5                 233                    71.67
                                  6                 55                     76.36
                                  7                 202                    57.92
                                  8                 44                     88.64
                                  9                 64                     84.38
                                  10                264                    66.29
                                  11                192                    66.15
                                  12                 34                    61.76

                                Table 4.2 Cluster details for 4 X 3 clustering scheme




                                                          67
                                     Percentage Accuracy for 2X3 Cluster

                      100
Percentage Accuracy
                       80


                       60


                       40


                       20


                        0
                               1            2           3              4             5   6
                                                            Clusters




                            Fig. 4.10 Percentage accuracy for 2 X 3 clustering scheme




                               Cluster No       Number of genes            % Accuracy

                                    1                398                     57.29
                                    2                109                     55.05
                                    3                176                     72.73
                                    4                349                     72.49
                                    5                704                     69.03
                                    6                443                     73.59

                              Table 4.3 Cluster details for 2 X 3 clustering scheme




                                                        68
                                          Percentage Accuracy for 2X5 Cluster


                      100


                      80
Percentage Accuracy




                      60


                      40


                      20


                       0
                             1        2        3     4      5      6      7           8   9   10
                                                           Clusters


                            Fig. 4.11 Percentage accuracy for 2 X 5 clustering scheme




                                  Cluster No       Number of genes      % Accuracy

                                       1                 435                  57.93
                                       2                 292                  65.75
                                       3                 58                   68.97
                                       4                 335                  60.60
                                       5                 293                  78.50
                                       6                 364                  70.33
                                       7                 62                   88.71
                                       8                 238                  68.07
                                       9                 62                   62.90
                                      10                 43                   90.70

                                 Table 4.4 Cluster details for 2 X 5 clustering scheme

                                                          69
       A 2 X 3 clustering scheme as well as a 2 X 5 clustering scheme were also

demonstrated as shown in Fig. 4.7 and 4.8 and Table 4.3 and 4.4.

A function prediction was done on the entire gene expression data without the clustering

at k =10. A percentage accuracy of 55.53 was obtained. It is easily seen that the

clustering approach does improve the prediction accuracy.

       It is also seen that for clusters with less number of genes, higher percentage

accuracy is obtained.




                                          70
                                     CHAPTER V

                                   CONCLUSIONS


       In this study, we developed a weighted k-nearest neighbor classification system

for automated protein function annotations. The classification system was tested on the

dataset of complete yeast proteome. The system identifies homologous proteins using a

novel measure of the overall similarity between two protein sequences. The biological

functions of proteins are described using GO annotations. The function prediction is

performed using a weighted k-nearest neighbor method. The prediction results indicate

that there is a strong correlation between proteins amino acid sequences measured by

Equation (1) and their biological functions. The weighted k-nearest neighbor method

significantly improves the prediction accuracy when compared to regular k-nearest

neighbor method. Furthermore, we see that the weighted k-nearest neighbor method is

very stable algorithm with respect to the changes of k. We conclude that the method

significantly outperformed the regular k-nearest neighbor method for automated protein

function prediction.

       From the p-values presented in Table 4.1, we reject the null hypothesis and we

accept the alternate hypothesis. Thus, there is a significant difference between the

percentage accuracies for the weighted kNN and the classical kNN algorithm. It can be

seen that the percentage accuracy is higher in case of the weighted kNN algorithm.


                                           71
        In the second part of the project, where we use the clustering approach, it is seen

that that the clustering method does improve the performance of the weighted kNN

algorithm and it is even better if the cluster size is small.




                                               72
                                 REFERENCES


1. Eisenberg D, Marcotte E, Xenarlos I, Yeates T. Protein function in the post-
   genomic era. Nature 2000; 405: 823–26

2. Banks RE, Dunn MJ, Hochstrasser DF, Sanchez Jean, et al. Proteomics: new
   perspectives, new biomedical opportunities. The Lancet 2000; 356: 1749–56

3. Mahairas GG, Sabo PJ, Hickey MJ, Singh DC, Stover CK. Molecular analysis of
   genetic differences between Mycobacterium bovis BCG and virulent M bovis. J
   Bacteriol 1996; 178: 1274–82

4. Corbett JM, Why HJ, Wheeler CH, et al. Cardiac protein abnormalities in dilated
   cardiomyopathy detected by two-dimensional polyacrylamide gel electrophoresis.
   Electrophoresis 1998; 19: 2031–42

5. Knecht M, Regitz-Zagrosek V, Pleissner K-P, et al. Characterization of
   myocardial protein composition in dilated cardiomyopathy by twodimensional gel
   electrophoresis. Eur Heart J 1994; 15 (suppl D):37–44.

6. Pleissner KP, Soding P, Sander S, et al. Dilated cardiomyopathy associated
   proteins and their presentation in a WWW-accessible twodimensional gel protein
   database. Electrophoresis 1997; 18: 802–08.

7. Wheeler CH, Collins A, Dunn MJ, et al. Characterization of endothelial antigens
   associated with transplant-associated coronary artery disease. J Heart Lung
   Transplant 1995; 14: S188–97.

8. Alaiya AA, Franzen B, Auer G, Linder S. Cancer proteomics: from
   identificationb of novel markers to creation of artificial learning models for tumor
   classification. Electrophoresis 2000; 21: 1210–17.

9. Masters CL, Simms G, Weinman NA, et al. Amyloid plaque core protein in
   Alzheimer disease and Down syndrome. Proc Natl Acad Sci USA 1985; 82:
   4245–49.




                                        73
10. Glenner GG, Wong CW. Alzheimer’s disease and Down’s syndrome: sharing of a
    unique cerebrovascular amyloid fibril protein. Biochem Biophys Res Commun
    1984; 122: 1131–35.

11. Chen Y, Xu D. Genome-Scale Protein Function Prediction in Yeast
    Saccharomyces cerevisiae through integrating multiple sources of high-
    throughput data. Pacific Symposium on Biocomputing 2005; 10: 471-482

12. Pavlidis P, Weston J, Cai J, Grundy WN. Gene Functional Classification from
    heterogeneous data. In Proceedings of the Fifth International Conference on
    Computational Molecular Biology 2001 (RECOMB2001). 249-255

13. Schwikowski B, Uetz P, Fields S. A network of protein-protein interactions in
    Yeast. Nature Biotechnology 2000; 18: 1257-61

14. Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T. Assesment of prediction
    accuracy of protein function from protein-protein interaction data Yeast 2001; 18:
    523-531

15. Clare A, King RD. Machine learning of functional class from phenotype data.
    Bioinformatics 2002; 18: 160-166

16. Letovsky S, Kasif S. Predicting protein function from protein/protein interaction
    data: a probabilistic approach. Bioinformatics 2003; 19 Suppl 1: i197-i204

17. Troyanskaya OG, Dolinski K, Owen A, Altman R, Botstein D. A Bayesian
    framework for combining heterogeneous data sources for gene function prediction
    (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci 2003; 100: 8348-53

18. Mount DW. Bioinformatics: Sequence and Genome Analysis, 2nd Edition

19. The Gene Ontology Consortium. Gene Ontology: tool for the unification of
    biology. Nature Genetics 2000; 25: 25-29

20. The Gene Ontology Documentation –
    http://www.geneontology.org/GO.contents.doc.shtml

21. Hennig S, Groth D, Lehrach H. Automated Gene Ontology annotation for
    anonymous sequence data. Nucleic Acids Research 2002; 31:3712-3715

22. Zehetner G: OntoBlast function. From sequence similarities directly to potential
    functional annotations by ontology terms. Nucleic Acids Research 2003; 31:3799-
    3803



                                       74
23. Martin DM, Berriman M, Barton GJ. GOtcha: a new method for prediction of
    protein function assessed by the annotation of seven genomes. BMC
    Bioinformatics 2003; 5:178

24. Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L.
    Large-scale protein annotation through Gene Ontology. Genome Research 2002;
    12:785-794

25. Schug J, Diskin S, Mazzarelli J, Brunk BP, Stoeckert CJ Jr. Predicting gene
    ontology functions from ProDom and CDD protein domains. Genome Research
    2002; 12:648-655

26. Abascal F, Valencia A. Automatic annotation of protein function based on family
    identification. PROTEINS: Structure, Function, and Genetics 2003; 53:683-692

27. Jensen LJ, Gupta R, Staerfeldt HH, Brunak S. Prediction of human protein
    function according to Gene Ontology categories. Bioinformatics 2003; 19: 635-
    642

28. Vinayagam A, Konig R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S.
    Applying support vector machines for Gene Ontology based gene function
    prediction. BMC Bioinformatics 2004; 5:116

29. Babu M. Microarray Data Analysis – An Introduction to Microarray data analysis;
    225-249

30. Ge, H., Liu, Z., Church, G. M., and Vidal, M. Correlation between transcriptome
    and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 2001;
    29: 482-486

31. Jansen, R., and Gerstein, M. Analysis of the yeast transcriptome with structural
    and functional categories: characterizing highly expressed proteins. Nucl. Acids
    Res. 2000; 28: 1481-1488

32. Teichmann, S. A., and Madan Babu, M. Conservation of gene co-regulation in
    prokaryotes and eukaryotes. Trends Biotechnol 2002; 20: 407-410; discussion 410

33. van Noort, V., Snel, B., and Huynen, M. A. Predicting gene function by
    conserved co-expression. Trends Genet. 2003; 19: 238-242

34. Yeast proteome.
    ftp://ftp.expasy.org/databases/complete_proteomes/entries/eukaryota/

35. UniProt Knowledgebase: User Manual-Release 11.0 of 10-July 2007

                                      75
36. GO terms. ftp://ftp.geneontology.org/pub/go/ontology-archive/

37. Cho RJ, et al. A genome-wide transcriptional analysis of the mitotic cell cycle.
    Mol Cell 1998; 2:65-73

38. Spellman PT, et al. Comprehensive Identification of cellcycle regulated genes of
    the Yeast Saccharomyces Cerevisiae by Microarray Hybridization. Mol Biol of
    the Cell 1998; 9: 3273-97

39. Yeast cell cycle data. http://cellcycle-www.stanford.edu

40. NCBI BLAST Information.
    http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

41. Duan ZH, et al. The relationship between protein sequences and their gene
    ontology functions. BMC Bioinformatics 2006; 7:S11

42. Tatusova TA, Madden TL. Blast 2 Sequences, a new tool for comparing protein
    and nucleotide sequences. FEMS Microbiology Letters 1999; 174:247-250

43. AmiGO. http://www.godatabase.org/cgi-bin/amigo/go.cgi

44. Tamayo P, et al. Interpreting patterns of gene expression with self-organizing
    maps: Methods and application to hematopoietic differentiation. PNAS 1999; 96:
    2907-12




                                        76
                               APPENDIX



IMPLEMENTATION OF WEIGHTED KNN CLASSIFICATION ALGORITHM


#!/usr/bin/perl -w
#======================================================================
============================
# Saket Kharsikar
# Protein Function Prediction using GO Annotations
# Automated Weighted-KNN for entire proteone
#======================================================================
============================

use POSIX qw(log10);

$inProtFile = "YeastProteome2.dat";
$inScoreFile = "GOTree/GO_0003674Score.txt";
$inDataFile = "Yeastmodif.dat";
$inGroupFile = "GOgroups.txt";
$outReportFile = "Report_Level1_k4_wknn.txt";

open(INFILE,"<$inProtFile") || die("Could not open $inProtFile");
# P-values for protein pairs
print "Obtaining the yeast proteome from $inProtFile\n";
@prot=<INFILE>;
close(INFILE);

open(INFILE,"<$inScoreFile") || die("Could not open $inScoreFile");
# P-values for protein pairs
print "Obtaining P-Values from $inScoreFile\n";
@score=<INFILE>;
close(INFILE);

open(INFILE,"<$inDataFile") || die("Could not open $inDataFile");
# Yeast proteins and their GO nodes
print "Obtaining Yeast data from $inDataFile\n";
@data=<INFILE>;
close(INFILE);

open(INFILE,"<$inGroupFile") || die("Could not open $inGroupFile");
# GO Tree structure
print "Obtaining GO Tree structures from $inGroupFile\n";
@group=<INFILE>;
close(INFILE);
                                   77
open(OUTFILE,">$outReportFile") || die("Could not create
$outReportFile");      # Output report
%results=("correct",0,"incorrect",0,"failed",0);
$total=0;
foreach $queryprot(@prot)
{
  $total++;
  chomp $queryprot;
  print "Query # $total\tProcessing query $queryprot\n";
  $result=prediction($queryprot);
  $results{$result}++;
}
print(OUTFILE "\n\nTotal Predictions : \t$total\nCorrect Predictions :
\t$results{'correct'}\nIncorrect Predictions :
$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");
print("\n\nTotal Predictions : \t$total\nCorrect Predictions :
\t$results{'correct'}\nIncorrect Predictions :
$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");
close(OUTFILE);


#======================================================================
============================
# Subroutine to predict the function of the protein
#======================================================================
============================

sub prediction
{
  my($query_protein) = @_;
  my($K) = 10;
  my(%ClassifierCount)=();
  my(%MatchProtein)=();
  my($hit)=0;
  my($prev_protein)=0;
  my($prev_GO)=0;

  my(@Matches) = getMatch($query_protein);
  foreach $eachMatch(@Matches)
  {
    if($eachMatch eq "0")
    {
      print(OUTFILE "$query_protein : PREDICTION FAILED    - Invalid
query protein\n");
      return "failed";
    }
    my ($MProtein,$MScore)=split(/\t/,$eachMatch);
    $MatchProtein{$MProtein} = "$MScore";
  }
  my(@keys) = sort {$MatchProtein{$a} <=> $MatchProtein{$b}} (keys
%MatchProtein);
  if(scalar @keys < $K)
  { $K = scalar @keys; }
  for($n=0;$n<$K;$n++)

                                   78
 {
     $MProtein = $keys[$n];
     @GOnodes = getGOnode($MProtein);

     foreach $eachGOnode(@GOnodes)
     {
       if($eachGOnode eq "0")
       { $K++; last; }
       @primGOterms=getGOpath($eachGOnode);
       foreach $eachprimGOterm(@primGOterms)
       {
         if($eachprimGOterm eq "0")
         { $K++; last; }
         @GOPath = split(/\s/,$eachprimGOterm);
         @GOPath=reverse(@GOPath);

        if($prev_protein eq ($MProtein||0) && $prev_GO eq
($GOPath[1]||0))
          { next; }
        else
        {
          $sigGO = join(" ",keys(%ClassifierCount));
          $weight=-log10($MatchProtein{$MProtein});
          if($sigGO =~/$GOPath[1]/)
          {
$ClassifierCount{$GOPath[1]}=$ClassifierCount{$GOPath[1]}+$weight; }
          else
          { $ClassifierCount{$GOPath[1]}=$weight; }
        }
        $prev_protein=$MProtein;
        $prev_GO=$GOPath[1];
      }
    }
  }
  @keys = sort {$ClassifierCount{$b} <=> $ClassifierCount{$a}} (keys
%ClassifierCount);

  @Q_node=getGOnode($query_protein);
  foreach $eachQ_node(@Q_node)
  {
    if($eachQ_node eq "0")
    {
      print(OUTFILE "$query_protein : PREDICTION FAILED   - Query
protein GO node not existing\n");
      return "failed";
    }
    @Q_path=getGOpath($eachQ_node);
    foreach $eachQ_path(@Q_path)
    {
      if($eachQ_path eq "0")
      { last; }
      @Q_GOPath = split(/\s/,$eachQ_path);
      @Q_GOPath=reverse(@Q_GOPath);
      if($keys[0] eq $Q_GOPath[1])
      { $hit=1; }

                                    79
    }
  }
  if($hit==1)
  { print(OUTFILE "$query_protein : PREDICTION CORRECT\n"); return
"correct"; }
  else
  { print(OUTFILE "$query_protein : PREDICTION INCORRECT\n"); return
"incorrect"; }
}


#======================================================================
============================
# Subroutine to obtain significantly similar proteins using Score file
#======================================================================
============================

sub getMatch
{
  my($query)=@_;
  my($check)=0;
  my(@Match);

    foreach $score_line(@score)
    {
      chomp $score_line;
      if($score_line =~ /$query/)
      {
        $check =1;
        $match_line=$score_line;
        $match_line=~s/$query//gi;
        push(@Match,"$match_line");
      }
    }
    if($check==0)
    { push(@Match,"0"); }
    return @Match;
}


#======================================================================
============================
# Subroutine to obtain GO node terms of the protein from Yeast data
file
#======================================================================
============================

sub getGOnode
{
  my($protein)=@_;
  my($det)=0;

    foreach $data_line(@data)
    {
      chomp $data_line;

                                      80
     if($data_line=~/^AC.*$protein/ && $det==0)
     { $det=1; }
     elsif($det==1 && $data_line=~/^GO/)
     {
       $GO_line=$data_line;
       $GO_line=~s/GO:/ GO:/gi;
       my(@GOlist)=split(/\s/,$GO_line);
       shift(@GOlist);
       return @GOlist;
     }
     else
     { $det=0; }
    }
    push(@GOlist,"0");
    return @GOlist;
}

#======================================================================
============================
# Subroutine to obtain the various GO paths for a GO node term
#======================================================================
============================

sub getGOpath
{
  my ($GOnode)=@_;
  my (@GOTree)=(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
  my ($m)=1;
  my ($det)=0;
  my ($flag)=0;
  my ($currPath)=$GOnode;
  my ($currGO)=$GOnode;
  $j=0;
  $GOTree[0]=$GOnode;

    for($i=0;$i<$m;$i++)
    {
      $currGO=lastterm($GOTree[$i]);
      while($currGO ne "GO:0003674")
      {
        $currPath=$GOTree[$i];
        $currGO=lastterm($currPath);
        foreach $group_line(@group)
        {
          chomp $group_line;
          if($group_line=~/^GO_id: $currGO/)
          { $flag=1; $det=1; }
          elsif($flag==1 && $group_line=~/^is_a/)
          {
            $flag=0;
            @is_a_list=split(/\s/,$group_line);
            $num_is_a=scalar(@is_a_list)-2;
            $k=$num_is_a;
            $m=$m+$k-1;
            for($j=0;$j<$k;$j++)

                                     81
             {
                 if($GOTree[$j] eq "0" || $GOTree[$j] eq "$currPath")
                 { $GOTree[$j]="$currPath $is_a_list[($num_is_a+2)-($k-
$j)]";   }
                 else
                 { $k++; next; }
             }
             $currGO=lastterm($GOTree[$i]);
             last;
         }
       }
       if($det==0)
       { unshift(@GOTree,"0"); last; }
      }
    }
    for($i=$j;$j<15;$j++)
    { pop @GOTree; }
    return @GOTree;
}

#======================================================================
============================
# Subroutine to obtain the last term in any string
#======================================================================
============================

sub lastterm
{
  my($str)=@_;
  my(@substr)=split(/\s/,$str);
  my $numstr=scalar(@substr);
  return $substr[$numstr-1];
}




                                        82

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:4
posted:12/1/2011
language:English
pages:92