A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH
FOR THE PREDICTION OF PROTEIN FUNCTIONS
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
Saket Kharsikar
August 2007
A GENE ONTOLOGY BASED COMPUTATIONAL APPROACH
FOR THE PREDICTION OF PROTEIN FUNCTIONS
Saket Kharsikar
Thesis
Approved: Accepted:
____________________ ____________________
Advisor Department Chair
Dr. Dale H. Mugler Dr. Daniel B. Sheffer
____________________ ____________________
Co-Advisor Dean of the College
Dr. Zhong-Hui Duan Dr. George K. Haritos
____________________ ____________________
Committee Member Dean of the Graduate School
Dr. Francisco Moore Dr. George R. Newcome
____________________
Date
ii
ABSTRACT
Numerous genome projects have produced a large and ever increasing amount of
genomic sequence data. However, the biological functions of many proteins encoded by
the sequences remain unknown. Protein function annotation and prediction become an
essential and challenging task of post-genomic research. In this research, we present an
automated protein function prediction system based on a set of proteins of known
biological functions. The functions of the proteins are characterized with Gene Ontology
(GO) annotations. The prediction system uses a novel measure to calculate the pair-wise
overall similarity between protein sequences. The protein function prediction is
performed based on the GO annotations of similar sequences using a weighted k-nearest
neighbor method. We show the prediction accuracies obtained using the model organism
yeast (Sacchyromyces cerevisiae). The results indicate that the weighted k-nearest
neighbor method significantly outperforms the regular k-nearest neighbor method for
protein biological function prediction.
iii
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my advisor, Dr Dale Mugler for his
constant support and co-operation all through my Master’s Thesis.
I would also like to thank my co-advisor, Dr Zhong-Hui Duan for giving me the
opportunity to work on this research project and her invaluable inputs in the entire course
of the project. This thesis would not have been possible without her guidance and
persistent help.
A special thanks to my committee- Dr Daniel Sheffer and Dr Francisco Moore for
their time and effort and especially for their invaluable suggestions in biostatistics.
I would like to take a chance to thank my project group Gayatri Paknikar,
Santhosh Tadavai, Neeraj Mogla and Satish Sangem for their help in one of the modules
of my thesis.
Lastly, I would like to express my gratitude towards my parents and all my
friends for their faith and who were always there for me all through the progress of my
thesis and eventually my degree.
iv
TABLE OF CONTENTS
Page
LIST OF TABLES........................................................................................................... viii
LIST OF FIGURES ........................................................................................................... ix
CHAPTER
I. INTRODUCTION ......................................................................................................1
1.1 Proteomics............................................................................................................1
1.2 New Biomedical Opportunities............................................................................3
1.2.1 Drug Development.................................................................................4
1.2.2 Infectious Diseases.................................................................................4
1.2.3 Heart Disease .........................................................................................5
1.2.4 Cancer ....................................................................................................6
1.2.5 Neurological Disorders ..........................................................................6
1.3 Objectives of the project ......................................................................................6
1.4 Hypothesis............................................................................................................7
II. LITERATURE REVIEW ...........................................................................................8
2.1 Protein function prediction ..................................................................................8
2.1.1 Existing approaches for function prediction ..........................................9
2.2 Sequence Alignment and Similarity ..................................................................12
2.2.1 Global Alignment.................................................................................13
v
2.2.2 Local Alignment ..................................................................................13
2.2.3 Semi-global Alignment........................................................................13
2.2.4 Methods for alignment.........................................................................14
2.3 Gene Ontology ...................................................................................................17
2.3.1 Formation of GO..................................................................................17
2.3.2 Three Categories of Gene Ontology ....................................................18
2.3.3 Standardized format .............................................................................22
2.3.4 GO is the way to go .............................................................................22
2.4 Data Classification .............................................................................................23
2.5 Microarrays and Gene Expression Data ............................................................24
2.5.1 Microarray technology.........................................................................24
2.5.2 Gene Expression Data..........................................................................28
2.6 Data Clustering ..................................................................................................31
2.6.1 Hierarchical clustering: agglomerative ................................................32
2.6.2 Hierarchical clustering: divisive ..........................................................33
2.6.3 Non-hierarchical clustering..................................................................34
2.6.4 Predicting protein interactions and protein functions ..........................36
III. MATERIALS AND METHODS..............................................................................38
3.1 Data set...............................................................................................................38
3.1.1 Complete yeast proteome.....................................................................38
3.1.2 GO Annotations ...................................................................................44
3.1.3 Yeast Gene Expression data.................................................................46
3.2 Data pre-processing ...........................................................................................48
vi
3.3 Sequence similarity scoring ...............................................................................49
3.3.1 BLAST Algorithm ...............................................................................49
3.4 Classification......................................................................................................51
3.5 Gene Expression Correlation and Clustering.....................................................54
3.6 A Novel Approach .............................................................................................56
IV. RESULTS AND DISCUSSIONS.............................................................................58
V. CONCLUSIONS.......................................................................................................71
REFERENCES .........................................................................................................73
APPENDIX IMPLEMENTATION OF WEIGHTED KNN ....................................77
CLASSIFICATION ALGORITHM
vii
LIST OF TABLES
Table Page
2.1 Representation of gene expression data ...................................................................29
3.1 Line types and line codes for Swiss-Prot sequence entry .........................................42
3.2 Yeast gene expression data .......................................................................................47
4.1 P-values for test of percentage accuracies ................................................................64
4.2 Cluster details for 4 X 3 clustering scheme ..............................................................67
4.3 Cluster details for 2 X 3 clustering scheme ..............................................................68
4.4 Cluster details for 2 X 5 clustering scheme ..............................................................69
viii
LIST OF FIGURES
Figure Page
1.1 Growth of GenBank database over the period 1982 – 2005.......................................2
2.1 Examples showing global and local alignment between two sequences ..................12
2.2 Cellular Component GO nodes.................................................................................20
2.3 Biological Process GO nodes....................................................................................20
2.4 Molecular Function GO nodes..................................................................................21
2.5 A schematic representation of a microarray chip .....................................................26
2.6 Schematic for the experimental setup for any microarray gene ..............................27
expression analysis
2.7 Various clustering methods.......................................................................................32
2.8 Self-Organizing Maps...............................................................................................35
3.1 BLAST Algorithm ....................................................................................................49
3.2 The graphical representation of a kNN Classifier algorithm with k=5.....................53
3.3 Schematic for the novel approach.............................................................................56
4.1 Numbers of GO groups at different levels of the gene ontologies ...........................59
4.2 Average size of GO groups at different levels of the ontologies..............................59
4.3 GO Tree for molecular function ontology ................................................................61
4.4 Percentage Accuracy at first level.............................................................................63
4.5 Percentage Accuracy at second level........................................................................63
ix
4.6 Percentage Accuracy at third level ...........................................................................64
4.7 GeneCluster – Web tool for clustering based on SOM.............................................65
4.8 12 clusters from a 4 X 3 clustering scheme ..............................................................66
4.9 Percentage accuracy for 4 X 3 clustering scheme ....................................................67
4.10 Percentage accuracy for 2 X 3 clustering scheme ....................................................68
4.11 Percentage accuracy for 2 X 5 clustering scheme ....................................................69
x
CHAPTER I
INTRODUCTION
Biomedical research over the last couple of years has made tremendous progress
in our understanding of biology and medicine. The recent genomic sequencing of human,
mouse and other organisms and other high-throughput studies based on microarray
technology have been generating massive amounts of sequence data. Faced with the
avalanche of genomic sequences and gene expression data, biological scientists are
confronting a challenging prospect: piles of information but only flakes of knowledge.
1.1 Proteomics
Proteins are the major components of living organisms and constitute more than
25% by dry weight of a typical cell. These proteins perform a variety of functions like
catalysis, immune recognition, cell adhesion, signal transduction, sensory capabilities,
transport, movement and cell organization. Chemically, proteins are linearly oriented
hetero-polymers of amino acids; determined by the sequence of nucleotides in the
corresponding gene. Proteins are the main catalysts, structural elements, signaling
messengers and molecular machines of the biological tissues. Functional proteomics
targets learning the functions of protein molecules. All the primary functional knowledge
about the proteins comes from some kind of biochemical, genetic or structural laboratory
1
experiment on an individual protein. But with the explosion of protein sequence
information and an outburst of the amount of new proteins being sequenced and isolated,
it is not feasible to perform experiments on each and every protein individually. Hence it
is necessary to develop an expansive system for the functional characterization of the
proteins. Scientists are looking at other computational approaches to tackle the problem.
Fig. 1.1 Growth of GenBank database over the period 1982 – 2005
Courtesy GenBank Statistics at National Center for Biotechnology Information (NCBI)
There have been some computational approaches already suggested to predict
functions using the ever increasing abundance of sequence data produced by high-
throughput means like mass spectrometry and microarray gene expression profiling. The
major challenge for biologists is to use this genetic information available from the
genome sequencing projects - not just to decode the amino acid sequences but also to find
2
out their functions. Proteomics based approaches initially use computer-based similarity
searches against proteins of known function and the results may allow some broad
inferences to be made about the possible functions, which can then be explored
experimentally.
Once a function of an individual protein is found and has been assigned to it, one
can search for other proteins whose amino acid sequences are similar to that of the
original protein whose function is already known. This approach is known as the
homology method and is widely used to extend the knowledge of protein function from
one protein to its cousin proteins, which are presumed to be descended from the same
common ancestral protein [1]. In sequence similarity based approaches, the underlying
assumption is that sequence similarity implies functional similarity. This assumption is
true in many cases. This makes it a popular and widely used technique although it has
been noted that matching sequences do not always infer similar functions and also that
sometimes proteins with dissimilar sequences do possess similar functions.
1.2 New Biomedical Opportunities
The greatest influence of proteomics-based approaches on biomedical research
has not yet been encountered; however, new progress is being achieved. Such approaches
offer a potential in resolving the complex biological intricacies such as the nature of
particular molecular complexes or pathways in disease pathogenesis [2].
3
1.2.1 Drug Development
Drug development is generally associated with the desire to enhance or diminish a
specific activity involved in disease pathogenesis or treatment-involved side-effects.
Most of the drugs employ their effects on the proteins. The strategy of working forward
from the gene is used wherein a specific genetic lesion is identified and the consequent
modifications in protein structure, function and expression are clarified, so that a drug
can be rationally designed to compensate or fix such aberrations. The challenge in
proteomics-based approaches still remains in identifying the target molecules. Cell-
mapping proteomics and expression proteomics are some of the tools used for target
analysis. In addition to target selection, target validation and toxicity studies are also
important.
1.2.2 Infectious diseases
Identification of proteins produced by microorganisms is aided by the small
number of genes and the completion of genome sequencing for many microorganisms.
The aim for these studies has been the search for new diagnostic markers, candidate
antigens for vaccines and determination of virulence. Myobacterium tuberculosis is
probably the most studied microorganism; causing tuberculosis and leading to more than
2 million deaths each year [3]. So an international priority is to improve diagnosis and
develop a more effective vaccine for the same. The findings from the experiments
involving examination of genetic differences between virulent M tuberculosis, virulent M
bovis and the avirulent M bovis BCG are not only useful in identifying potential
4
regulators of virulence but may also provide the basis for discriminating BCG
immunization and presence of virulent tuberculosis [3].
1.2.3 Heart disease
Heart disease, resulting from systemic disease or specific heart-muscle disease, is
one of the leading causes of morbidity and mortality in more developed countries. The
pathogenesis of cardiac dysfunction is still considerably unknown. But a proteomics-
based approach in identifying overall changes in protein expression in heart disease and
heart failure may provide new insights into the cellular mechanisms involved in cardiac
dysfunction. These approaches will further constitute new diagnostic markers and
therapeutic opportunities. The proteomic approach so far has been on dilated
cardiomyopathy, which certainly has multiple causes. The combined results of all these
studies [4-6] have shown that the expression of some 100 cardiac proteins is significantly
different from normal in dilated cardiomyopathy and most of the proteins are less
abundant in the diseased than the normal heart. Most of the identified proteins can be
grouped into three broad functional categories: cytoskeletal and myofibrillar proteins,
proteins associated with mitochondria and energy production, and proteins associated
with stress responses. In addition to the global proteomic approach, cardiac antigens that
elicit specific antibody responses in vivo are being identified. Several of these antigens
may be involved in processes of acute and chronic rejection after cardiac transplantation
and are currently being investigated as potential non-invasive markers [7].
5
1.2.4 Cancer
Although proteomics-based studies are chiefly concerned with the identification
of novel antigens or markers for prognostic, diagnostic or therapeutic use, molecules and
processes implicated in carcinogenesis are also being investigated increasingly. Most
tumor markers in current use were identified using protein-based approaches dating back
to the 1800s. Proteomics-based studies of many tumor types are now underway and they
are likely to benefit from the use of complimentary techniques such as laser capture
micro-dissection to isolate malignant cells for electrophoretic analysis, thus facilitating
the discovery of tumor-specific proteins [8].
1.2.5 Neurological disorders
Alzheimer’s disease is the most common single cause of dementia. Amino acid
sequencing and identification of common protein fragments form the basis for cloning of
the Amyloid Precursor Protein (APP) gene- a component of the plaque found in the
brains of Alzheimer’s disease patients [9, 10]. Many studies of APP have used two-
dimensional electrophoretic and mass spectrometric techniques and is likely to remain a
principal approach in the future.
1.3 Objectives of the project
1. To study the correlation between protein sequence similarity and function
similarity for the yeast proteome in the context of molecular function ontology.
2. To study the accuracy of function prediction in case of the kNN and the weighted
kNN classifier algorithms.
6
3. To study the relation between gene expression patterns and the functions of the
proteins using microarray data.
4. To design a novel approach by combining the sequence similarity score and the
gene expression correlation coefficient to predict the protein function.
5. To study the effect of clustering of gene expression data on the performance of
the weighted kNN classifier.
1.4 Hypothesis
Null Hypothesis:
The means of the percentage prediction accuracies for the classical kNN and the means
for the weighted kNN classification algorithm calculated for different parameter settings
are not significantly different, i.e. μ1 = μ2
The performance of the classical kNN classification algorithm based on the prediction
accuracy is not significantly different from that of the weighted kNN classification
algorithm.
Alternate Hypothesis:
The means of the percentage prediction accuracies for the classical kNN and the means
for the weighted kNN classification algorithm calculated for different parameter settings
are significantly different, i.e. μ1 ≠ μ2
The performance of the classical kNN classification algorithm based on the prediction
accuracy is significantly different from that of the weighted kNN classification algorithm.
7
CHAPTER II
LITERATURE REVIEW
2.1 Protein Function Prediction
The early approaches to predicting protein function were experimental and
usually focused on a specific target gene or protein, or a small set of proteins forming
natural groups such as protein complexes. However, irrespective of the details, these
approaches are low-throughput because of the huge experimental and human effort
required in analyzing a single gene or protein. This has resulted in a continually
expanding sequence-function gap for the discovered proteins. In an attempt to close this
gap, numerous high-throughput experimental procedures have been invented to
investigate the mechanisms leading to the accomplishment of a protein’s function [11].
These procedures have generated a wide variety of useful data that ranges from simple
protein sequences to complex high-throughput data. These data offer different types of
insights into a protein’s function and related concepts. Furthermore, recent years have
seen the recording of this data in very standardized and professionally maintained
databases; some of them listed under.
1. GenBank at the National Center of Biotechnology Information, National Library
of Medicine, Washington D.C. accessible from
http://www.ncbi.nlm.nih.gov/Entrez
8
2. European Molecular Biology Laboratory (EMBL) Outstation at Hixon, England
http://www.ebi.ac.uk/embl/index.html
3. DNA Databank of Japan (DDBJ) at Mishima, Japan
http://www.ddbj.nig.ac.jp/
4. Protein International Resource (PIR) database at the National Biomedical
Research Foundation in Washington D.C., an annotated database
http://www-nbrf.georgetown.edu/pirwww/
5. The SwissProt protein sequence database at ISREC, Swiss Institute for
Experimental Cancer Research in Epalinges/Lausanne, an annotated database
http://www.expasy.ch/cgi-bin/sprot-search-de
6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute
http://srs6.ebi.ac.uk
2.1.1 Existing approaches for function prediction
In the domain of automated function prediction, sequences have been heavily
utilized, in both direct homology-based and indirect subsequence and feature-based
approaches. Specifically, techniques that predict protein function from sequence can be
categorized into three classes, namely, sequence homology based approaches,
subsequence based approaches and feature based approaches.
Sequence homology-based approaches:
In homology-based approaches, proteins are said to have a common ancestry if they have
a high similarity score for their amino acid sequences. Sequence alignment is performed
to get a similarity score and a prediction model is built using these entire sequences.
9
Subsequence-based approaches:
Many times not the entire protein sequences, but only some segments of it are
important for determining the function of a given protein. Consequently, the approaches
in this category treat these segments or subsequences as features of a protein sequence
and construct models for the mapping of these features to protein function. These models
are then used to predict the function of a query protein.
Feature-based approaches:
The feature-based approaches attempt to exploit the perspective that the amino
acid sequence is a unique characterization of a protein, and determine several of its
physical and functional features. These features are used to construct a predictive model
which can map the feature-value vector of a query protein to its function.
There are some computational approaches that have been developed to use the
sequence data for protein function prediction. These methods were tested on the
Saccharomyces Genome Database and also incorporated the Gene Ontology (GO)
annotations for prediction.
Cluster analysis of the gene-expression profiles is a common approach for
predicting functions based on the assumption that genes with similar functions are likely
to be co-expressed. Pavlidis et al. [12] considered the problem of inferring gene
functional classifications from heterogeneous datasets consisting of DNA microarray
expression measurements and phylogenetic profiles from the whole genome sequence
comparisons. They demonstrated the application of the support vector machines (SVM)
10
learning algorithm to this functional inference task. Palvidis et al. used a collection of
DNA microarray hybridization experiment data for 2465 yeast genes along with a
phylogenetic profile for each gene and demonstrated the use of SVM learning algorithm
for heterogeneous data..
Using protein-protein interaction data to assign function for novel proteins is
another approach. Proteins often interact with one another in an interaction network to
achieve a common objective. It is therefore possible to infer the functions of proteins
based on the functions of their interaction partners, also known as “guilt by association”.
Schwikowski et al. [13] applied a neighbor-counting method in predicting the function.
They assigned function to an unknown protein based on the frequencies of its neighbors
having certain functions.
Hishigaki et al. [14] enhanced the performance of the neighbor counting method
based on the protein interaction networks by using χ2 statistics values instead of the raw
number of the neighboring members. Both these approaches give equal significance to
all the functions contributed by the protein neighbors in the interaction network.
Other function prediction methods using high-throughput data include machine-learning
approach and Markov random fields. Clare and King [15] used the C4.5 machine learning
algorithm which was adapted to the analysis of phenotype. The output of the algorithm is
a decision tree or equivalently a set of symbolic rules which allow the output to be
interpreted and compared with existing biological data.
Letovsky and Kasif [16] developed a method of assigning functions based on a
probabilistic analysis of graph neighborhoods in a protein-protein interaction network.
The method exploits the fact that graph neighbors are more likely to share functions than
11
nodes which are not neighbors and a binomial model of local neighbor function labeling
probability is combined with a Markov random field propagation algorithm to assign
function probabilities for proteins in the network.
MAGIC (Multisource Association of Genes by Integration of Clusters) approach
to combine heterogeneous data for function assignment has been applied in yeast by
Troyanskaya et al. [17]. MAGIC is a general framework that uses formal Bayesian
reasoning to integrate heterogeneous types of high-throughput biological data for
accurate gene function prediction.
2.2 Sequence Alignment and Similarity
Sequence alignment is the procedure of comparing two (pair-wise alignment) or
more (multiple sequence alignment) DNA or protein sequences by searching for a series
of individual characters or character patterns that are in the same order in the sequences.
In an optimal alignment, non-identical characters and gaps are placed to bring as many
identical or similar characters as possible into the vertically matching registers.
Sequences that can be readily aligned in this manner are said to be similar.
There are two main types of pair-wise sequence alignments- global and local.
Fig. 2.1 Examples showing global and local alignment between two sequences
Courtesy NCBI BLAST [40].
12
2.2.1 Global Alignment
The global alignment is stretched over the entire sequence length to include as
many amino acids as possible up to and including the sequence ends. A global alignment
is made possible by including gaps either within the middle of the alignment or at either
end of one or both sequences. A global alignment might misalign some areas so that more
amino acids along the entire sequence length can be matched. Sequences that are quite
similar and approximately the same length are suitable candidates for global alignment.
The Needleman-Wunsch algorithm is used to produce global alignments between pairs of
DNA or protein sequences.
2.2.2 Local Alignment
In a local alignment, the alignment stops at the ends of regions of strong similarity
and a much higher priority is given to finding these local regions than to extending the
alignment to include more neighboring amino acid pairs. This type of alignment favors
finding conserved nucleotide patterns in DNA sequences or amino acid domains in
protein sequences. Local alignments are more suitable for aligning sequences that are
similar along some of their lengths but dissimilar in others, sequences that differ in length
or sequences that share a conserved region or domain. The Smith-Waterman algorithm is
used to produce local alignments between pairs of DNA or protein sequences.
2.2.3 Semi-global Alignment
Hybrid methods, known as semi-global or "glocal" methods, attempt to find the
best possible alignment that includes the start and end of one or the other sequence. This
13
can be especially useful when the downstream part of one sequence overlaps with the
upstream part of the other sequence. In this case, neither global nor local alignment is
entirely appropriate: a global alignment would attempt to force the alignment to extend
beyond the region of overlap, while a local alignment might not fully cover the region of
overlap.
Sequence alignment is useful for discovering functional, structural and
evolutionary information in DNA or protein sequences. It is important to obtain the best
possible, so-called “optimal” alignment to discover this information. Sequences that are
very much alike, probably have the same function, be it a regulatory role in the case of
similar DNA molecules or a similar biochemical function and three-dimensional structure
in the case of proteins. Also, if two sequences from different organisms are similar, there
may have been a common ancestor sequence; the sequences are then defined as being
homologous. Their alignment indicates the changes that could have occurred between the
homologous sequences and a common ancestor sequence during evolution. With the
advent of genome analysis and large scale sequence comparisons it becomes important to
recognize that sequence similarity may be an indicator of several possible types of
ancestor relationships; or there may not exist any ancestor relationship at all.
2.2.4 Methods for alignment
Alignment of two sequences is performed using the following methods [18].
14
Dot Matrix analysis
A dot matrix analysis is primarily a method for comparing two sequences to look
for possible alignment of characters between the sequences. Since this method displays
graphically any possible sequence alignments as diagonals on the matrix, it can readily
reveal the presence of insertions/deletions. The method is also used for direct or inverted
repeats in protein and DNA sequences and for predicting regions in RNA that are self-
complimentary.
The major advantage of this method is that all possible matches of residues
between two sequences are found, leaving the investigator the choice of identifying the
most significant ones by examination of the dot matrix for long runs of matches, which
appear as diagonals. The major limitation of this method is that most of the dot matrix
computer programs do not show an actual alignment of the sequences.
Dynamic Programming approach
The dynamic programming approach; first used for global alignment of protein
sequences by Needleman and Wunsch (1970) and for local alignment by Smith and
Waterman (1981); provides one or more alignments of the sequences. The algorithm
generates an alignment by starting at the ends of two sequences and attempting to match
all possible pairs of characters between the sequences and by following a scoring scheme
for matches, mismatches and gaps. This procedure generates a matrix of numbers that
represents all possible alignments between the sequences, where the highest set of
sequential scores in the matrix defines an optimal alignment.
15
For proteins, an amino acid substitution matrix, such as the Dayhoff Percent
Accepted Mutation Matrix 250 (PAM250) or Blosum Substitution Matrix (BLOSUM62)
is used to score matches and mismatches. Similar matrices are available for aligning
DNA sequences.
The dynamic programming algorithm is guaranteed in a mathematical sense to
provide the optimal alignment for a given set of user-defined scoring matrix and gap
penalties. However the algorithm can be slow due to the very large number of
computational steps required, which increase approximately as the square or cube of the
sequence lengths.
Heuristic Methods
The word or k-tuple methods are used by the FASTA and BLAST algorithms,
which align two sequences very quickly by first searching for identical short stretches of
sequences (called words or k-tuples) and then joining these words into an alignment by
the dynamic programming method. These methods are fast enough to be suitable for
searching an entire database for the sequence that aligns best with an input query
sequence. The FASTA and BLAST methods are heuristic; that is an empirical method of
computer programming in which rules of thumb are used to find solutions and feedback
is used to improve performance. These methods are reliable in a statistical sense and
usually provide the best scoring alignment that is possible.
16
2.3 Gene Ontology
2.3.1 Formation of GO
The exponential growth in the availability of molecular sequences, particularly
entire genomic sequences, has transformed both the theory and practice of experimental
biology and medicine. There was a time when the biologists and other research scientists
in the medical field characterized proteins by their diverse activities and abundances and
geneticists classified genes based on the phenotypes of their mutations. But today all
biologists and the other related research community acknowledge that it is likely that
there is just a single limited universe of genes and proteins, most of which are conserved
in most of the living cells. This acknowledgement has fueled a grand unification of
biology. Information about the biological role of such a shared protein in one organism
can certainly shed some light and also at times provide concrete inference of its role in
some other organism.
This progress in the approach that the scientific community describes and
conceptualizes shared biological elements has not been able to keep up with high
throughput sequencing. As a result, the current systems of nomenclature for genes and
their products remain divergent even with the highly appreciated underlying similarities.
This has led to a lack of progress in the interoperability of genomic databases. This was
the basis of the formation of the Gene Ontology (GO) Consortium [19].
The GO project is a collaborative effort to address the need for consistent
descriptions of gene products in different databases. The project began in 1998 as
collaboration between three model organism databases –
1) FlyBase (Drosophila) - database for the fruitfly Drosophila melanogaster
17
2) Saccharomyces Genome Database (SGD) - database for the budding yeast
Saccharomyces cerevisiae
3) Mouse Genome Database (MGD) - databases for the mouse Mus musculus
Since then, the GO Consortium has grown to include many databases, including
several of the world's major repositories for plant, animal and microbial genomes. The
GO project has developed three structured controlled vocabularies (ontologies) that
describe gene products in terms of their associated biological processes, cellular
components and molecular functions in a species-independent manner. There are three
separate aspects to this effort: first, the development and maintenance of the ontologies
themselves; second, the annotation of gene products, which entails making associations
between the ontologies and the genes and gene products in the collaborating databases;
and third, development of tools that facilitate the creation, maintenance and use of
ontologies [20].
2.3.2 Three Categories of Gene Ontology
The GO annotations are based on the following three principles.
1. Cellular component
Cellular component refers to the place in the cell where a gene product is active.
Its ontology describes locations, at the levels of sub cellular structures and
macromolecular complexes. Cellular component includes such terms as
‘ribosome’ or ‘proteasome’ specifying where multiple gene products would be
found. It also includes terms such as ‘nuclear membrane’ or ‘Golgi apparatus’.
18
2. Biological process
Biological process refers to a biological objective to which to which the gene or
gene product contributes. A process is accomplished via one or more ordered
assemblies of molecular functions. They often involve a physical or chemical
transformation, meaning that something goes into a process and some other thing
comes out of it. ‘Cell growth and maintenance’ or ‘signal transduction’ are some
examples of high-level or broad biological process terms; whereas ‘translation’ or
‘pyrimidine metabolism’ are a couple of low-level or more specific biological
process terms.
3. Molecular function
Molecular function refers to the biochemical activity of a gene product and also
the capability that a gene product carries as a potential. These may include
transporting things around, binding to things, holding things together and
changing one thing into another. This is different from the biological processes
the gene product is involved in, which involve more than one activity. Examples
of broad functional terms are ‘enzyme’, ‘transporter’ or ‘ligand’. Examples of
narrower functional terms are ‘adenylate cyclase’ or ‘toll receptor ligand’.
Cellular components, biological processes and molecular functions are all
attributes of genes or gene products and each of them may be assigned independently. A
gene product might be associated with or located in one or more cellular components, it is
19
Fig. 2.2 Cellular Component GO nodes
Courtesy Xie et al. [24]
Fig. 2.3 Biological Process GO nodes
Courtesy Xie et al. [24]
20
Fig. 2.4 Molecular Function GO nodes
Courtesy Xie et al. [24]
active in one or more biological processes, during which it performs one or more
molecular functions.
The relationships between a gene product to cellular component, biological
process and molecular function are one-to-many, reflecting the biological reality that a
particular protein may function in several processes, contain domains that carry out
diverse molecular functions and participate in alternative interactions with other proteins,
organelles or locations in the cells.
For example, the gene product ‘cytochrome c’ can be described by the molecular
function term ‘oxidoreductase activity’, the biological process terms ‘oxidative
phosphorylation’ and ‘induction of cell death’, and the cellular component terms
‘mitochondrial matrix’ and ‘mitochondrial inner membrane’.
21
2.3.3 Standardized format
The ontologies constituting GO are modeled as a generic category of graphs
known as directed acyclic graphs (DAGs), which have numerous applications in
computer science, such as Bayesian networks and parse trees created by compilers. Each
node in these graphs represents a specific functional label and is assigned a unique GO id
of the form GO: XXXXXXX, and each edge represents either an is:a or a part:of
relationship. This well-defined structure makes GO easily usable by both humans and
computers [20].
2.3.4 GO is the way to go
Since the establishment of GO, many ontology-based sequence annotation
approaches have been developed including several web-based automated GO annotation
software tools. These attempts typically involve a search of homologous proteins in GO-
mapped databases including Genbank and Swiss-Prot.
Hennig et al’s GOblet [21] is a fully integrated local analysis system comprising
of a GO parser which maps the GO hierarchy onto a set of linearized trees. Every query
sequence generates a complete set of relevant GO Ids obtained from BLAST and this set
is used to build a summary tree. The total counts per GO Id are also given to identify the
significant GO terms.
Zehetner’s OntoBlast [22] presents a list of homologues together with their GO
terms. It uses genome data from nine species preprocessed for this tool. In this tool
available online, the use of BLAST is prominent. E-values signifying the similarity of the
22
gene names in the list are used along with weighting numbers to sort the term list. They
give an indication of how strong the evidence for a term is relative to the other terms.
Martin et al’s GOtcha searches a set of seven model genomes and returns scored
matches [23].
Xie et al’s [24] GO Engine combines sequence homology search based on GO
annotations with text information analysis and cellular localization prediction to increase
the annotation accuracy.
Schug et al [25] developed a rule-based function prediction method based on the
intersection of GO terms that contain protein domain at different similarity levels.
Abascal et al [26] presented an automatic annotation method based on protein
family identification.
Jensen et al [27] used neural networks for the prediction while Vinayagam et al.
[28] used support vector machines to increase the performance of the prediction tool.
The appeal of these approaches is that they can directly assign a biological
meaning to an uncharacterized protein sequence.
2.4 Data Classification
Classification is defined as a procedure in which individual items are placed into
groups based on quantitative information on one or more characteristics inherent in the
items and based on a training set of previously labeled items.
The goal of classification is to build a set of models that can correctly predict the class of
the different objects. The input to these methods is a set of objects (training data), the
classes which these objects belong to (dependent variables) and a set of variables
23
describing different characteristics of the objects (independent variables). Once such a
predictive model is built, it can be used to predict the class of the objects for which class
information is not known a priori. The key advantage of supervised learning methods
over unsupervised methods like clustering is that by having an explicit knowledge of the
classes the different objects belong to, these algorithms can perform an effective feature
selection if that leads to better prediction accuracy.
Some of the major classification models are listed below.
1. Classification by decision tree induction
2. Bayesian Classification
3. Neural Networks
4. Support Vector Machines (SVM)
5. Classification Based on Associations
6. kNN Classification
2.5 Microarrays and Gene Expression
2.5.1 Microarray technology
The traditional approach to genomic research was focused on the collection and
local examination of data on single genes. Today, microarray technologies have made it
possible to monitor the expression levels for tens of thousands of genes in parallel. A
microarray works by exploiting the ability of a given mRNA molecule to bind
specifically to, or hybridize to, the DNA template from which it originated. By using an
array containing many DNA samples, scientists can determine, in a single experiment,
24
the expression levels of hundreds or thousands of genes within a cell by measuring the
amount of mRNA bound to each site on the array. With the aid of a computer, the amount
of mRNA bound to the spots on the microarray is precisely measured, generating a
profile of gene expression in the cell.
A microarray is typically a glass slide on to which DNA molecules are fixed in an
orderly manner at specific locations called spots (or features). A microarray may contain
thousands of spots and each spot may contain a few million copies of identical DNA
molecules that uniquely correspond to a gene as shown in Fig. 2.5. The DNA in a spot
may either be genomic DNA or short stretch of oligo-nucleotide strands that correspond
to a gene. The spots are printed on to the glass slide by a robot or are synthesized by the
process of photolithography. Microarrays may be used to measure gene expression in
many ways, but one of the most popular applications is to compare expression of a set of
genes from a cell maintained in a particular condition (condition A) to the same set of
genes from a reference cell maintained under normal conditions (condition B).
Fig. 2.6 gives a general picture of the experimental steps involved. First, RNA is
extracted from the cells. Next, RNA molecules in the extract are reverse transcribed into
cDNA by using an enzyme reverse transcriptase and nucleotides labeled with different
fluorescent dyes. For example, cDNA from cells grown in condition A may be labeled
with a red dye and from cells grown in condition B with a green dye. Once the samples
have been differentially labeled, they are allowed to hybridize onto the same glass slide.
At this point, any cDNA sequence in the sample will hybridize to specific spots on the
glass slide containing its complementary sequence. The amount of cDNA bound to a spot
25
Fig. 2.5 A schematic representation of a microarray chip [29]
will be directly proportional to the initial number of RNA molecules present for that gene
in both samples. Following the hybridization step, the spots in the hybridized microarray
are excited by a laser and scanned at suitable wavelengths to detect the red and green
dyes. The amount of fluorescence emitted upon excitation corresponds to the amount of
bound nucleic acid. For instance, if cDNA from condition A for a particular gene was in
greater abundance than that from condition B, one would find the spot to be red. If it was
the other way, the spot would be green. If the gene was expressed to the same extent in
both conditions, one would find the spot to be yellow, and if the gene was not expressed
26
in both conditions, the spot would be black. Thus, what is seen at the end of the
experimental stage is an image of the microarray, in which each spot that corresponds to
a gene has an associated fluorescence value representing the relative expression level of
that gene.
Fig. 2.6 Schematic for the experimental setup for any microarray gene expression
analysis [29]
The microarray is scanned following hybridization and a TIFF image file is
normally generated. Once image generation is completed, the image is analyzed to
identify spots. We saw that the relative expression level for a gene can be measured as
27
the amount of red or green light emitted after excitation. The most common metric used
to relate this information is called expression ratio. It is denoted here as Tk and defined as:
Rk
Tk =
Gk
for each gene k on the array, where Rk represents the spot intensity metric for the test
sample and Gk represents the spot intensity metric for the reference sample. This
expression ratio undergoes transformation and normalization to bring out the differential
gene expression [29].
2.5.2 Gene Expression Data
The processed data, after the normalization procedure, can then be represented in
the form of a matrix, often called gene expression matrix; the different forms of which
are shown in Table 2.1. Each row in the matrix corresponds to a particular gene and each
column could either correspond to an experimental condition or a specific time point at
which expression of the genes has been measured. The expression levels for a gene
across different experimental conditions are cumulatively called the gene expression
profile, and the expression levels for all genes under an experimental condition are
cumulatively called the sample expression profile. Once we have obtained the gene
expression matrix, additional levels of annotation can be added either to the gene or to the
sample. For example, the function of the genes can be provided, or the additional details
on the biology of the sample may be provided, such as disease state or normal state.
28
Table 2.1 Representation of gene expression data [29]
To make any meaningful comparison or biological analysis, one should know what the
data in the gene expression matrix represents. Expression data can be represented in five
different ways, which are described below:
1. Absolute measurement
In the case of an absolute measurement, each cell in the matrix will represent the
expression level of the gene in abstract units.
29
2. Relative measurement or expression ratio
In the case of a relative measurement or representations involving expression
ratio, the expression level of a gene in abstract units is normalized with respect to
its expression in a reference condition. This gives the expression ratio of the gene
in relative units. Note that in such cases, a ratio of 4000/100 will lead to the same
result as 40/10.
3. log2(expression ratio)
In the case of tables representing the log2(expression ratio) values, information on
upregulation and down-regulation is captured and is mapped in a symmetric
manner. For example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold
down-regulation maps to log2(1/4) = -2.
4. Discrete values
Another way of representing information is to convert to discrete numbers the
values in the tables mentioned above. In the case of converting the absolute
measurement to discrete numbers, a binary expression matrix of 1 and 0 can be
used, where 1 means that the gene is expressed above a user defined threshold,
and 0 means that the gene is expressed below this threshold. In the case of making
the relative expression tables or log2 (expression ratio) tables discrete, values can
be divided into 3 classes, +1, 0 and –1, where +1 represents a gene that is
positively regulated, 0 represents a gene that is not differentially regulated and –1
represents a gene that is repressed.
30
5. Representation of expression profiles as vectors
So far we have seen how individual cells in the gene expression matrix can be
represented. Similarly, an expression profile (of a gene or a sample) can be
thought of as a vector and can be represented in vector space. For example, an
expression profile of a gene can be considered as a vector in n dimensional space
(where n is the number of conditions), and an expression profile of a sample with
m genes can be considered as a vector in m dimensional space (where m is the
number of genes).
2.6 Data Clustering
One of the goals of microarray data analysis is to cluster genes or samples with
similar expression profiles together to make meaningful biological inference about the set
of genes or samples. Clustering is one of the unsupervised approaches to classify data
into groups of genes or samples with similar patterns that are characteristic to the group.
Clustering methods can be hierarchical, that is grouping objects into clusters and
specifying relationships among objects in a cluster, resembling a phylogenetic tree or
non-hierarchical, that is grouping into clusters without specifying relationships between
objects in a cluster as schematically represented in Fig. 2.6.
31
Fig. 2.7 Various clustering methods
2.6.1 Hierarchical clustering: agglomerative
In the case of a hierarchical agglomerative clustering, the objects are successively
fused until all the objects are included. For a hierarchical agglomerative clustering
procedure, each object is considered as a cluster. The first step is the calculation of
pairwise distance measures for the objects to be clustered. Based on the pairwise
distances between them, objects that are similar to each other are grouped into clusters.
After this is done, pairwise distances between the clusters are re-calculated, and clusters
that are similar are grouped together in an iterative manner until all the objects are
included into a single cluster. This information can be represented as a dendrogram,
where the distance from the branch point is indicative of the distance between the two
clusters or objects. Comparison of clusters with another cluster or an object can be
carried out using four approaches
32
1. Single linkage clustering (Minimum distance)
In single linkage clustering, distance between two clusters is calculated as the
minimum distance between all possible pairs of objects, one from each cluster. This
method has an advantage that it is insensitive to outliers. This method is also known
as the nearest neighbor linkage.
2. Complete linkage clustering (Maximum distance)
In complete linkage clustering, distance between two clusters is calculated as the
maximum distance between all possible pairs of objects, one from each cluster. The
disadvantage of this method is that it is sensitive to outliers. This method is also
known as the farthest neighbor linkage.
3. Average linkage clustering
In average linkage clustering, distance between two clusters is calculated as the
average of distances between all possible pairs of objects in the two clusters.
4. Centroid linkage clustering
In centroid linkage clustering, an average expression profile (called a centroid) is
calculated.
33
2.6.2 Hierarchical clustering: divisive
Hierarchical divisive clustering is the opposite of the agglomerative method,
where the entire set of objects is considered as a single cluster and is broken down into
two or more clusters that have similar expression profiles. After this is done, each cluster
is considered separately and the divisive process is repeated iteratively until all objects
have been separated into single objects.
2.6.3 Non-hierarchical clustering
Non-hierarchical clustering requires predetermination of the number of clusters.
Non-hierarchical clustering then groups existing objects into these predefined clusters
rather than organizing them into a hierarchical structure.
1. Non-hierarchical clustering: K-means
K-means is a popular non-hierarchical clustering method. In K-means clustering,
the first step is to arbitrarily group objects into a predetermined number of clusters.
The number of clusters can be chosen randomly or estimated by first performing a
hierarchical clustering of the data. Following this step, an average expression
profile (centroid) is calculated for each cluster, this is called initialization. Next,
individual objects are reattributed from one cluster to the other depending on which
centroid is closer to the gene. This procedure of calculating the centroid for each
cluster and re-grouping objects closer to available centroids is performed in an
iterative manner for a fixed number of times, or until convergence.
34
2. Non-hierarchical clustering: Self Organizing Maps
Fig. 2.8 Self-Organizing Maps
Self Organizing Maps (SOMs) work in a manner similar to K-means clustering
(Fig. 2.8). In K-means clustering, one chooses the number of clusters to fit the data,
whereas with SOM the first step is to choose the number and orientation of the
clusters with respect to each other. For example, a two-dimensional grid of nodes
could be the starting point. The grid is projected onto the expression space and each
object is assigned a node that is nearest to it – this is called initialization. In the next
step, a random object is chosen and the node (called a reference vector) which is in
the neighborhood of the object is moved closer to it. The other nodes are moved to a
small extent depending on how close they are to the object chosen. In successive
iterations, with randomly chosen objects, the positions of the nodes are refined and
the radius of neighborhood becomes confined. In this way, the grid of nodes
(initially a two-dimensional grid) is deformed to fit the data. The advantage of this
method, unlike K-means, is that SOM does not force the number of clusters to be
equal to the number of starting nodes in the chosen grid. This is because some
nodes may have no objects associated with them when the map is complete. Other
35
advantages of SOM include providing information on the similarity between the
nodes, and the ability of SOM to produce reliable results even with noisy data.
2.6.4 Predicting protein interactions and protein functions
Integrating expression data with other external information, for example
evolutionary conservation of proteins, have been used to predict interacting proteins,
protein complexes, and protein function. Work by Ge et al. [30] and Jansen and Gerstein
[31] have shown that genes with similar expression profiles are more likely to encode
proteins that interact. When this information is combined with evolutionary conservation
of proteins, meaningful predictions can be made. In a recent work by Teichmann and
Madan Babu [32], it was shown that proteins that are evolutionarily conserved in yeast
and worm and that have similar expression profiles in both organisms tend to be a part of
the same stable complex or interact physically. Noort et al [33] have also shown that the
encoded proteins of conserved, co-expressed gene pairs are highly likely to be part of the
same pathway. Such studies enable us to predict specific gene functions.
Clustering techniques have proven to be helpful to understand gene function, gene
regulation, cellular processes, and subtypes of cells. Genes with similar expression
patterns (co-expressed genes) can be clustered together with similar cellular functions.
This approach may further help in understanding the functions of many genes for which
information has not been previously available. Furthermore, co-expressed genes in the
same cluster are likely to be involved in the same cellular processes, and a strong
correlation of expression patterns between those genes indicates co-regulation. Searching
for common DNA sequences at the promoter regions of genes within the same cluster
36
allows regulatory motifs specific to each gene cluster to be identified and cis-regulatory
elements to be proposed. The inference of regulation through the clustering of gene
expression data also gives rise to hypotheses regarding the mechanism of the
transcriptional regulatory network. Finally, clustering different samples on the basis of
corresponding expression profiles may reveal sub-cell types which are hard to identify by
traditional morphology-based approaches.
37
CHAPTER III
MATERIALS AND METHODS
3.1 Data Set
The crucial step in developing a good prediction system is always to obtain a
good data set. In this study, we used the complete yeast (Sacchyromyces cerevisiae)
proteome.
Before we started designing the algorithm, we downloaded and studied three major
data files
1. Yeast.dat, which contains the sequence information for the entire proteome
2. GOGroups.dat, which contains the annotation details for the Gene Ontology
3. YeastMicroarray.dat, which contains the gene expression data for a complete
yeast cell-cycle
3.1.1 Complete yeast proteome
The July 2005 version of the complete proteome data consisting of 6247 proteins
was downloaded from the UniProt Knowledgebase (Swiss-Prot and TrEMBL) – Protein
Knowledgebase using the ExPASy (Expert Protein Analysis System) proteomics server
of the Swiss Institute of Bioinformatics (SIB) dedicated to the analysis of protein
sequences and structures [34].
38
Swiss-Prot is an annotated protein sequence database, which was established in 1986 and
maintained collaboratively, since 1987, by the group of Amos Bairoch first at the
Department of Medical Biochemistry of the University of Geneva and now at the Swiss
Institute of Bioinformatics (SIB) and the EMBL Data Library (now the EMBL
Outstation- the European Bioinformatics Institute (EBI)).
The Swiss-Prot Protein Knowledgebase consists of sequence entries. Each entry
corresponds to a single contiguous sequence as contributed to the bank or reported in the
literature. The entries are structured so as to be usable by human readers as well as
computer programs. The explanations, descriptions, classifications and other comments
are in English. Wherever possible, symbols familiar to biochemists, protein chemists and
molecular biologists are used. Each sequence entry is composed of lines. Different types
of lines, each with their own format, are used to record the various data that make up the
entry. Sequence entries are composed of different line types, each with its own format
[35].
For each sequence entry the core data consists of:
• The sequence data;
• The citation information (bibliographical references);
• The taxonomic data (description of the biological source of the protein).
A sample sequence entry is shown below.
ID 2A5D_YEAST STANDARD; PRT; 757 AA.
AC P38903;
DT 01-FEB-1995 (Rel. 31, Created)
DT 01-OCT-1996 (Rel. 34, Last sequence update)
DT 10-MAY-2005 (Rel. 47, Last annotation update)
DE Serine/threonine protein phosphatase 2A, 56 kDa regulatory subunit,
DE delta isoform (PP2A, B subunit, B' delta isoform) (RTS1 protein) (SCS1
39
DE protein).
GN Name=RTS1; Synonyms=SCS1; OrderedLocusNames=YOR014W; ORFNames=OR26.04;
OS Saccharomyces cerevisiae (Baker's yeast).
OC Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
OC Saccharomycetales; Saccharomycetaceae; Saccharomyces.
OX NCBI_TaxID=4932;
RN [1]
RP NUCLEOTIDE SEQUENCE.
RX MEDLINE=96271526; PubMed=8846889;
RA Evangelista C.C. Jr., Rodriguez Torres A.M., Limbach M.P.,
RA Zitomer R.S.;
RT "Rox3 and Rts1 function in the global stress response pathway in
RT baker's yeast.";
RL Genetics 142:1083-1093(1996).
RN [2]
RP NUCLEOTIDE SEQUENCE, AND SUBCELLULAR LOCATION.
RX MEDLINE=96009589; PubMed=7565713;
RA Shu Y., Hallberg R.L.;
RT "SCS1, a multicopy suppressor of hsp60-ts mutant alleles, does not
RT encode a mitochondrially targeted protein.";
RL Mol. Cell. Biol. 15:5618-5626(1995).
RN [3]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=S288c / FY1679;
RX MEDLINE=97313270; PubMed=9169874;
RA Dujon B., Albermann K., Aldea M., Alexandraki D., Ansorge W.,
RA Arino J., Benes V., Bohn C., Bolotin-Fukuhara M., Bordonne R.,
RA Boyer J., Camasses A., Casamayor A., Casas C., Cheret G.,
RA Cziepluch C., Daignan-Fornier B., Dang D.V., de Haan M., Delius H.,
RA Durand P., Fairhead C., Feldmann H., Gaillon L., Galisson F.,
RA Gamo F.-J., Gancedo C., Goffeau A., Goulding S.E., Grivell L.A.,
RA Habbig B., Hand N.J., Hani J., Hattenhorst U., Hebling U.,
RA Hernando Y., Herrero E., Heumann K., Hiesel R., Hilger F., Hofmann B.,
RA Hollenberg C.P., Hughes B., Jauniaux J.-C., Kalogeropoulos A.,
RA Katsoulou C., Kordes E., Lafuente M.J., Landt O., Louis E.J.,
RA Maarse A.C., Madania A., Mannhaupt G., Marck C., Martin R.P.,
RA Mewes H.-W., Michaux G., Paces V., Parle-McDermott A.G., Pearson B.M.,
RA Perrin A., Pettersson B., Poch O., Pohl T.M., Poirey R.,
RA Portetelle D., Pujol A., Purnelle B., Ramezani Rad M., Rechmann S.,
RA Schwager C., Schweizer M., Sor F., Sterky F., Tarassov I.A.,
RA Teodoru C., Tettelin H., Thierry A., Tobiasch E., Tzermia M.,
RA Uhlen M., Unseld M., Valens M., Vandenbol M., Vetter I., Vlcek C.,
RA Voet M., Volckaert G., Voss H., Wambutt R., Wedler H., Wiemann S.,
RA Winsor B., Wolfe K.H., Zollner A., Zumstein E., Kleine K.;
RT "The nucleotide sequence of Saccharomyces cerevisiae chromosome XV.";
RL Nature 387:98-102(1997).
RN [4]
RP FUNCTION, SUBUNIT, AND PHOSPHORYLATION.
RX MEDLINE=97299674; PubMed=9154823;
RA Shu Y., Yang H., Hallberg E., Hallberg R.;
RT "Molecular genetic analysis of Rts1p, a B' regulatory subunit of
RT Saccharomyces cerevisiae protein phosphatase 2A.";
RL Mol. Cell. Biol. 17:3242-3253(1997).
CC -!- FUNCTION: The B regulatory subunit might modulate substrate
CC selectivity and catalytic activity, and also might direct the
CC localization of the catalytic enzyme to a particular subcellular
CC compartment.
CC -!- FUNCTION: Multicopy suppressor of ROX3 and HSP60.
CC -!- SUBUNIT: PP2A consists of a common heterodimeric core enzyme,
CC composed of a 36 kDa catalytic subunit (subunit C) and a 65 kDa
CC constant regulatory subunit (PR65 or subunit A), that associates
CC with a variety of regulatory subunits. Proteins that associate
CC with the core dimer include three families of regulatory subunits
CC B (the R2/B/PR55/B55, R3/B''/PR72/PR130/PR59 and R5/B'/B56
CC families), the 48 kDa variable regulatory subunit, viral proteins,
CC and cell signaling molecules.
CC -!- INTERACTION:
CC P47135:JSN1; NbExp=1; IntAct=EBI-1931, EBI-9422;
CC P53958:YNL042W; NbExp=1; IntAct=EBI-1931, EBI-28694;
40
CC -!- SUBCELLULAR LOCATION: Cytoplasmic and nuclear.
CC -!- PTM: Phosphorylated.
CC -!- SIMILARITY: Belongs to the phosphatase 2A regulatory subunit B
CC family.
CC --------------------------------------------------------------------------
CC This SWISS-PROT entry is copyright. It is produced through a collaboration
CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC the European Bioinformatics Institute. There are no restrictions on its
CC use by non-profit institutions as long as its content is in no way
CC modified and this statement is not removed. Usage by and for commercial
CC entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC or send an email to license@isb-sib.ch).
CC --------------------------------------------------------------------------
DR EMBL; U06630; AAB38372.1; -; Genomic_DNA.
DR EMBL; S79635; AAB35312.1; -; Genomic_DNA.
DR EMBL; X87331; CAA60763.1; -; Genomic_DNA.
DR EMBL; Z74922; CAA99203.1; -; Genomic_DNA.
DR PIR; S54620; S54620.
DR IntAct; P38903; -.
DR GermOnline; 143602; -.
DR Ensembl; YOR014W; Saccharomyces cerevisiae.
DR SGD; S000005540; RTS1.
DR GO; GO:0005935; C:bud neck; IDA.
DR GO; GO:0005737; C:cytoplasm; IDA.
DR GO; GO:0005634; C:nucleus; IDA.
DR GO; GO:0000159; C:protein phosphatase type 2A complex; TAS.
DR GO; GO:0005816; C:spindle pole body; IDA.
DR GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.
DR GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.
DR InterPro; IPR008938; ARM.
DR InterPro; IPR002554; B56.
DR Pfam; PF01603; B56; 1.
KW Complete proteome; Hydrolase; Nuclear protein; Phosphorylation;
KW Protein phosphatase.
FT COMPBIAS 16 22 Poly-Ser.
FT COMPBIAS 46 51 Poly-Ser.
FT COMPBIAS 98 110 Poly-Ser.
FT COMPBIAS 143 147 Poly-Ser.
FT COMPBIAS 202 213 Poly-Asn.
FT CONFLICT 95 95 T -> S (in Ref. 2).
FT CONFLICT 529 529 L -> F (in Ref. 1).
FT CONFLICT 581 581 A -> R (in Ref. 1).
SQ SEQUENCE 757 AA; 85335 MW; 5A7476C30140331C CRC64;
MMRGFKQRLI KKTTGSSSSS SSKKKDKEKE KEKSSTTSST SKKPASASSS SHGTTHSSAS
STGSKSTTEK GKQSGSVPSQ GKHHSSSTSK TKTATTPSSS SSSSRSSSVS RSGSSSTKKT
SSRKGQEQSK QSQQPSQSQK QGSSSSSAAI MNPTPVLTVT KDDKSTSGED HAHPTLLGAV
SAVPSSPISN ASGTAVSSDV ENGNSNNNNM NINTSNTQDA NHASSQSIDI PRSSHSFERL
PTPTKLNPDT DLELIKTPQR HSSSRFEPSR YTPLTKLPNF NEVSPEERIP LFIAKVDQCN
TMFDFNDPSF DIQGKEIKRS TLDELIEFLV TNRFTYTNEM YAHVVNMFKI NLFRPIPPPV
NPVGDIYDPD EDEPVNELAW PHMQAVYEFF LRFVESPDFN HQIAKQYIDQ DFILKLLELF
DSEDIRERDC LKTTLHRIYG KFLSLRSFIR RSMNNIFLQF IYETEKFNGV AELLEILGSI
INGFALPLKE EHKVFLVRIL IPLHKVRCLS LYHPQLAYCI VQFLEKDPLL TEEVVMGLLR
YWPKINSTKE IMFLNEIEDI FEVIEPLEFI KVEVPLFVQL AKCISSPHFQ VAEKVLSYWN
NEYFLNLCIE NAEVILPIIF PALYELTSQL ELDTANGEDS ISDPYMLVEQ AINSGSWNRA
IHAMAFKALK IFLETNPVLY ENCNALYLSS VKETQQRKVQ REENWSKLEE YVKNLRINND
KDQYTIKNPE LRNSFNTASE NNTLNEENEN DCDSEIQ
//
Each line begins with a two-character line code, which indicates the type of data
contained in the line. The current line types and line codes and the order in which they
appear in an entry, are shown in the Table 3.1.
41
Line code Content Occurrence in an entry
ID Identification Once; starts the entry
AC Accession number(s) Once or more
DT Date Three times
DE Description Once or more
GN Gene name(s) Optional
OS Organism species Once
OG Organelle Optional
OC Organism classification Once or more
OX Taxonomy cross-reference Once
RN Reference number Once or more
RP Reference position Once or more
RC Reference comment(s) Optional
RX Reference cross-reference(s) Optional
RG Reference group Once or more (Optional if RA line)
RA Reference authors Once or more (Optional if RG line)
RT Reference title Optional
RL Reference location Once or more
CC Comments or notes Optional
DR Database cross-references Optional
KW Keywords Optional
FT Feature table data Optional
SQ Sequence header Once
(blanks) Sequence data Once or more
// Termination line Once; ends the entry
Table 3.1 Line types and line codes for Swiss-Prot sequence entry [35]
The two-character line-type code that begins each line is always followed by three
blanks, so that the actual information begins with the sixth character. In general,
information is not extended beyond character position 75, there are however a few
exceptions where lines may be longer. Since the entire database follows the same
standardized format, it is computer readable and can be programmed to access this
database information and process it further for the analysis.
Out of this information about one protein sequence, only the information on
42
1. AC – accession number,
2. DR – database cross-reference and
3. SQ – sequence
is actually used for the prediction algorithm.
The AC (Acession number) line lists the accession number(s) associated with an entry.
The format of the AC line is:
AC AC_number_1;[ AC_number_2;]...[ AC_number_N;]
An example of an accession number line is shown below:
AC P00321;
Semicolons separate the accession numbers and a semicolon terminates the list. If the
information is unable to fit in a single line, more than one AC lines can be used.
The accession number is used by the prediction algorithm to uniquely identify each
protein and get its annotation from the Gene Ontology (GO).
The DR (Database cross-Reference) lines are used as pointers to information related to
entries and found in data collections other than Swiss-Prot.
The format of the DR line is:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[;
TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].
The first item on the DR line, the 'DATABASE_IDENTIFIER', is the abbreviated name
of the data collection to which reference is made. The second item on the DR line, the
'PRIMARY_IDENTIFIER', is an unambiguous pointer to the information entry in the
database to which the reference is being made. The third item on the DR line, the
43
'SECONDARY_IDENTIFIER', is generally used to complement the information given
by the first identifier. A number of DR lines possess a fourth item, the
'TERTIARY_IDENTIFIER', which is generally used to give more detailed information.
Since we are using the GO annotation, we deal with the DR lines with the
'DATABASE_IDENTIFIER' as GO. For GO, the 'PRIMARY_IDENTIFIER' gives us the
primary GO id which gives is actually referenced in the GO database. The
'SECONDARY_IDENTIFIER' is either one of the characters C, F or P for the 3 principal
categories of the GO Database- C: cellular component; F: molecular function; and P:
biological process.
DR GO; GO:0005816; C:spindle pole body; IDA.
DR GO; GO:0000158; F:protein phosphatase type 2A activity; TAS.
DR GO; GO:0006470; P:protein amino acid dephosphorylation; TAS.
Only molecular function (F) ontology is considered for the prediction model.
The SQ (sequence data) line has a line code consisting of two blanks rather than the two-
letter codes used until now. The sequence counts 60 amino acids per line, in groups of 10
amino acids, beginning in position 6 of the line.
An example of sequence data preceded by the corresponding SQ line is shown here:
SQ SEQUENCE 97 AA; 9110 MW; E3C20C259858B830 CRC64;
MTILASICKL GNTKSTSSSI GSSYSSAVSF GSNSVSCGEC GGDGPSFPNA SPRTGVKAGV
NVDGLLGAIG KTVNGMLISP NGGGGGMGMG GGSCGCI
The // (terminator) line contains no data or comments and designates the end of an entry.
3.1.2 GO Annotations
A GO definition file was obtained from the Gene Ontology consortium web site
[36]. In the version of July 2005, there are 19094 GO terms including
44
• 9856 biological process terms,
• 7559 molecular function terms, and
• 1679 cellular component terms.
The GO annotation file is available in the OBO file format and is maintained in
ASCII doc format in the Gene Ontology CVS repository. The OBO file format itself
attempts to achieve the goals of human readability, ease of parsing, extensibility and
minimal redundancy.
The OBO file format consists of a and s. The header is an
unlabeled section at the beginning of the document containing tag-value pairs. A stanza is
a labeled section of the document, indicating that an object of a particular type is being
described. Stanzas consist of a stanza name in square brackets, and a series of tag-value
pairs. Tag-value pairs consist of a tag name, a colon, the tag value, and a newline.
: {} !
The tag name is always a string. The value is always a string, but the value string may
require special parsing depending on the tag with which it is associated. In general, tag-
value pairs occur on a single line.
A sample stanza entry is shown.
[Term]
id: GO:0000006
name: high affinity zinc uptake transporter activity
namespace: molecular_function
def: "Catalysis of the reaction: Zn2+(out) = Zn2+(in), probably powered by proton motive
force." [TC:2.A.5.1.1]
xref_analog: TC:2.A.5.1.1
is_a: GO:0005385 ! zinc ion transporter activity
At present, every OBO stanza always begins with an id tag. The value of the id tag
announces the object to which the rest of the tags in the stanza refer. Normal, non-
anonymous ids have global scope. An object has the same id in every file, and in every
45
namespace. In our parsing system for using the GO annotation, we just deal with some of
the tags mentioned below.
id The unique id of the current term
name The term name. Any term may have only one name defined
namespace The ontology to which the term belongs
alt_id Defines an alternate id for this term
def The definition of the current term
xref A dbxref that describes an analagous term in another vocabulary
is_a This tag describes a subclassing relationship between one term and
another. The value is the id of the term of which this term is a subclass. A
term may have any number of is_a relationships.
relationship This tag describes a typed relationship between this term and another term.
is_obsolete Whether or not this term is obsolete.
3.1.3 Yeast Gene Expression Data
The second kind of yeast data used was the gene expression data for a complete
yeast cell cycle. The availability of complete sequence for the Saccharomyces cerevisiae
genome has made it possible to measure mRNA transcript levels for virtually every yeast
gene. In the study by Cho et al. [37], commercially available high-density oligonucleotide
arrays were used to estimate mRNA transcript levels in synchronized yeast cells at
regular intervals during the cell cycle. DNA oligonucleotide probes are directly
synthesized on these arrays without individual manipulation or PCR amplification,
minimizing the potential for cross-hybridization or clone error. The data was generated
by the Yeast Cell Cycle Analysis Project by Spellman et al. [38], which is available at the
Genomics server of Stanford University Medical Center [39].
46
The dataset includes 24 gene expression profiles measuring the mRNA expression
levels of the proteins during the growth of yeast cells. Each profile includes 6178 related
proteins. A small part of the micro array data is shown in Table 3.2. The rows represent
different genes while the columns represent the gene expression levels at different time
instants.
cdc15_10 cdc15_30 cdc15_50 cdc15_70 cdc15_80 cdc15_90 cdc15_100
YBR217W -0.26 -0.48 -0.07 -0.34 0.23 -0.48 0.35
YBR218C 0.08 0.13 -0.35 -0.25 -0.25 -0.23 -0.16
YBR219C 0.26 0.17 0.18 0.15 0 0.04 -0.42
YBR220C 0.25 -0.09 0.09 -0.14 0.33 -0.14 0.4
YBR221C -0.56 -0.73 -0.5 -0.18 0.05 0.01 -0.46
YBR222C -0.84 -0.79 -0.02 0.08 0.48 0.26 0.25
YBR223C 0.13 0.18 0.18 -0.25 -0.21 -0.04 -0.32
YBR224W -0.2 0.24 0.02 -0.62 0.17 -0.44 0.27
YBR225W 0.2 0.34 0.2 -0.1 -0.16 0.07 -0.07
YBR226C 0.23 0.42 0.1 -0.1 0.11 -0.05 0.25
YBR227C 0.29 0.62 0.68 -0.01 -0.61 0.27 -0.29
YBR228W 0.11 0.25 -0.04 -0.32 0.21 -0.22 0.07
YBR229C 0.02 -0.16 -0.16 -0.06 -0.09 0.15 -0.42
YBR230C -0.3 -0.69 -0.56 -0.88 0.68 -0.2 0.63
YBR231C 0 0.37 -0.11 -0.5 -0.1 -0.36 0.03
YBR232C -0.04 0.29 0 -0.55 -0.06 -0.61 -0.09
YBR233W -0.06 0.22 0.19 -0.26 -0.1 -0.52 0.14
YBR234C -0.53 -0.8 -0.76 -0.29 0.49 -0.24 0.17
YAL020C -0.38 -0.42 -0.28 -0.15 0.16 -0.12 0.26
YAL021C 0.17 0.22 -0.09 0.19 -0.17 0.16
Table 3.2 Yeast gene expression data [39]
The data was recorded over a complete yeast cell cycle. The columns represent different
time slots varying from 10 minutes to 270 minutes. The rows represent the different
genes and the data is the gene expression levels in the form of logarithmic ratios.
47
3.2 Data pre-processing
The files downloaded from the various servers are not used in their entirety. So
they need to be pre-processed to extract the actual information that is necessary and also
to get them in the appropriate format to be processed easily and in a time-efficient
manner.
From the Yeast Proteome file (Yeast.dat) we extract the AC line, the DR line with
the GO database reference to the molecular function (F) ontology and the SQ line with
the complete sequence. The GO ids for the annotated molecular functions are stored
along with the protein accession id in a separate file. The protein accession id along with
the sequence is stored in another file. The first file forms a lookup table to look up for the
GO annotation for any yeast protein. The sequences in the latter file are used in the
BLAST program to obtain the similarity scores among all the sequences.
Similarly, from the Gene Ontology file, only the required information like the GO
id and the “is_a” or “part_of” information are extracted for the molecular function
ontology to form a file from which a tree structure of the entire GO annotation can be
built. The obsolete terms are not considered and the alternate ids (alt_id) are also
recorded if they exist.
Perl codes were written to perform all these pre-processing operations.
In case of the gene expression data, it was found out that there were a lot of genes
with missing data. So the genes with 50% or more of missing data were not considered.
As a result 516 genes were eliminated leaving a data of 5662 genes.
The pre-processing of the gene expression data was performed using MathWorks
MATLAB® R2006a and Microsoft® Excel 2003.
48
3.3 Sequence similarity scoring
Sequence similarities can be measured through pair-wise global alignments or
local alignments. Homologous protein sequences are usually similar over active domains
and thus share common folds and functions. Therefore, local alignment is a more
appropriate method for comparing protein sequences for their functional similarity. There
are several local alignment schemes for comparing protein sequences, including BLAST
that can be used together with different scoring systems such as BLOSUM62 and
BLOSUM80.
3.3.1 BLAST Algorithm
Fig. 3.1 BLAST Algorithm [40]
49
The BLAST algorithm is a heuristic search method that seeks words of length W (default
= 3 in blastp) that score at least T when aligned with the query and scored with a
substitution matrix. Words in the database that score T or greater are extended in both
directions in an attempt to find a locally optimal ungapped alignment or HSP (high
scoring pair) with a score of at least S or an E value lower than the specified threshold.
HSPs that meet these criteria will be reported by BLAST, provided they do not exceed
the cutoff value specified for number of descriptions and/or alignments to report [40].
The program returns a list of local alignments of certain statistical significance.
However, how to measure the overall similarity of two protein sequences is not obvious.
For example, proteins with two similar domains with certain similarity scores could be
considered to be much more similar than proteins with only one domain with a higher
similarity score. In this study, a novel measure of overall similarity of two protein
sequences is introduced. We utilize the alignment tool for BLASTing two sequences to
obtain the list of optimal local alignments. Let {S1, …,Sn} be scores of a list of best local
alignments with certain statistical significance. Instead of using the highest score
(max{S1, …,Sn}) in the list to measure the overall similarity of the two protein sequences,
we use the following score S to measure the overall similarity of two sequences [41].
S = −∑ ln pi , (1)
i
where pi = 1 − e − E stands for the probability of finding a high-scoring segment pair (HSP)
i
with a local alignment score of at least Si, and Ei is the expected number of HSPs of score
50
at least Si and can be obtained directly from the alignment tool. Assuming the HSPs are
independent of each other, the p-value:
p = ∏ pi = e − S (2)
i
measures the probability of finding a pair of protein sequences with a list of scores at
least {S1, …, Sn}. The corresponding E-value for the overall similarity (the expected
number of the lists that have scores at least {S1, …, Sn}) therefore can be written as:
E = − ln(1 − e − S ) . (3)
We use the p-values and the E-values to measure the overall similarity of a pair of protein
sequences. Since when | x |;
close(INFILE);
open(INFILE,";
close(INFILE);
open(INFILE,";
close(INFILE);
open(INFILE,";
close(INFILE);
77
open(OUTFILE,">$outReportFile") || die("Could not create
$outReportFile"); # Output report
%results=("correct",0,"incorrect",0,"failed",0);
$total=0;
foreach $queryprot(@prot)
{
$total++;
chomp $queryprot;
print "Query # $total\tProcessing query $queryprot\n";
$result=prediction($queryprot);
$results{$result}++;
}
print(OUTFILE "\n\nTotal Predictions : \t$total\nCorrect Predictions :
\t$results{'correct'}\nIncorrect Predictions :
$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");
print("\n\nTotal Predictions : \t$total\nCorrect Predictions :
\t$results{'correct'}\nIncorrect Predictions :
$results{'incorrect'}\nFailed Predictions : \t$results{'failed'}\n");
close(OUTFILE);
#======================================================================
============================
# Subroutine to predict the function of the protein
#======================================================================
============================
sub prediction
{
my($query_protein) = @_;
my($K) = 10;
my(%ClassifierCount)=();
my(%MatchProtein)=();
my($hit)=0;
my($prev_protein)=0;
my($prev_GO)=0;
my(@Matches) = getMatch($query_protein);
foreach $eachMatch(@Matches)
{
if($eachMatch eq "0")
{
print(OUTFILE "$query_protein : PREDICTION FAILED - Invalid
query protein\n");
return "failed";
}
my ($MProtein,$MScore)=split(/\t/,$eachMatch);
$MatchProtein{$MProtein} = "$MScore";
}
my(@keys) = sort {$MatchProtein{$a} $MatchProtein{$b}} (keys
%MatchProtein);
if(scalar @keys $ClassifierCount{$a}} (keys
%ClassifierCount);
@Q_node=getGOnode($query_protein);
foreach $eachQ_node(@Q_node)
{
if($eachQ_node eq "0")
{
print(OUTFILE "$query_protein : PREDICTION FAILED - Query
protein GO node not existing\n");
return "failed";
}
@Q_path=getGOpath($eachQ_node);
foreach $eachQ_path(@Q_path)
{
if($eachQ_path eq "0")
{ last; }
@Q_GOPath = split(/\s/,$eachQ_path);
@Q_GOPath=reverse(@Q_GOPath);
if($keys[0] eq $Q_GOPath[1])
{ $hit=1; }
79
}
}
if($hit==1)
{ print(OUTFILE "$query_protein : PREDICTION CORRECT\n"); return
"correct"; }
else
{ print(OUTFILE "$query_protein : PREDICTION INCORRECT\n"); return
"incorrect"; }
}
#======================================================================
============================
# Subroutine to obtain significantly similar proteins using Score file
#======================================================================
============================
sub getMatch
{
my($query)=@_;
my($check)=0;
my(@Match);
foreach $score_line(@score)
{
chomp $score_line;
if($score_line =~ /$query/)
{
$check =1;
$match_line=$score_line;
$match_line=~s/$query//gi;
push(@Match,"$match_line");
}
}
if($check==0)
{ push(@Match,"0"); }
return @Match;
}
#======================================================================
============================
# Subroutine to obtain GO node terms of the protein from Yeast data
file
#======================================================================
============================
sub getGOnode
{
my($protein)=@_;
my($det)=0;
foreach $data_line(@data)
{
chomp $data_line;
80
if($data_line=~/^AC.*$protein/ && $det==0)
{ $det=1; }
elsif($det==1 && $data_line=~/^GO/)
{
$GO_line=$data_line;
$GO_line=~s/GO:/ GO:/gi;
my(@GOlist)=split(/\s/,$GO_line);
shift(@GOlist);
return @GOlist;
}
else
{ $det=0; }
}
push(@GOlist,"0");
return @GOlist;
}
#======================================================================
============================
# Subroutine to obtain the various GO paths for a GO node term
#======================================================================
============================
sub getGOpath
{
my ($GOnode)=@_;
my (@GOTree)=(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
my ($m)=1;
my ($det)=0;
my ($flag)=0;
my ($currPath)=$GOnode;
my ($currGO)=$GOnode;
$j=0;
$GOTree[0]=$GOnode;
for($i=0;$i<$m;$i++)
{
$currGO=lastterm($GOTree[$i]);
while($currGO ne "GO:0003674")
{
$currPath=$GOTree[$i];
$currGO=lastterm($currPath);
foreach $group_line(@group)
{
chomp $group_line;
if($group_line=~/^GO_id: $currGO/)
{ $flag=1; $det=1; }
elsif($flag==1 && $group_line=~/^is_a/)
{
$flag=0;
@is_a_list=split(/\s/,$group_line);
$num_is_a=scalar(@is_a_list)-2;
$k=$num_is_a;
$m=$m+$k-1;
for($j=0;$j<$k;$j++)
81
{
if($GOTree[$j] eq "0" || $GOTree[$j] eq "$currPath")
{ $GOTree[$j]="$currPath $is_a_list[($num_is_a+2)-($k-
$j)]"; }
else
{ $k++; next; }
}
$currGO=lastterm($GOTree[$i]);
last;
}
}
if($det==0)
{ unshift(@GOTree,"0"); last; }
}
}
for($i=$j;$j<15;$j++)
{ pop @GOTree; }
return @GOTree;
}
#======================================================================
============================
# Subroutine to obtain the last term in any string
#======================================================================
============================
sub lastterm
{
my($str)=@_;
my(@substr)=split(/\s/,$str);
my $numstr=scalar(@substr);
return $substr[$numstr-1];
}
82