In: Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics,
Wiley Interscience (in press, 2005)
Centre for Integrative Bioinformatics (IBIVU)
Faculty of Sciences / Faculty of Earth and Life Sciences
Vrije Universiteit Amsterdam, The Netherlands
Tel: +31 20 4447649
Fax: +31 20 4447653
Protein, domain, domain structure, protein module, domain motion, domain linker,
domain prediction, domain database
Many cellular processes involve proteins comprised of multiple domains. The modular
nature afforded by breaking up a protein’s three-dimensional structure into discrete
domains has many evolutionary advantages. First, it allows proteins to evolve to complex
cooperative functions that require elaborate, large and stable structural scaffolds, which
would be impossible to form otherwise within the structural constraints associated with
the protein folding process. Second, it can bring active sites in proximity in a fixed
stoichiometric ratio. Third, otherwise unstable intermediates can be accommodated
within inter-domain clefts. Fourth, the flexibility afforded by protein domain linker
regions, e.g. hinge regions, can greatly aid the activity of the cooperative catalytic
process carried out by a multi-domain enzyme. Fifth, the repeated and combinatorial use
of domains in multi-domain proteins has provided nature both an evolutionary economic
and versatile system to maintain and enhance cellular activity.
Protein domains are independent or semi-independent folding units of protein structure
that serve as structural building blocks of proteins. Their size varies from 36 amino acids
in the protein E-selectin to 692 residues in lipoxgenase-1 (Jones et al., 1998). Most
domains observed in nature have less than 200 residues, with an average domain size of
about 100 residues. While smaller domains of less than 50 residues are often structurally
reinforced by disulfide bonds or coordinated metal ions, large domains comprising more
than 300 residues are expected to contain multiple hydrophobic cores.
Domains are genetically mobile units, and multi-domain families are found throughout
the three kingdoms (Archaea, Bacteria and Eukarya). The majority of genomic proteins,
75% in unicellular organisms and more than 80% in metazoa, are multi-domain proteins
created as a result of gene duplication events (Apic et al., 2001). Genetic mechanisms
influencing the layout of multi-domain proteins include gross rearrangements such as
inversions, translocations, deletions and duplications, homologous recombination, and
slippage of DNA polymerase during replication (Bork et al., 1992). Although genetically
conceivable, the transition from two single domain proteins to a two-domain protein
requires that both domains fold correctly and, that they accomplish to bury a fraction of
the previously solvent-exposed surface area in a newly generated inter-domain surface,
the size of which is depending on the constraints imposed by the domain linker region
and the scale of domain interaction.
Protein structure is intrinsically hierarchic in its internal organization. It progresses from
secondary structure elements at the lowest level, via supersecondary structures and via
domains to complete proteins and assemblies of proteins at the highest level. At the
domain level, the connectivity of the polypeptide backbone between the domains appears
to be less important, as the protein retains a stable structure irrespective of the sequential
ordering of the domains. Moreover, evolution has led to various forms of domain
arrangements across the protein sequence that can be classified by their connectivity (Das
and Smith, 2000). These arrangements go beyond sequential permutations of complete
domain structures and can affect the internal domain organization. Figure 1 displays the
observed modes of connectivity between domains ordered by the degree of complexity.
Based on their connectivity, domains can be divided into two main classes: continuous
domains, where the associated coding sequence is an uninterrupted stretch of amino
acids, and discontinuous domains, where the coding sequence is interrupted by
subsequences encoding alternative structures, such as inserted domains. An example of a
multi-domain structure with a combination of continous and discontinuous domains is the
3-domain structure Pyruvate kinase, a member of the phosphotransferase family, of
which the structure and connectivity is given in Figure 2.
A protein module has been defined as a continuous domain structure that is structurally
independent enough to serve as a modular plug-in to many multi-domain proteins.
Modules are believed to have spread largely by exon shuffling and are observed in many
different multi-domain proteins with permuted or altogether different domain
combinations. The fact that modules often show the N- and C-termini in close proximity
appears to facilitate their addition to other protein structures, since their insertion in loop
regions would not require gross rearrangements.
A structurally viable mechanism for forming oligomeric domain assemblies is provided
by domain swapping (Bennett et al., 1995). In domain swapping, a secondary or tertiary
element of a monomeric protein is replaced by the same element of another protein.
Domain swapping can range from secondary structure elements to whole structural
domains in bringing together single-domain or multi-domain structures. It also represents
a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric
enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997).
Predicting domain boundaries in multidomain proteins
A large number of structural classification methods have been developed to group the
large number of protein domain folds deposited in the Protein Data Bank (PDB). The
most widely used structural databases are CATH (Orengo et al., 1997), SCOP (Lo Conte
et al., 2000), FSSP (Holm and Sander, 1996) and 3Dee (Siddiqui et al., 2001). Links to
web servers for these databases are provided in Table 1. For each database, a unique
protocol is followed to classify the protein structures at the domain level. Since a
common operational definition of a domain is that of a compact globular substructure
with more interactions within itself than with the rest of the structure (Janin and Wodak,
1983), two characteristics are commonly exploited in boundary prediction methods:
extent of isolation and compactness (Tsai and Nussinov, 1997). Measures of local
compactness have been used in a number of recent domain assignment methods (Holm
and Sander, 1994; Islam et al., 1995; Sowdhamini and Blundell, 1995; Siddiqui and
Barton, 1995; Nicols et al., 1995; Zehfus, 1997; Hinsen et al., 1999; Taylor, 1999; Xu et
al., 2000; Siddiqui et al., 2001; Guo et al., 2003), albeit these approaches generally
cannot handle discontinuous or closely interacting domains very well. As a result, Hadley
and Jones (1999) observed discrepancies between assignments made by the above
structural domain databases.
A large number of profile and HMM methods are available to delineate the domain
structure in a query sequence. These methods make use of domain classifications in
protein domain sequence databases, and use the deposited sequences of each domain
family to derive a profile or HMM. Each profile or HMM is then used to scan the query
sequence, after which the highest scoring ones are used to delineate the putative domains
and their boundaries. Recent developments for these search techniques is the addition of
contextual information such as co-occurrence of domains in a sequence (Coin et al.,
2003) and the taxonomic distribution (Coin et al., 2004).
Although a very difficult task, some ab initio computational methods are also available
for the prediction of domain boundaries from protein sequence information alone. The
method SnapDRAGON (George and Heringa, 2002a) is primarily based on the
hydrophobic collapse during folding and folds a 3D protein structure based on the notion
of conserved hydrophobicity and information from secondary structure prediction
methods. The method takes as input a multiple alignment of the query sequence and a set
of homologues, as well as the predicted secondary structure. It then builds 100 models of
the protein structure, which each are assigned domain boundaries using the
aforementioned method of Taylor (1999). A final prediction of the domain boundaries is
achieved by assessing the consistency of the domain boundaries over the 100 predicted
folds and by determining statistically significant boundaries .
Domain boundary identification by comparative sequence analysis is performed by the
method DOMAINATION (George and Heringa, 2002b), which is based on the homology
search method PSI-BLAST (Altschul et al., 1997) to identify homologues to a query
sequence. The distribution of N- and C-terminal positions of the local alignments
obtained by PSI-BLAST is used to identify putative domain boundaries by a process of
chopping and joining domains and domain segments in an attempt to reconstruct the
evolutionary loss and gain of domains. The method follows an iterative protocol of on-
the-fly cutting and joining of domain fragments and subjecting newly formed fragments
to PSI-BLAST. It is able to recognize continuous and discontinuous domains and
increases the accuracy of PSI-BLAST by about 15%.
Domains in multi-domain structures are likely to have once existed as independent
proteins, and many domains in eukaryotic multi-domain proteins can be found as
independent proteins in prokaryotes (Davidson et al., 1993). As an example of gene
fusion, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the
enzymes glycinamide ribonucleotide synthetase (GARs), aminoimidazole ribonucleotide
synthetase (AIRs), and GAR transformylase (GARt) 1. In insects, the corresponding
polypeptide encodes the four-domain structure GARs-(AIRs)2-GARt. However, GARs-
AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded
separately (Henikoff et al., 1997).
Marcotte et al. (1999) introduced the ‘Rosetta Stone’ computational method for inferring
protein function and interactions from genome sequences based on the observation that
some pairs of interacting proteins have homologs in another organism fused into a single
protein chain. A comparison of sequence homologs from multiple organisms can reveal
these fused sequences, which are called Rosetta Stone sequences because the existence of
these multi-domain proteins implies the putative interactions between the separate protein
pairs in other organisms.
Another genomic method to predict domain function is phylogenetic profiling (Pellegrini
et al., 1999). The method checks for simultaneous presence or absence of pairs of genes
across a wide spectrum of genomes, and thereby reveals sets of proteins that are likely to
have coevolved and thus can be expected to act in the same cellular process.
The protein phylogenetic profiling and the Rosetta Stone methods have been very useful
in predicting the function of large numbers of genes without experimental functional
Domain evolution and gene duplication
The evolution and spread of domains through different protein families is likely to be the
result of gene duplication, leading to one or more paralogous genes. After such
duplication events, the selection pressure becomes less for the paralogous copies of the
gene, so that these are freer to manufacture adaptive functions over time. For more
recently evolved protein domains, a correspondence between the sequential boundaries of
exons and those of protein domains is observed. It is therefore believed that exon
shuffling during evolution has enabled organisms to develop new proteins with new
functions by the addition of exons from other parts of the genome that form new domains
as part of an existing protein.
A multi-domain architecture offers the potential of reuse trough domain repeats. Domain
duplication is essential for the formation of the active sites of enzymes such as in the
aspartic and serine proteinases, porphobilinogen deaminase or the binding sites of
transport proteins such as lactoferrin, transferrin and sulfate binding protein. Domain
repeats also result in duplication of active sites or binding sites such as the calcium
binding sites in calmodulin or the 9 DNA-binding zinc fingers of the transcription factor
Recent methods to delineate repeats in protein sequences are RADAR (Holm and Heger,
2000), REPRO (George and Heringa, 2000; Romein et al., 2003) and TRUST
(Szklarczyk and Heringa, 2004). Taylor et al. (2002) devised a method based on Fourier
analysis to automatically recognize repeated substructures in protein structures.
An increasing number of cases are identified of non-orthologous displacement (Koonin et
al., 1996), where enzymes carrying out an identical function in different organisms
belong to entirely different protein families, and thus are not expected to show any
sequence similarity. Examples of non-orthologous displacement include ornithine
decarboxylase in E. coli and S. serevisiae, where the isozymes speF and speC are
responsible for this function in E. coli and share the same structure comprising three
domains (ornithine decarboxylase N-terminal “wing” domain, PLP-dependent transferase
and ornithine decarboxylase C-terminal domain). The corresponding enzyme spe1 in S.
cerevisiae is a two-domain protein with entirely different domain structures (PLP-binding
barrel and alanine racemase-like domain). A problem with non-orthologous displacement
is that it cannot be recognised by sequence comparison methods, which are aimed at
recognising divergent evolution by mutation and insertions/deletions, including changes
of gene structure by gene fusion or fission. However, the techniques are not able to trace
evolutionary cases of horizontal gene transfer or functional displacement of one gene by
another one within a genome.
Suppose that a set of proteins is studied that correspond to a linear metabolic pathway,
and that all proteins are found for a given organism, except for one corresponding to an
intermediate step. It can then be expected that the missing activity is performed by some
hitherto undetected protein, in which case sequence comparison techniques can lead to
the identification of the missing protein by a homology search using a homologous
domain sequence in another genome that has the appropriate functional annotation. In the
absence of any putative homolog, non-orthologous gene displacement could be
suspected, in which case direct sequence analysis is of no use. However, functional
ontologies, such as GO (Ashburner et al., 2000), can aid the process, in that various non-
orthologous sequences can be identified using a GO functional descriptor. This leads to a
direct candidate whenever a protein domain is identified in the target genome. Otherwise,
if the ontology identifies a putative non-orthologous gene with the desired function in
another genome, a homology search using the latter gene can reveal putative non-
orthologous genes for the intermediate metabolic step in the query organism.
In the case of a prokaryotic organism, the fact can be exploited that gene sequences in
close proximity in the genome tend to be functionally related. Given a confirmed protein
domain sequence associated with the above metabolic pathway, the neighbourhood of its
homologs in complete prokaryotic genomes can be analysed using the STRING server
(Snel et al., 2000). The latter server has been used, for example, for the re-annotation of
the Mycoplasma pneumoniae genome (Dandekar, et al., 2000).
As a result of a homology search operation, for example by using the program PSI-
BLAST (Altschul et al., 1997), regions with significant similarity between otherwise
unrelated proteins in different species can abound. Such so-called homology regions can
be sustained within different sequence contexts, and often add a specific functionality to
their constituent proteins. In some cases, the function of regional homology domains can
be inferred by comparing the properties of the host proteins, when experimental
functional annotation is available. However, also the detection of these domains in
uncharacterized sequences can provide valuable clues for the protein’s function. In other
cases where the biological role of a local homology domain is not known, the delineation
and comparison of these regions might still be useful e.g. for finding conserved residues
as targets for mutagenesis or for structural studies.
Assessing functional specificity of domain families
Predicting or assessing the functional specificity of a given domain sequence is possible
when a set of homologues is available for a query sequence, and the homologues are
suspected to have a roughly identical function but with subtle variations (e.g.
dehydrogenase domain structures acting on different substrates). Insight can then be
gained by aligning the members of the family and deriving a phylogenetic tree. If the
family is organised into a tree that correlates well with the functions described (or
derived) for the proteins, then one can draw relatively reliable conclusions regarding the
query’s function. If the query falls into a clear subgroup of the tree, one can assume as a
function that of the branch. If, however, the functional annotation of the branch holding
the query is heterogeneous, or the query falls in between two branches, then one should
assume a less specific description in accordance with the neighbouring branches (e.g.
“alcohol dehydrogenase”), or resort to the generic functional description of the whole
family (e.g. “dehydrogenase”).
The transfer of function from one protein to another should only be done when the
protein sequences match over their total lengths in the pairwise sequence alignment,
without any large unaligned sequence fragments that might correspond to an additional
domain. If some fragments do not align, one has to carefully check whether the function
that is transferred really corresponds to the part that is common. The use of domain
databases and methods for domain identification (vide supra) can be of great help at this
If a query sequence contains a domain that is not matched in the pairwise alignment, this
should be noted. Conversely, if the protein used as source of the annotation contains an
unmatched domain, the annotation should just state the presence of the domain(s) of the
query without transferral of the complete functional description of the source.
Protein sequence clustering
Protein sequence cluster databases group related proteins together and are derived
automatically from protein sequence databases using different clustering algorithms.
Each database uses a different clustering algorithm aimed to group the sequences
generally in homologous families and superfamilies. Assignment of a query sequence
without annotation to any of the clusters can provide valuable clues about the protein’s
function. Since they are derived automatically from protein sequence databases without
manual crafting and validation of family discriminators, these databases are relatively
comprehensive, although the biological relevance of clusters can be ambiguous and can
sometimes be an artifact of particular thresholds. Multidomain protein sequences are a
particular source of error, in that the clustering algorithm can establish a link between
two sequences A and B based on a particular domain, while a third sequence C might be
linked to sequence B based on another domain sequence. If unchecked, this would lead to
an incorrect clustering of sequences A, B and C. Table 1 provides a number of links to
websites of protein sequence clustering databases.
A large and growing body of evidence suggests that proteins involved in the regulation of
cellular events such as the cell cycle, signal transduction, protein trafficking, targeted
proteolysis, cytoskeletal organization and gene expression are typically multi-domain
proteins comprising a catalytic domain and one or more interaction domains. Interaction
domains induce the interaction of signaling polypeptides and specific multi-protein
complexes. For example, they can couple a cell surface receptor to an intracellular
biochemical pathway that controls the cellular responses to an external signal. Interaction
domains can be involved in a series of protein-protein interactions, designed to localise
signaling proteins to appropriate subcellular locations or to determine the enzyme
specificity. A well studied example is the family of protein kinases in relation to their
substrates. Typically, protein-protein interaction domains are relatively small and
independently folding modules of 35-150 amino acids, that are able to fold in isolation
from their host proteins while retaining their intrinsic ability to bind their physiological
partners. Their N- and C-termini are usually close together in space, while their ligand-
binding surface lies on the opposite face of the domain. This arrangement facilitates their
evolutionary use as a modular plug-in since it allows the domain to be easily inserted into
a host protein while projecting its ligand-binding site to contact other polypeptides.
Protein-protein interaction domains can be clustered into separate families, based either
on their sequence or ligand-binding properties. Interaction domains are often used
repeatedly by a large number of multi-domain proteins that display very different domain
structures, where they mediate a particular type of molecular recognition.
1. An important class of interaction domains recognises peptides that have been
post-translationally modified. These modifications include phosphorylation,
acetylation or methylation. The dynamic control of cellular behavior exerted by
covalent protein modifications therefore appears largely mediated by interaction
domains that regulate the associations of signaling protein cascades.
i. An important member of this class is the SH2 domain, which is
anticipated to be encoded in the human genome at least 120 times.
A large number of cytoplasmic proteins contain one or two SH2
domains that directly recognize phosphotyrosine-containing motifs
that are found for example on surfaces of activated receptors for
cytokines, growth factors and antigens. SH2 domains all recognize
phosphotyrosines, but have different preferences for amino acids
immediately following the phosphorylated residue, allowing a
degree of specificity in signaling by tyrosine kinases.
ii. Phosphotyrosine-containing motifs are also recognized by the PTB
domain, a very different structure than the SH2 domain. PTB
domains are found as constituents of binding proteins such as the
IRS-1 substrate of the insulin receptor.
iii. Phosphoserine and phosphothreonine motifs are targeted by a
growing family of interaction domains, including FHA domains,
14-3-3 proteins and WD40-repeat domains. Each of the
subfamilies and proteins contained therein bind specific
phosphoserine/threonine motifs, and thus mediate the biological
activities of protein-serine/threonine kinases.
b. In addition to phosphorylation, acetylation and methylation have recently
been implicated in inducing modular protein-protein interactions. For
example, acetylation or methylation of lysine residues on the surfaces of
histone structures respectively create effective binding sites for Bromo and
Chromo domain constituents of multi-domain proteins involved in
2. A large group of interaction domains bind proline-rich motifs, including SH3,
WW and EVH1 domains. The complexes that are recognised by these domains
are therefore not dependent on post-translational modifications, and hence are
more stable partners than those that, for example, need to be phosporylated to
enable binding by SH2 domains. SH3 domains interact with a large number of
different domain types and are therefore referred to as "promiscuous" domains.
3. PDZ domains bind the C-termini of their interaction partner domains, ion
channels and receptors, in a fashion that appears important for the localization of
their targets to particular subcellular sites, as well as for downstream signaling. In
addition to binding short C-terminal peptide motifs, PDZ domains can form
4. In addition to binding specific peptide motifs, a growing number of interaction
modules are known to recognize a variety of phospholipids, particularly
phosphoinositides (PI). For example, PH domains are constituents of lipid kinases
and phosphatases, where they affect cellular functioning by binding
promiscuously to either the phosphoinositide PI-4,5-P2 or PI-3,4,5-P3. Based on
their phospholipid-binding, PH domains can localise signaling proteins to specific
subregions of the plasma membrane, where they regulate the enzymatic activities
of their host proteins.
A predominant feature of interaction domains is their apparent versatility. Though PTB
domains were originally discovered based on their ability to bind phosphorylated
tyrosines within Asn-Pro-X-Tyr (NPXY) motifs, many PTB domains appear to be able
also to recognize unphosphorylated NPXY-related peptides. It therefore appears that PTB
domains originally evolved to bind unphosphorylated peptides, but have subsequently
developed a capacity to recognize phosphotyrosine in relatively few specific cases.
Furthermore, some individual PTB domains, e.g. those within the FRS-2 and Numb
proteins, have evolved to recognize two quite different peptide ligands.
Although many interaction domains bind very different substrates, a number of them
have similar tertiary structures, such as for example the PTB, EVH1 and PH domains,
which respectively recognize phosphotyrosine-containing peptide motifs, proline-rich
peptides and phospholipids (vide supra). The PTB/PH/EVH1 fold is shared by other
types of interaction domains as well. It seems therefore that this domain fold provides a
stable framework that can be used for modulating very many different intermolecular
interactions. On the other hand, many signaling proteins contain a number of different
interaction domains to yield a multi-domain protein that can mediate multiple protein-
protein and protein-phospholipid interactions. This modular organization of signaling
proteins can target proteins to the appropriate site within the cell, and direct binding to
cell surface receptors and downstream targets.
A number of bioinformatics approaches have been developed for the prediction of
protein–protein interactions based on the correspondence between the phylogenetic trees
of interacting proteins (Goh et al., 2000; Pazos and Valencia, 2001) and detection of
correlated mutations between pairs of proteins (Pollock and Taylor, 1997; Pazos et al.,
1997: Pollock et al., 1999).
Spacer domains and domain linkers
A number of protein domains are thought to have a purely structural role, allowing
functional domains to be advantageously situated or providing the overall protein with
the necessary flexibility. It must be kept in mind that a domain may be considered to have
a purely structural role simply because knowledge about its true function is missing. The
immunoglobulin (Ig) family provides examples of this, as the Ig constant heavy 1 and 2
domains and the Ig constant light domain act as “spacer” domains to allow the Ig variable
light and heavy domains to function optimally.
Multi-functional enzymes are generally composed of a number of discrete domains that
are connected by inter-domain linkers. A number of site-directed mutagenesis
experiments have shown that linkers play a crucial role in the enzymatic activity of multi-
domain complexes (George and Heringa, 2003). In general, mutating or altering the
length of linkers has been shown to affect protein stability, folding rates and domain-
domain orientation. The linkers provide dynamic structural bridges for the movement of
ligands across the modules. Many linkers act as rigid spacers separating two domains,
and provide a scaffold to prevent unfavorable interactions between folding domains.
George and Heringa (2003) found that the two main types of linker in natural proteins:
helical linkers and non-helical linkers. Helical linkers derive their rigidity from their
helical conformation, while non-helical linkers are rich in proline residues, which also
leads to structural rigidity. They should be flexible, keeping domains apart while
allowing them to move as part of their catalytic function. Linkers should furthermore be
invulnerable to host proteases, as they are often the targets for degradation.
Domain motions are important for a large repertoire of functions including ligand
binding, catalysis, regulation, formation of protein assemblies and transport of
metabolites (Gerstein et al., 1994). Often, the binding site is situated at the interface of a
pair of domains in the ‘closed’ conformation, corresponding to the inactive form of the
associated multi-domain enzyme, while the ‘open’ conformation of the domains exposes
the binding site and thus provides the active form of the enzyme. Domain motions are
often responsible for the mechanism of ‘induced fit’ observed in protein-protein
Protein flexibility often corresponds to large relative movements of domains. In aspartic
proteinases, for example, the two main domains forming the lobes of the structure show
large relative shifts that are crucial for the activity of the constituent enzyme (Sali et al.,
1992). Another example is the so-called swivelling domain in pyruvate phosphate
dikinase (Herzberg et al., 1996), which brings an intermediate enzymatic product over
about 45 Å from the active site of one domain to that of another. Such movements have
been determined by a conformational analysis of protein structures, where the
identification of for example hinge axes and their corresponding rotation angles permits a
useful representation of protein domain movements (Gerstein et al., 1994).
Three main mechansisms of domain movements are discernible (Gerstein et al., 1994):
1. Intrinsic flexibility: These movements leads to deformations of the domain
structure. Several small movements in secondary structures which have a
cumulative effect. These motions are due to small changes in backbone torsion
angles of strands and/or helices, kinks involving prolines in helices as in
adenylate kinase (Gerstein et al., 1994), or interconversion of helix and extended
conformations as in calmodulin (Ikura et al. 1992).
2. Hinged domain movements: Of special interest are hinge-bending movements,
where rigid domains are connected by flexible joints which tether the domains
and constrain their movement. Hinge-bending is believed to allow an induced fit
of molecular surfaces in protein assembly and ligand docking. These movements
are caused by rotations of a domain around a pivotal point, often afforded by a
linker region in between domains that changes its conformation to allow the
rotation. For example, to attain icosahedral symmetry, the S and P domains of
tobacco bushy stunt virus perform hinge movement mainly by dihedral angle
changes in the peptide that forms a linker between the two domains (Olsen et al.,
1983). In lactoferrin, the open and closed forms of the two domains involve a 53
degree rotation which is achieved by conformational changes within two strands
that connect the protein’s two domains (Jameson et al., 1998).
3. Ball-and-socket motion: The variable (light and heavy chain) domains of
antibodies rotate about 50 degrees with respect to the constant (light and heavy
chain) domains by means of a combination of hinge and shear motions (Lesk and
The method HINGEFIND (Wriggers and Schulten, 1997) identifies domain movements
and characterizes and visualizes the effective rotation axes (hinges). The program
compares a pair of known structures (e.g. two different crystal structures of a protein or
the results of molecular dynamics or Monte Carlo simulations) and partitions the protein
with a prespecified tolerance in preserved subdomains. It then determines effective
rotation axes which characterize the domain movements with respect to the reference
domain ("rigid core"). Hayward and Berendsen (1998) developed a method to analyse
comparative domain movements over larger numbers of structures, which is useful, for
example, for studying trajectories of molecular dynamics simulations.
The analysis and prediction methods that elucidate the evolutionary development and
function of domains and multi-domain complexes have recently been enriched by the
integration of genomic and dynamic contextual information. Many successes are to be
derived from the ongoing developments in spawning an integration web between
genomics, phylogenetic, sequence, structure, structural dynamics, and function data. This
will shed more light on the evolution of the structural and functional versatility of protein
domains, which will benefit biotechnical and biomedical applications.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25:3389-3402.
Apweiler R et al. (2000) InterPro - an integrated documentation resource for protein
families, domains and functional sites. Bioinformatics. 16, 1145-1150.
Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti
L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H,
Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder
NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM. (2001). The InterPro
database, an integrated documentation resource for protein families, domains and
functional sites. Nucleic Acids Res. 29, 37–40.
Ashburner M et al. (2000) Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nature Genet., 25, 25-29.
Bennet MJ, Schlunegger MP, Eisenberg D (1995) 3D domain swapping: a mechanism for
oligomeric assembly. Protein sci., 4, 2455-2468.
Coin L, Bateman A and Durbin R (2003) Enhanced protein domain discovery by using
language modeling techniques from speech recognition. Proc. Natl. Acad. Sci. USA, 100,
Coin L, Bateman A and Durbin R (2004) Enhanced protein domain discovery using
taxonomy. BMC Bioinformatics, 5, 56-65.
Dandekar T, Huynen M, Regula JT, Zimmermann CU, Ueberle B, Andrade MA, Doerks
T, Sánchez-Pulido L, Snel B, Suyama M, Yuan YP, Herrmann R and Bork P (2000) Re-
annotating the Mycoplasma pneumoniae genome sequence: adding value, function and
reading frames. Nucleic Acids Research, 28, 3278-3288.
Das S and Smith T (2000) Identifying nature’s protein LEGO set. Adv. Protein Chem.,
Davidson J, Chen K, Jamison R, Musmanno L and Kern C (1993) The evolutionary
history of the first three enzymes in pyrimidine biosynthesis, Bioessays, 15, 157-164.
George RA and Heringa J (2000) The REPRO server: finding protein internal sequence
repeats through the Web. Trends in Biochem. Sci. 25, 515-517.
George RA and Heringa J (2002) Protein domain identification and improved sequence
similarity searching using PSI-BLAST, Proteins Struct. Func. Gen. 48, 672-681.
George RA and Heringa J (2002) SnapDRAGON - a method to delineate protein
structural domains from sequence data, J. Mol. Biol. 316, 839-851.
George RA and Heringa J (2003) An analysis of protein domain linkers: their
classification and role in protein folding, Prot. Engineer. 15, 871-879.
George RA, Kleinjung J and Heringa J (2003) Predicting protein structural domains from
sequence data. In: Bioinformatics and Genomes - Current Perspectives. Pp. 1-26. Horizon
Scientific Press, Norfolk, England. ISBN 1-898486-47-6.
Gerstein M, Lesk A and Chothia C (1994). Structural mechanisms for domain
movements in proteins. Biochemistry 33, 6739-6749.
Goh CS, et al. (2000). Co-evolution of proteins with their interaction partners. J. Mol.
Biol. 299, 283–293.
Guo JT, Xu D, Kim D and Xu Y (2003) Improving the performance of DomainParser for
structural domain partition using neural network, Nucl. Acids Res., 31, 944-952.
Hadley C and Jones D (1999) A systematic comparison of protein structure
classifications: SCOP, CATH and FSSP. Structure Fold. Des., 7, 1099-1112.
Hayward S and Berendsen HJC (1998) Systematic Analysis of Domain Motions in
Proteins from Conformational Change; New Results on Citrate Synthase and T4
Lysozyme, Protein, Struct. Func. Genet., 30, 144-154.
Heger A and Holm L (2000) Automatic detection and alignment of repeats in protein
sequences. Prot. Struct. Func. Genet., 41, 224-237
Henikoff S, Greene E, Pietrokovski S, Bork P, Attwood T and Hood L (1997) Gene
families: the taxonomy of protein paralogs and chimeras, Science, 278, 609-614.
Heringa J and Taylor WR (1997) Three-dimensional domain duplication, swapping and
stealing. Curr. Opin. Struct. Biol., 7, 416-421.
Hinsen K, Thomas A, Field MJ (1999) Analysis of domain motions in large proteins,
Prot. Struct. Func. Genet., 34, 369-382.
Holm L and Sander C (1994) Parser for protein folding units. Proteins Struct. Func.
Genet., 19, 231-234.
Holm L and Sander C (1998) Dictionary of recurrent domains in protein structures. Prot.
Struct. Func. Genet., 33, 88-96.
Ikura M, Clore GM, Gronenborn AM, Zhu G, Klee CB and Bax A (1992) Solution
Structure of a Calmodulin-Target Peptide Complex by Multidimensional NMR. Science
Islam S, Luo J and Sternberg M (1995) Identification and analysis of domains in proteins.
Prot. Engin., 8, 513-525.
Jameson GB, Anderson BF, Norris GE., Thomas DH, Baker BN (1998) Structure of
human apolactoferrin at 2.0-A resolution. Refinement and analysis of ligand-induced
conformational change, Acta Crystallogr. D, 54, 1319-1335.
Janin J and Wodak S (1983) Structural domains in proteins and their role in the dynamics
of protein function. Prog. Biophys. Molec. Biol. , 42, 21-78.
Koonin EV, Mushegian AR, Bork P (1996). Non-orthologous gene displacement. Trends
Genet., 12, 334-336.
Lesk, AM and Chothia C (1988) Elbow motion in the immunoglobulins involves a
molecular ball-and-socket joint Nature, 355, 188-190.
Lo Conte L, Ailey B, Hubbard TJ, Chothia C (2000) A structural classification of
proteins database, Nucl.Acids Res., 28, 257-259.
Marcotte EM, Pellegrini M, Ng H-L, Rice DW, Yeates T and Eisenberg D (1999)
Detecting Protein Function and Protein-Protein Interactions from Genome Sequences,
Science, 285, 751-753.
Nicols WL, Rose GD, Ten Eyck LF and Zimm BH (1995) Rigid domains in proteins: an
algorithmic approach to their identification, Prot. Struct. Func. Genet., 23, 38-48.
Olsen AJ, Bricogne G and Harrison SC (1983). Structure of tomato bushy stunt virus IV.
The virus particle at 2.9 A resolution J. Mol. Biol. 171: 61-93.
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB and Thornton JM (1997)
CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093-1108.
Pazos F, Valencia A (2001). Similarity of phylogenetic trees as indicator of protein-
protein interaction. Protein Eng. 14, 609–614.
Pazos F, et al. (1997). Correlated mutations contain information about protein-protein
interaction. J. Mol. Biol. 271, 511–523.
Pellegrini M., Marcotte EM, Thompson MJ, Eisenberg D and Yeates TO (1999)
Functions by Comparative Genome Analysis: Protein Phylogenetic Profiles, Proc. Natl.
Acad. Sci. USA, 96, 4285-4288.
Pollock, DD and Taylor WR (1997) Effectiveness of correlation analysis in identifying
protein residues undergoing correlated evolution. Prot. Engin., 10, 647-657.
Pollock, DD, Taylor WR and Goldman N (1999) Coevolving protein residues: maximum
likelihood identefication and relationship to structure. J. Mol. Biol., 287, 187-198.
Romein JW, Heringa J, and Bal HE (2003) A million-fold speed improvement in
genomic repeats detection, SuperComputing'03, Phoenix, AZ, November, 2003.
Schultz J, Copley RR, Doerks T, Ponting CP, and Bork P (2000) SMART: a web-based
tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231-234
Sidiqui A and Barton G (1995) Continuous and discontinuous domains – an algorithm for
the automatic generation of reliable protein domain definitions. Prot. Sci., 4, 872-884.
Sidiqui AS, Dengler U and Barton GJ (2001) 3Dee: a database of protein structural
domains. Bioinformatics, 17, 200-201.
Snel B, Lehmann G, Bork P, and Huynen MA (2000) STRING: A web-server to retrieve
and display the conserved neighbourhood of genes. Nucleic Acids Res., 28, 3442-3444.
Sowdhamini R and Blundell TL (1995) An automatic method involving cluster analysis
of secondary structures for the identification of domains in proteins, Prot. Sci., 4, 506-
Szklarczyk R and Heringa J (2004) Tracking repeats using significance and transitivity.
Bioinformatics 20 Suppl. 1, i311-i317.
Tsai CJ and Nussinov R (1997) Hydrophobic folding units derived from dissimilar
monomer structures and their interactions. Protein Sci., 6, 24-42.
Taylor WR (1999) Protein structure domain identification. Prot. Engin. 12, 203-216.
Taylor WR, Heringa J, Baud F and Flores TP (2002) A Fourier analysis of symmetry in
proteins, Protein Engin. 15, 79-89.
Wriggers W and Schulten K (1997) Protein Domain Movements: Detection of Rigid
Domains and Visualization of Effective Rotations in Comparisons of Atomic
Coordinates. Proteins Struct., Funct. Genet., 29, 1-14.
Xu Y, Xu D, Gabow HN (2000) Proetin domain-decomposition using a graph-theoretic
approach, Bioinformatics, 16, 1091-1104.
Zehfus M (1997) Identification of compact, hydrophobically stabilized domains and
modules containing multiple peptide chains. Protein Sci., 6, 1210-1219.
Table 1. WWW addresses of domain-related databases of interest
Sequence-based classification databases:
ProtFam List http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html
Structural classification databases:
CATH http://www.biochem.ucl.ac.uk/bsm/cath new/index.html
Protein sequence clustering databases:
Domain linker databases:
Protein domain motions databases:
Figure 1. Schematic illustration of different types of backbone connectivity in
multidomain protein structures (right) and their coding peptide sequences (left). Note that
continuous domains are encoded by uninterrupted sequences, whereas discontinuous
domain sequences show inserted sequence fragments.
Figure 2. Pyruvate kinase structure and domain connectivity. (a) The structure shows
three domains: domain 1 (discontinuous) is a barrel regulatory domain; domain 2 is a
discontinuous / barrel catalytic substrate binding domain; and domain 3 corresponds to
a continuous / nucleotide binding domain. (b) Schematic representation of the
corresponding amino acid sequence, showing the segmental arrangement of the three
domains. The intercalated and interlaced structures have N- and C-termini indicated
(designated ‘N’ and ‘C’, respectively) for clarity.
= linker peptide
12 41 42 115 116 218 219 384 385 529