Document Sample
domain Powered By Docstoc
					                            Protein Domains

In: Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics,

Wiley Interscience (in press, 2005)

Jaap Heringa

Centre for Integrative Bioinformatics (IBIVU)

Faculty of Sciences / Faculty of Earth and Life Sciences

Vrije Universiteit Amsterdam, The Netherlands


Tel: +31 20 4447649

Fax: +31 20 4447653


Protein, domain, domain structure, protein module, domain motion, domain linker,

domain prediction, domain database


Many cellular processes involve proteins comprised of multiple domains. The modular

nature afforded by breaking up a protein’s three-dimensional structure into discrete

domains has many evolutionary advantages. First, it allows proteins to evolve to complex

cooperative functions that require elaborate, large and stable structural scaffolds, which

would be impossible to form otherwise within the structural constraints associated with

the protein folding process. Second, it can bring active sites in proximity in a fixed

stoichiometric ratio. Third, otherwise unstable intermediates can be accommodated

within inter-domain clefts. Fourth, the flexibility afforded by protein domain linker

regions, e.g. hinge regions, can greatly aid the activity of the cooperative catalytic

process carried out by a multi-domain enzyme. Fifth, the repeated and combinatorial use

of domains in multi-domain proteins has provided nature both an evolutionary economic

and versatile system to maintain and enhance cellular activity.


Protein domains are independent or semi-independent folding units of protein structure

that serve as structural building blocks of proteins. Their size varies from 36 amino acids

in the protein E-selectin to 692 residues in lipoxgenase-1 (Jones et al., 1998). Most

domains observed in nature have less than 200 residues, with an average domain size of

about 100 residues. While smaller domains of less than 50 residues are often structurally

reinforced by disulfide bonds or coordinated metal ions, large domains comprising more

than 300 residues are expected to contain multiple hydrophobic cores.

Domains are genetically mobile units, and multi-domain families are found throughout

the three kingdoms (Archaea, Bacteria and Eukarya). The majority of genomic proteins,

75% in unicellular organisms and more than 80% in metazoa, are multi-domain proteins

created as a result of gene duplication events (Apic et al., 2001). Genetic mechanisms

influencing the layout of multi-domain proteins include gross rearrangements such as

inversions, translocations, deletions and duplications, homologous recombination, and

slippage of DNA polymerase during replication (Bork et al., 1992). Although genetically

conceivable, the transition from two single domain proteins to a two-domain protein

requires that both domains fold correctly and, that they accomplish to bury a fraction of

the previously solvent-exposed surface area in a newly generated inter-domain surface,

the size of which is depending on the constraints imposed by the domain linker region

and the scale of domain interaction.

Protein structure is intrinsically hierarchic in its internal organization. It progresses from

secondary structure elements at the lowest level, via supersecondary structures and via

domains to complete proteins and assemblies of proteins at the highest level. At the

domain level, the connectivity of the polypeptide backbone between the domains appears

to be less important, as the protein retains a stable structure irrespective of the sequential

ordering of the domains. Moreover, evolution has led to various forms of domain

arrangements across the protein sequence that can be classified by their connectivity (Das

and Smith, 2000). These arrangements go beyond sequential permutations of complete

domain structures and can affect the internal domain organization. Figure 1 displays the

observed modes of connectivity between domains ordered by the degree of complexity.

Based on their connectivity, domains can be divided into two main classes: continuous

domains, where the associated coding sequence is an uninterrupted stretch of amino

acids, and discontinuous domains, where the coding sequence is interrupted by

subsequences encoding alternative structures, such as inserted domains. An example of a

multi-domain structure with a combination of continous and discontinuous domains is the

3-domain structure Pyruvate kinase, a member of the phosphotransferase family, of

which the structure and connectivity is given in Figure 2.

A protein module has been defined as a continuous domain structure that is structurally

independent enough to serve as a modular plug-in to many multi-domain proteins.

Modules are believed to have spread largely by exon shuffling and are observed in many

different multi-domain proteins with permuted or altogether different domain

combinations. The fact that modules often show the N- and C-termini in close proximity

appears to facilitate their addition to other protein structures, since their insertion in loop

regions would not require gross rearrangements.

A structurally viable mechanism for forming oligomeric domain assemblies is provided

by domain swapping (Bennett et al., 1995). In domain swapping, a secondary or tertiary

element of a monomeric protein is replaced by the same element of another protein.

Domain swapping can range from secondary structure elements to whole structural

domains in bringing together single-domain or multi-domain structures. It also represents

a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric

enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997).

Predicting domain boundaries in multidomain proteins

A large number of structural classification methods have been developed to group the

large number of protein domain folds deposited in the Protein Data Bank (PDB). The

most widely used structural databases are CATH (Orengo et al., 1997), SCOP (Lo Conte

et al., 2000), FSSP (Holm and Sander, 1996) and 3Dee (Siddiqui et al., 2001). Links to

web servers for these databases are provided in Table 1. For each database, a unique

protocol is followed to classify the protein structures at the domain level. Since a

common operational definition of a domain is that of a compact globular substructure

with more interactions within itself than with the rest of the structure (Janin and Wodak,

1983), two characteristics are commonly exploited in boundary prediction methods:

extent of isolation and compactness (Tsai and Nussinov, 1997). Measures of local

compactness have been used in a number of recent domain assignment methods (Holm

and Sander, 1994; Islam et al., 1995; Sowdhamini and Blundell, 1995; Siddiqui and

Barton, 1995; Nicols et al., 1995; Zehfus, 1997; Hinsen et al., 1999; Taylor, 1999; Xu et

al., 2000; Siddiqui et al., 2001; Guo et al., 2003), albeit these approaches generally

cannot handle discontinuous or closely interacting domains very well. As a result, Hadley

and Jones (1999) observed discrepancies between assignments made by the above

structural domain databases.

A large number of profile and HMM methods are available to delineate the domain

structure in a query sequence. These methods make use of domain classifications in

protein domain sequence databases, and use the deposited sequences of each domain

family to derive a profile or HMM. Each profile or HMM is then used to scan the query

sequence, after which the highest scoring ones are used to delineate the putative domains

and their boundaries. Recent developments for these search techniques is the addition of

contextual information such as co-occurrence of domains in a sequence (Coin et al.,

2003) and the taxonomic distribution (Coin et al., 2004).

Although a very difficult task, some ab initio computational methods are also available

for the prediction of domain boundaries from protein sequence information alone. The

method SnapDRAGON (George and Heringa, 2002a) is primarily based on the

hydrophobic collapse during folding and folds a 3D protein structure based on the notion

of conserved hydrophobicity and information from secondary structure prediction

methods. The method takes as input a multiple alignment of the query sequence and a set

of homologues, as well as the predicted secondary structure. It then builds 100 models of

the protein structure, which each are assigned domain boundaries using the

aforementioned method of Taylor (1999). A final prediction of the domain boundaries is

achieved by assessing the consistency of the domain boundaries over the 100 predicted

folds and by determining statistically significant boundaries .

Domain boundary identification by comparative sequence analysis is performed by the

method DOMAINATION (George and Heringa, 2002b), which is based on the homology

search method PSI-BLAST (Altschul et al., 1997) to identify homologues to a query

sequence. The distribution of N- and C-terminal positions of the local alignments

obtained by PSI-BLAST is used to identify putative domain boundaries by a process of

chopping and joining domains and domain segments in an attempt to reconstruct the

evolutionary loss and gain of domains. The method follows an iterative protocol of on-

the-fly cutting and joining of domain fragments and subjecting newly formed fragments

to PSI-BLAST. It is able to recognize continuous and discontinuous domains and

increases the accuracy of PSI-BLAST by about 15%.

Domain fusion

Domains in multi-domain structures are likely to have once existed as independent

proteins, and many domains in eukaryotic multi-domain proteins can be found as

independent proteins in prokaryotes (Davidson et al., 1993). As an example of gene

fusion, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the

enzymes glycinamide ribonucleotide synthetase (GARs), aminoimidazole ribonucleotide

synthetase (AIRs), and GAR transformylase (GARt) 1. In insects, the corresponding

polypeptide encodes the four-domain structure GARs-(AIRs)2-GARt. However, GARs-

AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded

separately (Henikoff et al., 1997).

Marcotte et al. (1999) introduced the ‘Rosetta Stone’ computational method for inferring

protein function and interactions from genome sequences based on the observation that

some pairs of interacting proteins have homologs in another organism fused into a single

protein chain. A comparison of sequence homologs from multiple organisms can reveal

these fused sequences, which are called Rosetta Stone sequences because the existence of

these multi-domain proteins implies the putative interactions between the separate protein

pairs in other organisms.

Another genomic method to predict domain function is phylogenetic profiling (Pellegrini

et al., 1999). The method checks for simultaneous presence or absence of pairs of genes

across a wide spectrum of genomes, and thereby reveals sets of proteins that are likely to

have coevolved and thus can be expected to act in the same cellular process.

The protein phylogenetic profiling and the Rosetta Stone methods have been very useful

in predicting the function of large numbers of genes without experimental functional


Domain evolution and gene duplication

The evolution and spread of domains through different protein families is likely to be the

result of gene duplication, leading to one or more paralogous genes. After such

duplication events, the selection pressure becomes less for the paralogous copies of the

gene, so that these are freer to manufacture adaptive functions over time. For more

recently evolved protein domains, a correspondence between the sequential boundaries of

exons and those of protein domains is observed. It is therefore believed that exon

shuffling during evolution has enabled organisms to develop new proteins with new

functions by the addition of exons from other parts of the genome that form new domains

as part of an existing protein.

A multi-domain architecture offers the potential of reuse trough domain repeats. Domain

duplication is essential for the formation of the active sites of enzymes such as in the

aspartic   and serine proteinases, porphobilinogen deaminase or the binding sites of

transport proteins such as lactoferrin, transferrin and sulfate binding protein. Domain

repeats also result in duplication of active sites or binding sites such as the calcium

binding sites in calmodulin or the 9 DNA-binding zinc fingers of the transcription factor


Recent methods to delineate repeats in protein sequences are RADAR (Holm and Heger,

2000), REPRO (George and Heringa, 2000; Romein et al., 2003) and TRUST

(Szklarczyk and Heringa, 2004). Taylor et al. (2002) devised a method based on Fourier

analysis to automatically recognize repeated substructures in protein structures.

Non-orthologous displacement

An increasing number of cases are identified of non-orthologous displacement (Koonin et

al., 1996), where enzymes carrying out an identical function in different organisms

belong to entirely different protein families, and thus are not expected to show any

sequence similarity. Examples of non-orthologous displacement include ornithine

decarboxylase in E. coli and S. serevisiae, where the isozymes speF and speC are

responsible for this function in E. coli and share the same structure comprising three

domains (ornithine decarboxylase N-terminal “wing” domain, PLP-dependent transferase

and ornithine decarboxylase C-terminal domain). The corresponding enzyme spe1 in S.

cerevisiae is a two-domain protein with entirely different domain structures (PLP-binding

barrel and alanine racemase-like domain). A problem with non-orthologous displacement

is that it cannot be recognised by sequence comparison methods, which are aimed at

recognising divergent evolution by mutation and insertions/deletions, including changes

of gene structure by gene fusion or fission. However, the techniques are not able to trace

evolutionary cases of horizontal gene transfer or functional displacement of one gene by

another one within a genome.

Suppose that a set of proteins is studied that correspond to a linear metabolic pathway,

and that all proteins are found for a given organism, except for one corresponding to an

intermediate step. It can then be expected that the missing activity is performed by some

hitherto undetected protein, in which case sequence comparison techniques can lead to

the identification of the missing protein by a homology search using a homologous

domain sequence in another genome that has the appropriate functional annotation. In the

absence of any putative homolog, non-orthologous gene displacement could be

suspected, in which case direct sequence analysis is of no use. However, functional

ontologies, such as GO (Ashburner et al., 2000), can aid the process, in that various non-

orthologous sequences can be identified using a GO functional descriptor. This leads to a

direct candidate whenever a protein domain is identified in the target genome. Otherwise,

if the ontology identifies a putative non-orthologous gene with the desired function in

another genome, a homology search using the latter gene can reveal putative non-

orthologous genes for the intermediate metabolic step in the query organism.

In the case of a prokaryotic organism, the fact can be exploited that gene sequences in

close proximity in the genome tend to be functionally related. Given a confirmed protein

domain sequence associated with the above metabolic pathway, the neighbourhood of its

homologs in complete prokaryotic genomes can be analysed using the STRING server

(Snel et al., 2000). The latter server has been used, for example, for the re-annotation of

the Mycoplasma pneumoniae genome (Dandekar, et al., 2000).

Homology regions

As a result of a homology search operation, for example by using the program PSI-

BLAST (Altschul et al., 1997), regions with significant similarity between otherwise

unrelated proteins in different species can abound. Such so-called homology regions can

be sustained within different sequence contexts, and often add a specific functionality to

their constituent proteins. In some cases, the function of regional homology domains can

be inferred by comparing the properties of the host proteins, when experimental

functional annotation is available. However, also the detection of these domains in

uncharacterized sequences can provide valuable clues for the protein’s function. In other

cases where the biological role of a local homology domain is not known, the delineation

and comparison of these regions might still be useful e.g. for finding conserved residues

as targets for mutagenesis or for structural studies.

Assessing functional specificity of domain families

Predicting or assessing the functional specificity of a given domain sequence is possible

when a set of homologues is available for a query sequence, and the homologues are

suspected to have a roughly identical function but with subtle variations (e.g.

dehydrogenase domain structures acting on different substrates). Insight can then be

gained by aligning the members of the family and deriving a phylogenetic tree. If the

family is organised into a tree that correlates well with the functions described (or

derived) for the proteins, then one can draw relatively reliable conclusions regarding the

query’s function. If the query falls into a clear subgroup of the tree, one can assume as a

function that of the branch. If, however, the functional annotation of the branch holding

the query is heterogeneous, or the query falls in between two branches, then one should

assume a less specific description in accordance with the neighbouring branches (e.g.

“alcohol dehydrogenase”), or resort to the generic functional description of the whole

family (e.g. “dehydrogenase”).

The transfer of function from one protein to another should only be done when the

protein sequences match over their total lengths in the pairwise sequence alignment,

without any large unaligned sequence fragments that might correspond to an additional

domain. If some fragments do not align, one has to carefully check whether the function

that is transferred really corresponds to the part that is common. The use of domain

databases and methods for domain identification (vide supra) can be of great help at this


If a query sequence contains a domain that is not matched in the pairwise alignment, this

should be noted. Conversely, if the protein used as source of the annotation contains an

unmatched domain, the annotation should just state the presence of the domain(s) of the

query without transferral of the complete functional description of the source.

Protein sequence clustering

Protein sequence cluster databases group related proteins together and are derived

automatically from protein sequence databases using different clustering algorithms.

Each database uses a different clustering algorithm aimed to group the sequences

generally in homologous families and superfamilies. Assignment of a query sequence

without annotation to any of the clusters can provide valuable clues about the protein’s

function. Since they are derived automatically from protein sequence databases without

manual crafting and validation of family discriminators, these databases are relatively

comprehensive, although the biological relevance of clusters can be ambiguous and can

sometimes be an artifact of particular thresholds. Multidomain protein sequences are a

particular source of error, in that the clustering algorithm can establish a link between

two sequences A and B based on a particular domain, while a third sequence C might be

linked to sequence B based on another domain sequence. If unchecked, this would lead to

an incorrect clustering of sequences A, B and C. Table 1 provides a number of links to

websites of protein sequence clustering databases.

Interaction domains

A large and growing body of evidence suggests that proteins involved in the regulation of

cellular events such as the cell cycle, signal transduction, protein trafficking, targeted

proteolysis, cytoskeletal organization and gene expression are typically multi-domain

proteins comprising a catalytic domain and one or more interaction domains. Interaction

domains induce the interaction of signaling polypeptides and specific multi-protein

complexes. For example, they can couple a cell surface receptor to an intracellular

biochemical pathway that controls the cellular responses to an external signal. Interaction

domains can be involved in a series of protein-protein interactions, designed to localise

signaling proteins to appropriate subcellular locations or to determine the enzyme

specificity. A well studied example is the family of protein kinases in relation to their

substrates. Typically, protein-protein interaction domains are relatively small and

independently folding modules of 35-150 amino acids, that are able to fold in isolation

from their host proteins while retaining their intrinsic ability to bind their physiological

partners. Their N- and C-termini are usually close together in space, while their ligand-

binding surface lies on the opposite face of the domain. This arrangement facilitates their

evolutionary use as a modular plug-in since it allows the domain to be easily inserted into

a host protein while projecting its ligand-binding site to contact other polypeptides.

Protein-protein interaction domains can be clustered into separate families, based either

on their sequence or ligand-binding properties. Interaction domains are often used

repeatedly by a large number of multi-domain proteins that display very different domain

structures, where they mediate a particular type of molecular recognition.

   1. An important class of interaction domains recognises peptides that have been

       post-translationally modified. These modifications include phosphorylation,

       acetylation or methylation. The dynamic control of cellular behavior exerted by

covalent protein modifications therefore appears largely mediated by interaction

domains that regulate the associations of signaling protein cascades.

   a. Phosphorylation:

           i. An important member of this class is the SH2 domain, which is

               anticipated to be encoded in the human genome at least 120 times.

               A large number of cytoplasmic proteins contain one or two SH2

               domains that directly recognize phosphotyrosine-containing motifs

               that are found for example on surfaces of activated receptors for

               cytokines, growth factors and antigens. SH2 domains all recognize

               phosphotyrosines, but have different preferences for amino acids

               immediately following the phosphorylated residue, allowing a

               degree of specificity in signaling by tyrosine kinases.

           ii. Phosphotyrosine-containing motifs are also recognized by the PTB

               domain, a very different structure than the SH2 domain. PTB

               domains are found as constituents of binding proteins such as the

               IRS-1 substrate of the insulin receptor.

          iii. Phosphoserine and phosphothreonine motifs are targeted by a

               growing family of interaction domains, including FHA domains,

               14-3-3 proteins and WD40-repeat domains.                  Each of the

               subfamilies   and    proteins   contained    therein   bind   specific

               phosphoserine/threonine motifs, and thus mediate the biological

               activities of protein-serine/threonine kinases.

       b. In addition to phosphorylation, acetylation and methylation have recently

          been implicated in inducing modular protein-protein interactions. For

          example, acetylation or methylation of lysine residues on the surfaces of

          histone structures respectively create effective binding sites for Bromo and

          Chromo domain constituents of multi-domain proteins involved in

          chromatin remodeling.

2. A large group of interaction domains bind proline-rich motifs, including SH3,

   WW and EVH1 domains. The complexes that are recognised by these domains

   are therefore not dependent on post-translational modifications, and hence are

   more stable partners than those that, for example, need to be phosporylated to

   enable binding by SH2 domains. SH3 domains interact with a large number of

   different domain types and are therefore referred to as "promiscuous" domains.

3. PDZ domains bind the C-termini of their interaction partner domains, ion

   channels and receptors, in a fashion that appears important for the localization of

   their targets to particular subcellular sites, as well as for downstream signaling. In

   addition to binding short C-terminal peptide motifs, PDZ domains can form


4. In addition to binding specific peptide motifs, a growing number of interaction

   modules are known to recognize a variety of phospholipids, particularly

   phosphoinositides (PI). For example, PH domains are constituents of lipid kinases

   and phosphatases, where they affect cellular functioning by binding

   promiscuously to either the phosphoinositide PI-4,5-P2 or PI-3,4,5-P3. Based on

   their phospholipid-binding, PH domains can localise signaling proteins to specific

       subregions of the plasma membrane, where they regulate the enzymatic activities

       of their host proteins.

A predominant feature of interaction domains is their apparent versatility. Though PTB

domains were originally discovered based on their ability to bind phosphorylated

tyrosines within Asn-Pro-X-Tyr (NPXY) motifs, many PTB domains appear to be able

also to recognize unphosphorylated NPXY-related peptides. It therefore appears that PTB

domains originally evolved to bind unphosphorylated peptides, but have subsequently

developed a capacity to recognize phosphotyrosine in relatively few specific cases.

Furthermore, some individual PTB domains, e.g. those within the FRS-2 and Numb

proteins, have evolved to recognize two quite different peptide ligands.

Although many interaction domains bind very different substrates, a number of them

have similar tertiary structures, such as for example the PTB, EVH1 and PH domains,

which respectively recognize phosphotyrosine-containing peptide motifs, proline-rich

peptides and phospholipids (vide supra). The PTB/PH/EVH1 fold is shared by other

types of interaction domains as well. It seems therefore that this domain fold provides a

stable framework that can be used for modulating very many different intermolecular

interactions. On the other hand, many signaling proteins contain a number of different

interaction domains to yield a multi-domain protein that can mediate multiple protein-

protein and protein-phospholipid interactions. This modular organization of signaling

proteins can target proteins to the appropriate site within the cell, and direct binding to

cell surface receptors and downstream targets.

A number of bioinformatics approaches have been developed for the prediction of

protein–protein interactions based on the correspondence between the phylogenetic trees

of interacting proteins (Goh et al., 2000; Pazos and Valencia, 2001) and detection of

correlated mutations between pairs of proteins (Pollock and Taylor, 1997; Pazos et al.,

1997: Pollock et al., 1999).

Spacer domains and domain linkers

A number of protein domains are thought to have a purely structural role, allowing

functional domains to be advantageously situated or providing the overall protein with

the necessary flexibility. It must be kept in mind that a domain may be considered to have

a purely structural role simply because knowledge about its true function is missing. The

immunoglobulin (Ig) family provides examples of this, as the Ig constant heavy 1 and 2

domains and the Ig constant light domain act as “spacer” domains to allow the Ig variable

light and heavy domains to function optimally.

Multi-functional enzymes are generally composed of a number of discrete domains that

are connected by inter-domain linkers. A number of site-directed mutagenesis

experiments have shown that linkers play a crucial role in the enzymatic activity of multi-

domain complexes (George and Heringa, 2003). In general, mutating or altering the

length of linkers has been shown to affect protein stability, folding rates and domain-

domain orientation. The linkers provide dynamic structural bridges for the movement of

ligands across the modules. Many linkers act as rigid spacers separating two domains,

and provide a scaffold to prevent unfavorable interactions between folding domains.

George and Heringa (2003) found that the two main types of linker in natural proteins:

helical linkers and non-helical linkers. Helical linkers derive their rigidity from their

helical conformation, while non-helical linkers are rich in proline residues, which also

leads to structural rigidity. They should be flexible, keeping domains apart while

allowing them to move as part of their catalytic function. Linkers should furthermore be

invulnerable to host proteases, as they are often the targets for degradation.

Domain movements

Domain motions are important for a large repertoire of functions including ligand

binding, catalysis, regulation, formation of protein assemblies and transport of

metabolites (Gerstein et al., 1994). Often, the binding site is situated at the interface of a

pair of domains in the ‘closed’ conformation, corresponding to the inactive form of the

associated multi-domain enzyme, while the ‘open’ conformation of the domains exposes

the binding site and thus provides the active form of the enzyme. Domain motions are

often responsible for the mechanism of ‘induced fit’ observed in protein-protein


Protein flexibility often corresponds to large relative movements of domains. In aspartic

proteinases, for example, the two main domains forming the lobes of the structure show

large relative shifts that are crucial for the activity of the constituent enzyme (Sali et al.,

1992). Another example is the so-called swivelling domain in pyruvate phosphate

dikinase (Herzberg et al., 1996), which brings an intermediate enzymatic product over

about 45 Å from the active site of one domain to that of another. Such movements have

been determined by a conformational analysis of protein structures, where the

identification of for example hinge axes and their corresponding rotation angles permits a

useful representation of protein domain movements (Gerstein et al., 1994).

Three main mechansisms of domain movements are discernible (Gerstein et al., 1994):

   1. Intrinsic flexibility: These movements leads to deformations of the domain

       structure. Several small movements in secondary structures which have a

       cumulative effect. These motions are due to small changes in backbone torsion

       angles of strands and/or helices, kinks involving prolines in helices as in

       adenylate kinase (Gerstein et al., 1994), or interconversion of helix and extended

       conformations as in calmodulin (Ikura et al. 1992).

   2. Hinged domain movements: Of special interest are hinge-bending movements,

       where rigid domains are connected by flexible joints which tether the domains

       and constrain their movement. Hinge-bending is believed to allow an induced fit

       of molecular surfaces in protein assembly and ligand docking. These movements

       are caused by rotations of a domain around a pivotal point, often afforded by a

       linker region in between domains that changes its conformation to allow the

       rotation. For example, to attain icosahedral symmetry, the S and P domains of

       tobacco bushy stunt virus perform hinge movement mainly by dihedral angle

       changes in the peptide that forms a linker between the two domains (Olsen et al.,

       1983). In lactoferrin, the open and closed forms of the two domains involve a 53

       degree rotation which is achieved by conformational changes within two strands

       that connect the protein’s two domains (Jameson et al., 1998).

   3. Ball-and-socket motion: The variable (light and heavy chain) domains of

       antibodies rotate about 50 degrees with respect to the constant (light and heavy

       chain) domains by means of a combination of hinge and shear motions (Lesk and

       Chothia, 1988).

The method HINGEFIND (Wriggers and Schulten, 1997) identifies domain movements

and characterizes and visualizes the effective rotation axes (hinges). The program

compares a pair of known structures (e.g. two different crystal structures of a protein or

the results of molecular dynamics or Monte Carlo simulations) and partitions the protein

with a prespecified tolerance in preserved subdomains. It then determines effective

rotation axes which characterize the domain movements with respect to the reference

domain ("rigid core"). Hayward and Berendsen (1998) developed a method to analyse

comparative domain movements over larger numbers of structures, which is useful, for

example, for studying trajectories of molecular dynamics simulations.


The analysis and prediction methods that elucidate the evolutionary development and

function of domains and multi-domain complexes have recently been enriched by the

integration of genomic and dynamic contextual information. Many successes are to be

derived from the ongoing developments in spawning an integration web between

genomics, phylogenetic, sequence, structure, structural dynamics, and function data. This

will shed more light on the evolution of the structural and functional versatility of protein

domains,   which   will   benefit   biotechnical   and   biomedical   applications.


Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ

(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs. Nucleic Acids Res. 25:3389-3402.

Apweiler R et al. (2000) InterPro - an integrated documentation resource for protein

families, domains and functional sites. Bioinformatics. 16, 1145-1150.

Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti

L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H,

Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder

NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM. (2001). The InterPro

database, an integrated documentation resource for protein families, domains and

functional sites. Nucleic Acids Res. 29, 37–40.

Ashburner M et al. (2000) Gene ontology: tool for the unification of biology. The Gene

Ontology Consortium. Nature Genet., 25, 25-29.

Bennet MJ, Schlunegger MP, Eisenberg D (1995) 3D domain swapping: a mechanism for

oligomeric assembly. Protein sci., 4, 2455-2468.

Coin L, Bateman A and Durbin R (2003) Enhanced protein domain discovery by using

language modeling techniques from speech recognition. Proc. Natl. Acad. Sci. USA, 100,


Coin L, Bateman A and Durbin R (2004) Enhanced protein domain discovery using

taxonomy. BMC Bioinformatics, 5, 56-65.

Dandekar T, Huynen M, Regula JT, Zimmermann CU, Ueberle B, Andrade MA, Doerks

T, Sánchez-Pulido L, Snel B, Suyama M, Yuan YP, Herrmann R and Bork P (2000) Re-

annotating the Mycoplasma pneumoniae genome sequence: adding value, function and

reading frames. Nucleic Acids Research, 28, 3278-3288.

Das S and Smith T (2000) Identifying nature’s protein LEGO set. Adv. Protein Chem.,

54, 159-183.

Davidson J, Chen K, Jamison R, Musmanno L and Kern C (1993) The evolutionary

history of the first three enzymes in pyrimidine biosynthesis, Bioessays, 15, 157-164.

George RA and Heringa J (2000) The REPRO server: finding protein internal sequence

repeats through the Web. Trends in Biochem. Sci. 25, 515-517.

George RA and Heringa J (2002) Protein domain identification and improved sequence

similarity searching using PSI-BLAST, Proteins Struct. Func. Gen. 48, 672-681.

George RA and Heringa J (2002) SnapDRAGON - a method to delineate protein

structural domains from sequence data, J. Mol. Biol. 316, 839-851.

George RA and Heringa J (2003) An analysis of protein domain linkers: their

classification and role in protein folding, Prot. Engineer. 15, 871-879.

George RA, Kleinjung J and Heringa J (2003) Predicting protein structural domains from

sequence data. In: Bioinformatics and Genomes - Current Perspectives. Pp. 1-26. Horizon

Scientific Press, Norfolk, England. ISBN 1-898486-47-6.

Gerstein M, Lesk A and Chothia C (1994). Structural mechanisms for domain

movements in proteins. Biochemistry 33, 6739-6749.

Goh CS, et al. (2000). Co-evolution of proteins with their interaction partners. J. Mol.

Biol. 299, 283–293.

Guo JT, Xu D, Kim D and Xu Y (2003) Improving the performance of DomainParser for

structural domain partition using neural network, Nucl. Acids Res., 31, 944-952.

Hadley C and Jones D (1999) A systematic comparison of protein structure

classifications: SCOP, CATH and FSSP. Structure Fold. Des., 7, 1099-1112.

Hayward S and Berendsen HJC (1998) Systematic Analysis of Domain Motions in

Proteins from Conformational Change; New Results on Citrate Synthase and T4

Lysozyme, Protein, Struct. Func. Genet., 30, 144-154.

Heger A and Holm L (2000) Automatic detection and alignment of repeats in protein

sequences. Prot. Struct. Func. Genet., 41, 224-237

Henikoff S, Greene E, Pietrokovski S, Bork P, Attwood T and Hood L (1997) Gene

families: the taxonomy of protein paralogs and chimeras, Science, 278, 609-614.

Heringa J and Taylor WR (1997) Three-dimensional domain duplication, swapping and

stealing. Curr. Opin. Struct. Biol., 7, 416-421.

Hinsen K, Thomas A, Field MJ (1999) Analysis of domain motions in large proteins,

Prot. Struct. Func. Genet., 34, 369-382.

Holm L and Sander C (1994) Parser for protein folding units. Proteins Struct. Func.

Genet., 19, 231-234.

Holm L and Sander C (1998) Dictionary of recurrent domains in protein structures. Prot.

Struct. Func. Genet., 33, 88-96.

Ikura M, Clore GM, Gronenborn AM, Zhu G, Klee CB and Bax A (1992) Solution

Structure of a Calmodulin-Target Peptide Complex by Multidimensional NMR. Science

256, 632-638.

Islam S, Luo J and Sternberg M (1995) Identification and analysis of domains in proteins.

Prot. Engin., 8, 513-525.

Jameson GB, Anderson BF, Norris GE., Thomas DH, Baker BN (1998) Structure of

human apolactoferrin at 2.0-A resolution. Refinement and analysis of ligand-induced

conformational change, Acta Crystallogr. D, 54, 1319-1335.

Janin J and Wodak S (1983) Structural domains in proteins and their role in the dynamics

of protein function. Prog. Biophys. Molec. Biol. , 42, 21-78.

Koonin EV, Mushegian AR, Bork P (1996). Non-orthologous gene displacement. Trends

Genet., 12, 334-336.

Lesk, AM and Chothia C (1988) Elbow motion in the immunoglobulins involves a
molecular ball-and-socket joint Nature, 355, 188-190.

Lo Conte L, Ailey B, Hubbard TJ, Chothia C (2000) A structural classification of

proteins database, Nucl.Acids Res., 28, 257-259.

Marcotte EM, Pellegrini M, Ng H-L, Rice DW, Yeates T and Eisenberg D (1999)

Detecting Protein Function and Protein-Protein Interactions from Genome Sequences,

Science, 285, 751-753.

Nicols WL, Rose GD, Ten Eyck LF and Zimm BH (1995) Rigid domains in proteins: an

algorithmic approach to their identification, Prot. Struct. Func. Genet., 23, 38-48.

Olsen AJ, Bricogne G and Harrison SC (1983). Structure of tomato bushy stunt virus IV.

The virus particle at 2.9 A resolution J. Mol. Biol. 171: 61-93.

Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB and Thornton JM (1997)

CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093-1108.

Pazos F, Valencia A (2001). Similarity of phylogenetic trees as indicator of protein-

protein interaction. Protein Eng. 14, 609–614.

Pazos F, et al. (1997). Correlated mutations contain information about protein-protein

interaction. J. Mol. Biol. 271, 511–523.

Pellegrini M., Marcotte EM, Thompson MJ, Eisenberg D and Yeates TO (1999)

Functions by Comparative Genome Analysis: Protein Phylogenetic Profiles, Proc. Natl.

Acad. Sci. USA, 96, 4285-4288.

Pollock, DD and Taylor WR (1997) Effectiveness of correlation analysis in identifying

protein residues undergoing correlated evolution. Prot. Engin., 10, 647-657.

Pollock, DD, Taylor WR and Goldman N (1999) Coevolving protein residues: maximum

likelihood identefication and relationship to structure. J. Mol. Biol., 287, 187-198.

Romein JW, Heringa J, and Bal HE (2003) A million-fold speed improvement in

genomic repeats detection, SuperComputing'03, Phoenix, AZ, November, 2003.

Schultz J, Copley RR, Doerks T, Ponting CP, and Bork P (2000) SMART: a web-based

tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231-234

Sidiqui A and Barton G (1995) Continuous and discontinuous domains – an algorithm for

the automatic generation of reliable protein domain definitions. Prot. Sci., 4, 872-884.

Sidiqui AS, Dengler U and Barton GJ (2001) 3Dee: a database of protein structural

domains. Bioinformatics, 17, 200-201.

Snel B, Lehmann G, Bork P, and Huynen MA (2000) STRING: A web-server to retrieve

and display the conserved neighbourhood of genes. Nucleic Acids Res., 28, 3442-3444.

Sowdhamini R and Blundell TL (1995) An automatic method involving cluster analysis

of secondary structures for the identification of domains in proteins, Prot. Sci., 4, 506-


Szklarczyk R and Heringa J (2004) Tracking repeats using significance and transitivity.

Bioinformatics 20 Suppl. 1, i311-i317.

Tsai CJ and Nussinov R (1997) Hydrophobic folding units derived from dissimilar

monomer structures and their interactions. Protein Sci., 6, 24-42.

Taylor WR (1999) Protein structure domain identification. Prot. Engin. 12, 203-216.

Taylor WR, Heringa J, Baud F and Flores TP (2002) A Fourier analysis of symmetry in

proteins, Protein Engin. 15, 79-89.

Wriggers W and Schulten K (1997) Protein Domain Movements: Detection of Rigid

Domains and Visualization of Effective Rotations in Comparisons of Atomic

Coordinates. Proteins Struct., Funct. Genet., 29, 1-14.

Xu Y, Xu D, Gabow HN (2000) Proetin domain-decomposition using a graph-theoretic

approach, Bioinformatics, 16, 1091-1104.

Zehfus M (1997) Identification of compact, hydrophobically stabilized domains and

modules containing multiple peptide chains. Protein Sci., 6, 1210-1219.

Table 1. WWW addresses of domain-related databases of interest

Name             WWW

Sequence-based classification databases:







ProtFam List



Structural classification databases:



CATH    new/index.html


Protein sequence clustering databases:





Domain linker databases:


Protein domain motions databases:



Figure 1. Schematic illustration of different types of backbone connectivity in

multidomain protein structures (right) and their coding peptide sequences (left). Note that

continuous domains are encoded by uninterrupted sequences, whereas discontinuous

domain sequences show inserted sequence fragments.

Figure 2. Pyruvate kinase structure and domain connectivity. (a) The structure shows

three domains: domain 1 (discontinuous) is a  barrel regulatory domain; domain 2 is a

discontinuous / barrel catalytic substrate binding domain; and domain 3 corresponds to

a continuous / nucleotide binding domain. (b) Schematic representation of the

corresponding amino acid sequence, showing the segmental arrangement of the three

domains. The intercalated and interlaced structures have N- and C-termini indicated

(designated ‘N’ and ‘C’, respectively) for clarity.

Figure 1

            Sequence              Structure


              Concatenated                          domains


                                              C   Discontinuous


           = linker peptide

         Figure 2.


                                                  Domain 1

                                                  Domain 2

                                                  Domain 3

         12   41 42   115 116        218   219               384   385   529

     N                                                                         C
                                Domain 3

                                Domain 2

                                 Domain 1


Shared By: