Computational approaches to study transcriptional regulation

					                               758   Biochemical Society Transactions (2008) Volume 36, part 4

Computational approaches to study transcriptional regulation
M. Madan Babu
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, U.K.

In recent years, a number of technical and experimental advances have allowed us to obtain an unprecedented amount of information about living systems on a genomic scale. Although the complete genomes of many organisms are available due to the progress made in sequencing technology, the challenge to understand how the individual genes are regulated within the cell remains. Here, I provide an overview of current computational methods to investigate transcriptional regulation. I will first discuss how representing protein–DNA interactions as a network provides us with a conceptual framework to understand the organization of regulatory interactions in an organism. I will then describe methods to predict transcription factors and cis-regulatory elements using information such as sequence, structure and evolutionary conservation. Finally, I will discuss approaches to infer genome-scale transcriptional regulatory networks using experimentally characterized interactions from model organisms and by reverse-engineering regulatory interactions from gene expression data and genomewide location data. The methods summarized here can be exploited to discover previously uncharacterized transcriptional pathways in organisms whose genome sequence is known. In addition, such a framework and approach can be invaluable to investigate transcriptional regulation in complex microbial communities such as the human gut flora or populations of emerging pathogens. Apart from these medical applications, the concepts and methods discussed can be used to understand the combinatorial logic of transcriptional regulation and can be exploited in biotechnological applications, such as in synthetic biology experiments aimed at engineering regulatory circuits for various purposes.

Introduction
Regulation of gene expression at the transcriptional level is a fundamental mechanism that is evolutionarily conserved in all cellular systems [1]. This form of regulation is typically mediated by TFs (transcription factors) that bind to DNA and either activate or repress the expression of nearby genes [2,3]. Extensive research into understanding this process at the biochemical and structural level has provided us with an understanding that now allows us to investigate this process on a genomic scale (see Table 1 for genome-scale experimental strategies to probe protein–DNA interactions). These experimental approaches have been applied to several prokaryotes and eukaryotes, resulting in a wealth of information that is stored in publicly available databases [4–7]. To take advantage of this deluge of information, computational approaches to visualize, analyse and prioritize genes for further experimental characterization, or to understand general principles on a genomic scale, have become indispensable. In the present mini-review, I will provide an overview of the key computational approaches to represent this information and investigate transcriptional regulation on a genomic scale.

Key words: cis-regulatory element, computational approach, protein–DNA interaction, reverse engineering, target gene, transcriptional regulation, transcription factor.
Abbreviations used: ChIP, chromatin immunoprecipitation; Dam, DNA adenine methyltransferase; DamID, Dam identification; HMM, hidden Markov model; PSSM, position-specific scoring matrix; TF, transcription factor; TG, target gene.

Organization of the transcriptional regulatory network
Experiments performed over the previous years have resulted in a large amount of information on protein–DNA interactions and gene regulation in several model organisms [4–9]. In addition, advances in experimental techniques that detect protein–DNA interactions (see Table 1) have provided us with evidence for TF–DNA interactions on a genomic scale [10–13]. This deluge of information is best represented as the transcriptional regulatory network with nodes connected by edges (Figure 1) [13,14]. In such a network representation, the nodes represent either TFs or TGs (target genes) and the directed edges represent a regulatory interaction (protein–DNA interaction) between the TFs and their TGs [13,14].
   Representing regulatory interactions as a network provides us with a conceptual framework and the abstraction to understand general principles of regulation on a genomic scale [15]. A number of recent studies on transcriptional networks of prokaryotes and eukaryotes have shown that the structure of such networks can be investigated at least at three distinct levels of organization [14]. At the most basic level, the network consists of a single regulatory interaction between a TF and its TG (Figure 1a). At the local level of organization, studies have uncovered that the basic unit is organized into fundamental units of transcriptional regulation, called network motifs (Figure 1b) [16]. Finally, at the global level of

© The Authors Journal compilation © 2008 Biochemical Society. Biochem. Soc. Trans. (2008) 36, 758–765; doi:10.1042/BST0360758
                                                                                                  New Methods for the Study of Protein–Nucleic Acid Interactions     759
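
As a minimal sketch of the network representation described in the section "Organization of the transcriptional regulatory network" (TFs and TGs as nodes, regulatory interactions as directed edges), the Python snippet below stores TF-to-target edges in a dictionary and, as one use of that representation, enumerates feed-forward motifs as defined in Figure 1(b). The gene names and the helper names (`build_network`, `targets_of`, `regulators_of`, `feed_forward_motifs`) are hypothetical and chosen only for illustration; they do not come from any of the programs listed in Table 2.

```python
from collections import defaultdict


def build_network(edges):
    """Return {tf: set(targets)} from an iterable of (tf, target) pairs."""
    network = defaultdict(set)
    for tf, target in edges:
        network[tf].add(target)
    return dict(network)


def targets_of(network, tf):
    """Genes directly regulated by tf (outgoing edges)."""
    return sorted(network.get(tf, set()))


def regulators_of(network, gene):
    """TFs that directly regulate gene (incoming edges)."""
    return sorted(tf for tf, targets in network.items() if gene in targets)


def feed_forward_motifs(network):
    """All (top TF, intermediate TF, target) triples where the top TF regulates
    both other nodes and the intermediate TF regulates the target."""
    motifs = []
    for top, top_targets in network.items():
        for mid in top_targets:
            if mid == top or mid not in network:
                continue  # the intermediate node must itself be a TF
            for target in network[mid]:
                if target in top_targets and target not in (top, mid):
                    motifs.append((top, mid, target))
    return sorted(motifs)


# Hypothetical toy interactions, not taken from any database.
EDGES = [("tfA", "tfB"), ("tfA", "geneX"), ("tfB", "geneX"), ("tfB", "geneY")]
network = build_network(EDGES)
print(targets_of(network, "tfA"))
print(regulators_of(network, "geneX"))
print(feed_forward_motifs(network))
```

A dictionary of sets keeps edge direction explicit (the key regulates each member of its set), which is all that is needed for small examples; dedicated network packages become preferable for genome-scale data.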

Table 1 Genome-scale experimental methods to probe protein–DNA interactions

Method                             Description

ChIP-chip and ChIP-seq             The DNA-binding protein is tagged with an epitope and is expressed in a cell. The bound protein is covalently
  experiments                         linked to DNA by using an in vivo cross-linking agent such as formaldehyde. After cross-linking, DNA is
                                      sheared and the protein–DNA complex is pulled down using an antibody for the tag. Reversal of the
                                      cross-link releases the bound DNA, allowing the sequence of the fragments to be determined by
                                      hybridization to a microarray (ChIP-chip) or by sequencing (ChIP-seq).
                                   In ChIP-chip experiments, intergenic regions are spotted on to a microarray chip. Following a ChIP step, the
                                      bound fragments are reverse cross-linked and hybridized on to the chip. Complementary sequences will bind
                                      to specific spots on the chip, thereby providing the exact intergenic region to which the protein was bound
                                   In ChIP-seq experiments, the bound fragments are directly sequenced using 454/Solexa/Illumina Sequencing
                                      Technology. The sequences are then computationally mapped back to the genome sequence. Fragments that
                                      were bound by the protein will be sequenced several times providing a direct measure of enrichment
DamID [Dam (DNA adenine            To overcome any potential non-specific cross-linking of protein to DNA as could happen with ChIP-chip
  methyltransferase)                 experiments, the DamID technique was introduced. The protein of interest is fused to an E. coli protein, Dam.
  identification]                      Dam methylates the N6 position of the adenine in the sequence GATC, which occurs at reasonably high
                                      frequency in any genome (∼1 site in 256 bases). Upon binding DNA, the Dam protein preferentially
                                      methylates adenine in the vicinity of binding. Subsequently, the genomic DNA is digested by the DpnI and
                                      DpnII restriction enzymes that cleave within the non-methylated GATC sequence, and remove fragments that
                                      are not methylated. The remaining methylated fragments are amplified by selective PCR and quantified
                                      using a microarray [63].
PBMs (protein-binding              In contrast with the methods described above, this is an in vitro method to probe protein–DNA interaction. A
  universal DNA microarrays)          DNA-binding protein of interest is epitope-tagged, purified and bound directly to a double-stranded DNA
                                      microarray spotted with a large number of potential binding sites. Labelling with fluorophore-conjugated
                                      antibody for the tag allows detection of binding sites from the significantly bound spots [64].
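
The figure of roughly one GATC site in 256 bases quoted for DamID in Table 1 follows from the length of the motif if one assumes that the four nucleotides are equally frequent and independent, which is an idealization that real genomes only approximate. A short Python sketch of that arithmetic, with an invented sequence and hypothetical helper names (`expected_spacing`, `count_sites`):

```python
def expected_spacing(motif, alphabet_size=4):
    """Expected number of bases per occurrence of a fixed motif, assuming
    equally frequent, independent nucleotides (4**4 = 256 for GATC)."""
    return alphabet_size ** len(motif)


def count_sites(sequence, motif="GATC"):
    """Count (possibly overlapping) motif occurrences on the given strand.
    GATC is its own reverse complement, so one strand is sufficient here."""
    sequence = sequence.upper()
    motif = motif.upper()
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )


# Invented toy sequence, used only to exercise the counter.
toy_sequence = "ttGATCaagGATCGATCcc"
print(expected_spacing("GATC"))
print(count_sites(toy_sequence))
```

Comparing the observed count in a real genome with `len(genome) / expected_spacing("GATC")` would show how far its base composition departs from the uniform assumption.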

organization, the set of all transcriptional regulatory interactions in a cell forms the global structure that has been shown to have a scale-free topology (Figure 1c and see the Global network structure section below) [14]. Several computer programs are available to visualize large biological networks and some of the commonly used programs are listed in Table 2.

Figure 1 Organization of the transcriptional regulatory network
(a) The basic unit consists of a regulatory interaction (grey arrow) between a TF (black circle) and a TG (grey circle). (b) The basic unit forms small patterns of regulatory interactions called network motifs. The three most commonly occurring motifs are shown here: feed-forward motif (FFM, top), single input motif (SIM, middle) and multiple input motif (MIM, bottom). (c) The set of all transcriptional regulatory interactions is referred to as the transcriptional regulatory network.

Local network structure
A network motif is defined as a small pattern of interconnections that recurs at many different parts of the network at frequencies much higher than expected by chance when compared with random networks of similar size [17]. Analysis of the transcriptional networks of Escherichia coli and yeast has revealed the presence of three commonly occurring motifs, each of which has distinct kinetic properties in the control of gene expression [16]. These are (i) feed-forward motif (FFM; Figure 1b, top), where a top-level TF regulates both the intermediate-level TF and the TG, and the intermediate-level TF also regulates the TG. If both TFs are activators, such a connectivity pattern might ensure that the TG is expressed only when a persistent signal is received by the top-level TF. Since the concentration of the intermediate TF must build up before the final TG is regulated, random fluctuations and noise in activation of the top-level TF are filtered and do not get propagated. (ii) Single input motif (SIM; Figure 1b, middle), where a single TF regulates the expression of several TGs simultaneously. Depending on the promoter strength of the regulated genes, each may respond to different concentration levels of the active TF [16]. Therefore, if the concentration of the active TF changes with time, such a motif could set a temporal pattern in the expression of the individual targets. (iii) Multiple input motif (MIM; Figure 1b, bottom), where multiple TFs regulate the expression of


multiple TGs. Since the TFs could potentially respond to different signals, such motifs could therefore integrate diverse signals and bring about differential expression of the relevant targets. Thus regulation of genes via network motifs provides distinct ways of regulating gene expression. Since maintaining the right levels of TGs can affect fitness [18], it is very much possible that controlling gene expression via distinct motifs is advantageous under different conditions. Local network properties such as motif identification can be carried out using programs such as Mfinder, FanMod and Cytoscape (Table 2).

Global network structure
At the global level of organization, analysis of transcriptional networks has revealed that they display a scale-free topology [14]. In other words, such a network is characterized by the presence of a few highly influential TFs that regulate several genes and a large number of TFs that regulate only a few genes. The highly influential TFs are referred to as global regulators, or regulatory hubs, and their presence contributes to the inherent robustness of such a topology, where robustness is defined as the ability of complex systems to function even when the structure of the system is perturbed significantly [19,20]. A scale-free topology is robust because random inactivation of genes will probably affect the TFs that regulate a few genes, as these occur in very high numbers. This would leave a central, highly connected subnetwork that may still be functional. However, the downside of such a network structure is that it is vulnerable to targeted attacks on hubs, i.e. targeted removal of the very highly connected nodes will result in the collapse of the system into small sets of isolated fragments that no longer interact with each other. Therefore the global regulators are believed to be crucial for the robustness and functioning of the regulatory network [21]. Global network properties such as connectivity, modularity and clustering can be investigated using packages such as Cytoscape, Pajek and Topnet (Table 2).

Computational approaches to identify TFs and cis-regulatory elements

Identification of TFs
The simplest unit of a transcriptional network involves a TF that senses changes in the environment and binds to a cis-regulatory element in the promoter region of the relevant TGs to regulate their expression. Investigations of proteins that function as TFs through functional, biochemical and structural methods have revealed that these proteins are modular and contain at least two domains, where a domain is defined as a structural, functional and evolutionary unit of proteins [22]. Of the two domains, one, called the DNA-binding domain, recognizes and binds to specific DNA sequences, and the other is an effector domain that senses changes in the internal or external environment [23,24]. Such an effector domain can detect the changes in the local environment by directly binding to a small molecule (e.g. one-component systems) or by being post-translationally modified by other proteins that sense the external milieu (e.g. two-component signal transduction systems) [25–27]. Therefore one can use the information about the domain encoded to identify novel TFs in completely sequenced genomes using remote homology detection methods.
   Proteins containing specific domains can be identified in several ways, including profile-based methods such as PSI-BLAST and HMM (hidden Markov model)-based methods such as HMMER and SAM. Domain databases such as the PFAM database (Protein Families Database of Alignments and Hidden Markov Models; Software/Pfam/) use HMMER and protein sequence information to build their library of HMMs, whereas databases such as Superfamily use SAM-T99 and structural information to build a library of profiles (see Table 2). Several studies have used these domain-based approaches to predict TFs from completely sequenced genomes and have discovered novel TF families in key pathogens [28–32]. The information from such efforts has been stored in publicly available databases, which include BacTregulator, DBD and ArchaeTF (Table 2).

Identification of cis-regulatory elements
TFs bind to short DNA sequence motifs, also called binding sites, in promoter regions of transcriptional units. All the different binding sites recognized by the same TF can be conveniently represented as a consensus sequence. Alternatively, they can be represented as PSSMs (position-specific scoring matrices). Such matrices represent the probability of finding a particular nucleotide in a specific position and can be visualized using a logo representation (see Table 2, weblogo and enologos). The methods to detect cis-regulatory elements that could function as possible binding sites are normally referred to as pattern discovery algorithms. These algorithms rely either on a probabilistic description of the cis-regulatory element [33,34] or on exact words that are statistically over-represented in a set of sequences [35,36]. Using these, computational search procedures that scan promoter regions can detect sequence motifs, which potentially correspond to TF-binding sites. For these methods to work well, it is important that the background frequency of nucleotides and oligonucleotides is corrected accordingly, e.g. for the GC content of the organism of interest.
   Using these in silico methods in combination with external information [e.g. differential expression upon knockout of a TF or regions that are enriched in ChIP (chromatin immunoprecipitation)-chip experiments] could help in identifying motifs with high confidence for a particular TF. In addition, several evolutionary and biological principles can be applied to identify cis-regulatory elements. For instance, since cis-elements may be subjected to evolutionary selection, they evolve less rapidly than the surrounding non-coding regions within closely related organisms. Therefore conserved motifs upstream of orthologous genes in related species are more likely to be true cis-regulatory elements [37]. Similarly, since most TFs in prokaryotes function as dimeric units, they are more likely to recognize palindromic sequences. Therefore


motifs that are closely spaced and complementary are more likely to be functional cis-regulatory elements in bacteria [38].
   Several algorithms that use PSSMs or over-represented oligonucleotide sequences to identify regulatory elements exist in the literature. A systematic comparison of some of the key DNA sequence motif-detection algorithms has been carried out recently on co-regulated genes [39]. Although many approaches aim to predict cis-regulatory elements in a given set of sequences, combined approaches that employ different principles but arrive at the same motif are likely to identify true motifs with the highest confidence. Some of the commonly used motif discovery platforms include RSAT, seqVISTA and web-MOTIFS (Table 2).

Computational approaches to investigate transcriptional regulatory networks
While there has been significant progress in unravelling the transcriptional regulatory networks of various model organisms such as E. coli and Bacillus subtilis, much less information is available on the transcriptional networks of other prokaryotes. To gain a better understanding of the transcriptional regulatory network in other organisms, computational methods to extrapolate this information from model organisms to poorly studied organisms have been developed. The two major approaches to infer regulatory networks are described below.

Template-based methods
The template-based approach exploits the principle that orthologous TFs regulate orthologous TGs [32,40–44]. Thus, in this method, one starts with a known regulatory network and transfers information about interactions to orthologous genes in a target genome of interest. Such an approach may or may not explicitly involve the use of binding site information for the TF. Methods that do not use binding site data require the complete genome sequence and the transcriptional regulatory network for a reference organism. The protein sequence of TFs and TGs in the reference network is used to identify orthologous genes in the target organism to infer the conserved regulatory network. The method that uses binding site data requires reliable information on the cis-regulatory element for a TF. Such methods exploit the fact that the presence of a similar binding site upstream of different genes in a closely related species would imply that the orthologous TF regulates the TGs through similar binding sites. Both these methods have their advantages and disadvantages; the former method allows prediction of conserved interactions in distantly related organisms, but does not facilitate discovery of novel targets for a given TF. In contrast, the latter method allows detection of novel targets for a TF, but is not applicable to distantly related genomes because cis-regulatory elements are shorter and evolve relatively faster than the protein-coding regions, hence making detection of new interactions difficult. See Table 2 for databases that contain information about experimentally characterized transcriptional interactions and binding site data that can be used for network reconstruction.

Reverse engineering using gene expression data
In this approach, one scans for patterns in gene expression data from time-series experiments and from experiments conducted across different conditions [45–50]. If a gene is consistently differentially expressed (up- or down-regulated) after overexpression or knockout of a TF across several time points or different conditions, a regulatory interaction between the two is inferred. In the case of expression analysis over different experimental conditions, one infers sets of genes with a similar expression profile across many conditions to be co-regulated by the same set of TFs. Such inferences become more accurate as the number of measurements over a certain period of time (the time resolution of the data) increases, since this allows direct regulatory interactions to be distinguished from indirect (multistep) regulation. Variants of this approach make use of information about experimentally well-characterized TF-binding sites to make inferences about regulatory interactions. In this method, promoter regions in the genome of interest are scanned using known binding site profiles of characterized TFs. The genes that are predicted to have a binding site and are differentially regulated upon up-regulation or knockout of the TF are inferred to be regulated by the corresponding TF [51].
   While the methods mentioned above exploit different principles, there have been considerable efforts to develop combined approaches to predict regulatory interactions with a higher degree of confidence. For instance, while analysing microarray expression data, the initially determined sets of co-regulated genes can be refined by investigating whether the same TF actually binds to all of them by predicting the presence or absence of a binding site in the promoter region of these genes. This approach could be extremely powerful if one explicitly uses genomewide binding location data (e.g. ChIP-chip or ChIP-seq). Such an integrated approach can directly link a detected binding event, together with a predicted cis-regulatory element, to a change in gene expression of a relevant TG. In this way, one can distinguish directly regulated genes from the indirectly regulated ones or even genes that just randomly happen to show a similar expression profile [8,52–54].
   For a current evaluation of reverse-engineering methods, the reader is encouraged to visit the DREAM project website. DREAM is a Dialogue for Reverse Engineering Assessments and Methods, whose main objective is to catalyse the interaction between experiment and theory in the area of cellular network inference [55].

Conclusions and outlook
With advancements in large-scale experimental methodologies that detect protein–DNA interactions on a genomic scale such as ChIP-seq and tiling arrays, the general methods discussed here will be useful for investigating transcriptional regulation for completely sequenced genomes and

© The Authors Journal compilation © 2008 Biochemical Society

Table 2 Computer programs, databases and internet-based platforms for investigating transcriptional regulatory networks

Name                Comment                                                         Website

Network visualization
  Pajek             Visualization and analysis
  Cytoscape         Visualization and analysis
  Osprey            Visualization and analysis
  GraphViz          Visualization
  H3Viewer          Visualization                                                   ∼munzner/h3/
  Visant            Visualization and analysis
  Biolayout         Visualization
  Yed               Visualization and analysis
  NetMiner          Visualization and analysis (commercial)

Network analysis
  Mfinder           Network motif finder
  FanMod            Network motif finder                                            ∼wernicke/motifs/
  Clique finder     Identification of cliques
  MCode             Identification of densely connected subnetwork                  ∼bader/software/mcode/index.html
  Cytoscape         Several plug-ins in Cytoscape allow advanced analysis of network topology
  Vanted            Analysis of network containing experimental data
  Biotapestry       Drawing, analysis and visualization
  TYNA/Topnet       Network analysis
  NCT               Network comparison toolkit
  Bioconductor      Network analysis and visualization

Domain databases, genome assignments and TF
  Pfam              Sequence domains
  STRING            Genome context and SMART (simple modular architecture research tool) domain assignment
  CDD               Sequence and structural domains
  COG               Sequence domains
  CATH              Structural domains
  SCOP              Structural domains
  HOMSTRAD          Sequence and structural domain assignment                       ∼homstrad/
  Superfamily       Domain assignments to completely sequenced genomes
  Gene3D            Domain assignments to completely sequenced genomes
  KEGG              Encyclopaedia of genes and genomes
  Microbes Online   Domain assignment, expression data, evolutionary relationships and operon
  Interpro          Sequence and structural domain assignments
  BacTregulators    Database of TFs in bacteria and archaea
  DBD               A database of predicted TFs of over 700 completely sequenced genomes based on SCOP DNA-binding domains
  Protein Lounge    A database of TFs (commercial)
  Transfac          A TF database
  ArchaeaTF         An archaeal TF database

cis-Regulatory element identification, visualization and transcriptional network databases
  Vista             Tools for comparative analysis of genomic sequences
  RSAT              A very powerful platform for regulatory sequence analysis
  WebMOTIFS         Motif discovery, scoring, analysis, and visualization using different programs
  seqVISTA          Platform for binding site discovery
  RegTransBase      TF-binding sites and regulatory interactions
  Genome Atlas      Atlas of completely sequenced genomes
  Weblogo           Visualizing binding site information
  Enologos          Logo visualization
  RegulonDB         Database of TFs and binding sites for E. coli
  DBTBS             Database of TFs and binding sites for B. subtilis
  Coryneregnet      Database of regulatory network for several microbes
  Prodoric          Prokaryotic database of gene regulation
  TractorDB         Computationally predicted TF-binding sites in γ-proteobacterial genomes


for complex microbial communities. These results and predictions from the computational
approaches can serve as a scaffold for experimental studies on transcriptional control in
poorly characterized genomes, and could be relevant for designing experiments to
investigate regulation in medically important pathogens and for engineering regulatory
interactions in organisms with biotechnological value. On a more general level, such
methods can be used in synthetic biology experiments that aim to design circuits with
specific kinetic and regulatory properties [56] and may identify TFs with new modes of
regulation.

I acknowledge the Medical Research Council, Darwin College and Schlumberger for generous
support. I thank Arthur Wuster, Rekin’s Janky and Nitish Mittal for critically reading
this paper and for helpful suggestions.

References

 1 Ptashne, M. (2005) Regulation of transcription: from lambda to eukaryotes. Trends Biochem. Sci. 30, 275–279
 2 Browning, D.F. and Busby, S.J. (2004) The regulation of bacterial transcription initiation. Nat. Rev. Microbiol. 2, 57–65
 3 Zaman, Z., Ansari, A.Z., Gaudreau, L., Nevado, J. and Ptashne, M. (1998) Gene transcription by recruitment. Cold Spring Harbor Symp. Quant. Biol. 63, 167–171
 4 Huerta, A.M., Salgado, H., Thieffry, D. and Collado-Vides, J. (1998) RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucleic Acids Res. 26, 55–59
 5 Ishii, T., Yoshida, K., Terai, G., Fujita, Y. and Nakai, K. (2001) DBTBS: a database of Bacillus subtilis promoters and transcription factors. Nucleic Acids Res. 29, 278–280
 6 Baumbach, J., Brinkrolf, K., Czaja, L.F., Rahmann, S. and Tauch, A. (2006) CoryneRegNet: an ontology-based data warehouse of corynebacterial transcription factors and regulatory networks. BMC Genomics 7, 24
 7 Kazakov, A.E., Cipriano, M.J., Novichkov, P.S., Minovitsky, S., Vinogradov, D.V., Arkin, A., Mironov, A.A., Gelfand, M.S. and Dubchak, I. (2007) RegTransBase: a database of regulatory sequences and interactions in a wide range of prokaryotic genomes. Nucleic Acids Res. 35, D407–D412
 8 Wade, J.T., Struhl, K., Busby, S.J. and Grainger, D.C. (2007) Genomic analysis of protein–DNA interactions in bacteria: insights into transcription and chromosome organization. Mol. Microbiol. 65, 21–26
 9 Hawkins, R.D. and Ren, B. (2006) Genome-wide location analysis: insights on transcriptional regulation. Hum. Mol. Genet. 15, R1–R7
10 Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I. et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804
11 Horak, C.E., Luscombe, N.M., Qian, J., Bertone, P., Piccirrillo, S., Gerstein, M. and Snyder, M. (2002) Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes Dev. 16, 3017–3033
12 Grainger, D.C., Hurd, D., Goldberg, M.D. and Busby, S.J. (2006) Association of nucleoid proteins with coding and non-coding segments of the Escherichia coli genome. Nucleic Acids Res. 34, 4642–4652
13 Grainger, D.C., Overton, T.W., Reppas, N., Wade, J.T., Tamai, E., Hobman, J.L., Constantinidou, C., Struhl, K., Church, G. and Busby, S.J. (2004) Genomic studies with Escherichia coli MelR protein: applications of chromatin immunoprecipitation and microarrays. J. Bacteriol. 186, 6938–6943
14 Babu, M.M., Luscombe, N.M., Aravind, L., Gerstein, M. and Teichmann, S.A. (2004) Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, 283–291
15 Huber, W., Carey, V.J., Long, L., Falcon, S. and Gentleman, R. (2007) Graphs in molecular biology. BMC Bioinformatics 8, (Suppl. 6), S8
16 Alon, U. (2007) Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8, 450–461
17 Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68
18 Dekel, E. and Alon, U. (2005) Optimality and evolutionary tuning of the expression level of a protein. Nature 436, 588–592
19 Albert, R. (2005) Scale-free networks in cell biology. J. Cell Sci. 118, 4947–4957
20 Kitano, H. (2004) Biological robustness. Nat. Rev. Genet. 5, 826–837
21 Albert, R., Jeong, H. and Barabasi, A.L. (2000) Error and attack tolerance of complex networks. Nature 406, 378–382
22 Han, J.H., Batey, S., Nickson, A.A., Teichmann, S.A. and Clarke, J. (2007) The folding and evolution of multidomain proteins. Nat. Rev. Mol. Cell Biol. 8, 319–330
23 Madan Babu, M. and Teichmann, S.A. (2003) Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res. 31, 1234–1244
24 Wilson, D., Charoensawan, V., Kummerfeld, S.K. and Teichmann, S.A. (2008) DBD – taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res. 36, D88–D92
25 Aravind, L., Anantharaman, V., Balaji, S., Babu, M.M. and Iyer, L.M. (2005) The many faces of the helix–turn–helix domain: transcription regulation and beyond. FEMS Microbiol. Rev. 29, 231–262
26 Seshasayee, A.S., Bertone, P., Fraser, G.M. and Luscombe, N.M. (2006) Transcriptional regulatory networks in bacteria: from input signals to output responses. Curr. Opin. Microbiol. 9, 511–519
27 Martinez-Antonio, A., Janga, S.C., Salgado, H. and Collado-Vides, J. (2006) Internal-sensing machinery directs the activity of the regulatory network in Escherichia coli. Trends Microbiol. 14, 22–27
28 Balaji, S., Babu, M.M., Iyer, L.M. and Aravind, L. (2005) Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains. Nucleic Acids Res. 33, 3994–4006
29 Babu, M.M., Iyer, L.M., Balaji, S. and Aravind, L. (2006) The natural history of the WRKY–GCM1 zinc fingers and the relationship between transcription factors and transposons. Nucleic Acids Res. 34, 6505–6520
30 Martinez-Bueno, M., Molina-Henares, A.J., Pareja, E., Ramos, J.L. and Tobes, R. (2004) BacTregulators: a database of transcriptional regulators in bacteria and archaea. Bioinformatics 20, 2787–2791
31 Kummerfeld, S.K. and Teichmann, S.A. (2006) DBD: a transcription factor prediction database. Nucleic Acids Res. 34, D74–D81
32 Madan Babu, M., Teichmann, S.A. and Aravind, L. (2006) Evolutionary dynamics of prokaryotic transcriptional regulatory networks. J. Mol. Biol. 358, 614–633
33 Bailey, T.L. and Elkan, C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 21–29
34 Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214
35 Janky, R. and van Helden, J. (2007) Discovery of conserved motifs in promoters of orthologous genes in prokaryotes. Methods Mol. Biol. 395, 293–308
36 Wasserman, W.W. and Sandelin, A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5, 276–287
37 Blanchette, M., Schwikowski, B. and Tompa, M. (2002) Algorithms for phylogenetic footprinting. J. Comput. Biol. 9, 211–223
38 van Helden, J., Rios, A.F. and Collado-Vides, J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28, 1808–1818
39 Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J. et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144
40 Lozada-Chavez, I., Janga, S.C. and Collado-Vides, J. (2006) Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res. 34, 3434–3445
41 Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.D., Bertin, N., Chung, S., Vidal, M. and Gerstein, M. (2004) Annotation transfer between genomes: protein–protein interologs and protein–DNA regulogs. Genome Res. 14, 1107–1118
42 Alkema, W.B., Lenhard, B. and Wasserman, W.W. (2004) Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res. 14, 1362–1373
43 Price, M.N., Dehal, P.S. and Arkin, A.P. (2007) Orthologous transcription factors in bacteria have different functions and regulate different genes. PLoS Comput. Biol. 3, 1739–1750
44 Gelfand, M.S. (2006) Evolution of transcriptional regulatory networks in microbial genomes. Curr. Opin. Struct. Biol. 16, 420–429


45 Segal, E., Friedman, N., Kaminski, N., Regev, A. and Koller, D. (2005) From signatures to models: understanding cancer using microarrays. Nat. Genet. 37, (Suppl.), S38–S45
46 Margolin, A.A. and Califano, A. (2007) Theory and limitations of genetic network inference from microarray data. Ann. N.Y. Acad. Sci. 1115, 51–72
47 Gardner, T.S., di Bernardo, D., Lorenz, D. and Collins, J.J. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105
48 Wang, Y., Joshi, T., Zhang, X.S., Xu, D. and Chen, L. (2006) Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 22, 2413–2420
49 Nachman, I., Regev, A. and Friedman, N. (2004) Inferring quantitative models of regulatory networks from expression data. Bioinformatics 20, (Suppl. 1), i248–i256
50 Friedman, N. (2004) Inferring cellular networks using probabilistic graphical models. Science 303, 799–805
51 Bussemaker, H.J., Li, H. and Siggia, E.D. (2001) Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171
52 Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y., Robert, F., Gordon, D.B., Fraenkel, E., Jaakkola, T.S., Young, R.A. and Gifford, D.K. (2003) Computational discovery of gene modules and regulatory networks. Nat. Biotechnol. 21, 1337–1342
53 Lang, B., Blot, N., Bouffartigues, E., Buckle, M., Geertz, M., Gualerzi, C.O., Mavathur, R., Muskhelishvili, G., Pon, C.L., Rimsky, S. et al. (2007) High-affinity DNA binding sites for H-NS provide a molecular basis for selective silencing within proteobacterial genomes. Nucleic Acids Res. 35, 6330–6337
54 Kim, H., Hu, W. and Kluger, Y. (2006) Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae. BMC Bioinformatics 7, 165
55 Stolovitzky, G., Monroe, D. and Califano, A. (2007) Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference. Ann. N.Y. Acad. Sci. 1115, 1–22
56 Chin, J.W. (2006) Modular approaches to expanding the functions of living matter. Nat. Chem. Biol. 2, 304–311
57 Horak, C.E. and Snyder, M. (2002) ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods Enzymol. 350, 469–483
58 Lee, T.I., Johnstone, S.E. and Young, R.A. (2006) Chromatin immunoprecipitation and microarray-based analysis of protein location. Nat. Protoc. 1, 729–748
59 Buck, M.J. and Lieb, J.D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83, 349–360
60 Hudson, M.E. and Snyder, M. (2006) High-throughput methods of regulatory element discovery. BioTechniques 41, 673–681
61 Fields, S. (2007) Molecular biology: site-seeing by sequencing. Science 316, 1441–1442
62 Johnson, D.S., Mortazavi, A., Myers, R.M. and Wold, B. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502
63 Greil, F., Moorman, C. and van Steensel, B. (2006) DamID: mapping of in vivo protein-genome interactions using tethered DNA adenine methyltransferase. Methods Enzymol. 410, 342–359
64 Bulyk, M.L. (2006) DNA microarray technologies for measuring protein–DNA interactions. Curr. Opin. Biotechnol. 17, 422–430

Received 18 March 2008
doi:10.1042/BST0360758
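
Illustrative sketch (not part of the original article). The two template-based strategies described in the text, transferring known TF–TG interactions through orthology and scanning promoters of the target genome for a known binding site, might be sketched in Python as follows. All gene names, sequences, the consensus site and the function names are hypothetical toy values chosen for illustration. Real analyses would normally use position weight matrices, both DNA strands and a proper orthologue-detection procedure rather than the exact-lookup and mismatch-count shortcuts assumed here.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (transcription factor, target gene)


def transfer_by_orthology(reference_edges: Iterable[Edge],
                          orthologs: Dict[str, str]) -> Set[Edge]:
    """Keep a reference TF->TG edge only if BOTH genes have an orthologue
    in the target genome (no binding site information is used)."""
    inferred = set()
    for tf, tg in reference_edges:
        if tf in orthologs and tg in orthologs:
            inferred.add((orthologs[tf], orthologs[tg]))
    return inferred


def has_site(promoter: str, consensus: str, max_mismatches: int = 1) -> bool:
    """True if any window of the promoter (forward strand only) differs
    from the consensus at no more than max_mismatches positions."""
    k = len(consensus)
    for i in range(len(promoter) - k + 1):
        window = promoter[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            return True
    return False


def targets_by_binding_site(tf: str, consensus: str,
                            promoters: Dict[str, str],
                            max_mismatches: int = 1) -> Set[Edge]:
    """Predict targets of tf as every gene whose promoter carries the site."""
    return {(tf, gene) for gene, seq in promoters.items()
            if has_site(seq, consensus, max_mismatches)}


# Hypothetical reference network and orthologue map (toy data).
reference_edges = [("tfA", "gene1"), ("tfA", "gene2"), ("tfB", "gene3")]
orthologs = {"tfA": "tfA_t", "gene1": "g1_t", "gene3": "g3_t"}
conserved = transfer_by_orthology(reference_edges, orthologs)

# Hypothetical promoter sequences in the target genome (toy data).
promoters = {"g1_t": "AATTGACATT", "g9_t": "CCTTGACAGG", "g5_t": "CCCCCCCCCC"}
predicted = targets_by_binding_site("tfA_t", "TTGACA", promoters,
                                    max_mismatches=0)

# Interactions seen only by the binding-site scan are candidate novel targets.
novel = predicted - conserved
print(sorted(conserved), sorted(novel))
```

In this toy case the orthology transfer recovers only the interaction whose TF and TG are both conserved, whereas the binding-site scan also proposes a target that is absent from the reference network, mirroring the trade-off between the two methods discussed in the text.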
