					    Comparative genome analysis suggests characteristics of yeast inverted repeats that are
                         important for transcriptional activity


         Emily L. Humphrey-Dixon (a, c), Richard Sharp (b), Michael Schuckers (b), Robin Lock (b)




a
 Departments of Biology and Chemistry,
St. Lawrence University,
23 Romoda Drive, Canton, NY 13617
Email- Emily Humphrey-Dixon: edixon@stlawu.edu
b
 Department of Mathematics, Statistics and Computer Science,
St. Lawrence University,
23 Romoda Drive, Canton, NY 13617
Email- Richard Sharp: rsharp@stlawu.edu
Email- Michael Schuckers: schuckers@stlawu.edu
Email- Robin Lock: rlock@stlawu.edu
c
 Corresponding Author
Departments of Biology and Chemistry
St. Lawrence University
23 Romoda Drive
Canton, NY 13617
Phone: (315) 229-5671
Fax: (315) 229-7429
Email: edixon@stlawu.edu




ABSTRACT

Inverted repeats are sequences of DNA that, when read in the 5’ to 3’ direction, have the same

sequence on both strands (palindromic portion), with the exception of a small number of

nucleotides in the exact center (non-palindromic spacer). They have been implicated in various

DNA-mediated processes including replication, transcription and genomic instability. At least

some of these sequences are capable of forming an alternative DNA structure called a cruciform

that may be important for mediating these functions. We generated a list of inverted repeats in

the S. cerevisiae genome, and determined which of them are conserved in three related yeasts.

We have identified characteristics of inverted repeats that make them more likely to be

conserved than the surrounding DNA, and characteristics, such as position and base composition,

that make the genes they are associated with more likely to be transcriptionally active. This is an

important step in determining the functions of this group of genomic elements.



Keywords: inverted repeat; DNA structure; palindrome; transcription; comparative genome

analysis; DNA conformation




INTRODUCTION

       Specific sequence motifs that are important for DNA metabolism have been identified in

many genomes. However, there are also alternative DNA structures that can form from many

different DNA sequences. Sequences capable of forming these structures are harder to identify,

and the precise characteristics of these sequences that allow them to form alternative structures in

vivo are not well understood. However, because these alternative structures are likely to change

the way proteins interact with the DNA, it is thought that they may play important roles in DNA

metabolism. One such structure is the cruciform. This structure can be formed by DNA

sequences that contain inverted repeats (IRs). These are sequences of DNA that, when read in

the 5’ to 3’ direction, have the same sequence on both strands (palindromic portion), with the

exception of a spacer consisting of a small number of nucleotides in the exact center (Table

1). IRs without a central spacer sequence are also known as palindromes. Under the right

conditions, at least some IRs can undergo intrastrand base pairing, giving rise to the cruciform

structure. This structure has been shown to form in vitro and in vivo in negatively supercoiled

DNA {{231 Singleton,C.K. 1982; 234 Krasilnikov,A.S. 1999; 232 Dayn,A. 1992; 236

Greaves,D.R. 1985; 70 Panyutin,I. 1985; 238 Panayotatos,N. 1987}}.

       Inverted repeats and the cruciform structure have been implicated in several DNA- and

RNA-related processes including DNA replication {{230 Jin,R. 1997}}, recombination {{91

Nasar,F. 2000}}, transcription {{225 Horwitz,M.S. 1988; 57 Bagga,R. 1990}}, and DNA repair

{{222 Downing,B. 2008}}. While there have been several recent genome-wide studies focusing

on long IRs {{43 Callejo,M. 2002; 94 Zhao,G. 2007; 93 Wang,Y. 2009; 83 Lisnic,B. 2009; 219

Wang,Y. 2006}}, shorter IRs have not been well studied. While long IRs (defined as being

anywhere from greater than 30 bp to greater than 160 bp) have been shown to lead to genomic



instability (reviewed in {{33 Lobachev,K.S. 2007}}) and have been found to be

underrepresented in genomes, IRs of moderate length (10-30 bp) have been shown to be

overrepresented in genomes and short IRs (<10 bp) have been shown to be underrepresented

{{50 Schroth,G.P. 1995; 85 Lu,L. 2007; 89 LeBlanc,M.D. 2000}}. This suggests that long,

moderate, and short IRs may be functional, and their functions may be different. So far it has

been difficult to use genome-wide methods to determine what characteristics make an IR of

moderate or short length likely to be functional, and what if any function a given IR may have.

       There have been several reports of genome-wide searches for IRs in varied species {{87

Lisnic,B. 2005; 85 Lu,L. 2007; 50 Schroth,G.P. 1995; 89 LeBlanc,M.D. 2000; 94 Zhao,G. 2007;

239 Strawbridge,E.M. 2010}}. While these studies resulted in important information about the

distribution of IRs throughout genomes, the large numbers and wide distribution of these IRs

make it difficult or impossible to attribute a function to particular IRs or types of IRs without

testing each one individually. Comparative genome analysis of C. elegans and related species

has shown that a mechanism exists to maintain the structure of long IRs in intergenic regions

{{94 Zhao,G. 2007}}. Comparative genome analysis of four closely related yeast species has

previously uncovered, among other things, many sequence-specific regulatory motifs {{86

Kellis,M. 2003}}. With the availability of the genome sequences of these yeasts, it is now

possible to perform a comparative analysis of the conservation of IRs between yeast species.

       Here, we report the results of a comprehensive genome-wide search for IRs in the

genomes of S. cerevisiae and the three related yeast species S. paradoxus, S. mikatae and S.

bayanus. We have determined which IRs are conserved between species, and taken these data

together with genome-wide measures of transcriptional activity in S. cerevisiae to determine




what characteristics of IRs make them more likely to be associated with transcriptionally active

genes.


METHODS

Definition of IRs

         Our definition of an inverted repeat is intentionally broad in order to identify which types

of IRs are most likely to be functional. We required the length of each half of the IR to be at

least 3 nucleotides long (palindromic length of 6), and allowed the non-palindromic spacer

sequence to range from 0 to 10 nucleotides (Table 1). While the small non-palindromic

spacer sequence in the middle of an IR cannot participate in the intrastrand base pairing, these

sequences are thought to form a single-stranded loop at the tips of the cruciform structure. We

chose a maximum of 10 bases of spacer sequence because it has been suggested that cruciform

extrusion is independent of the sequence of the central 10 bases of a long IR {{49 Allers,T.

1995}}. This is in spite of the previous findings that the kinetics of cruciform extrusion depend

on the length and sequence of this spacer {{73 Lilley,D.M. 1985}}. This definition of IRs

includes the types of sequences studied by other groups in several recent studies of

IRs {{87 Lisnic,B. 2005; 85 Lu,L. 2007; 50 Schroth,G.P. 1995}}. When looking at the

conservation of IRs, we required the entire IR sequence to be identical between species, but

allowed the non-palindromic spacer sequence to vary (Table 1).
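
The definition above can be made concrete with a short check. The following Python sketch is illustrative only and is not the code used in this study: the function names, the restriction to uppercase A/C/G/T, and the choice to report the longest possible arms are assumptions of the sketch. It tests whether a whole sequence qualifies as an IR with arms of at least 3 nucleotides and a spacer of 0 to 10 nucleotides.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_inverted_repeat(seq, min_arm=3, max_spacer=10):
    """Return (arm_length, spacer_length) if seq is an IR, else None.

    The longest arms are tried first, so the reported spacer is the
    smallest one compatible with the sequence.
    """
    n = len(seq)
    for arm in range(n // 2, min_arm - 1, -1):
        spacer = n - 2 * arm
        if spacer > max_spacer:
            break  # shorter arms would only make the spacer longer
        left, right = seq[:arm], seq[n - arm:]
        if left == reverse_complement(right):
            return arm, spacer
    return None


print(is_inverted_repeat("GAATTC"))      # perfect palindrome, no spacer
print(is_inverted_repeat("ACGTTTACGT"))  # 4-nt arms around a 2-nt spacer
```

Under this sketch the palindromic length is twice the returned arm length, so the minimum palindromic length of 6 corresponds to `min_arm=3`.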


Sequences

           The sequences of promoters and coding regions for S. cerevisiae, S. paradoxus, S.

mikatae and S. bayanus were obtained from {{86 Kellis,M. 2003}} via the authors’ website

(http://www-genome.wi.mit.edu/seq/Saccharomyces/). All genes were considered for the



analysis of the S. cerevisiae genome, while only genes for which an exact match was available in

all 4 species were considered for the analysis of conserved IRs. We defined the promoter region

as the intergenic region upstream of a gene. Inverted repeats were found in several

mitochondrial genes; however, these IRs were excluded from our analysis.


Inverted Repeat Detection

          Inverted repeats were detected using a center anchor substring searching algorithm

(Fig. 1). The algorithm examines all base pairs across a DNA region to determine if the center of

an IR is anchored on that base.

          At each base the algorithm progresses through two phases. The first phase searches

pairwise outward (one base upstream and one base downstream) from the anchor to count the

number of non-IR bases that occur; these bases make up the non-palindromic core of the IR.

Since core lengths can be even or odd, we run this algorithm on both. For odd core lengths, the

search proceeds outward from the center anchor. For even core lengths, the anchor is used as the left

base in the first pair compared. During this phase, if the length of the non-IR core exceeds some large value (we

used 10) the test fails and the next anchor is tested. If an IR pair is encountered the test switches

to the IR detection phase.

          In the IR detection phase, bases are compared pairwise outward from the core. Any IR

matching pair is included in the total IR substring. If a non-IR pair is encountered, the algorithm

records the current IR and continues a search using the current IR as a non-IR core to see if a

longer IR can be detected. If using the current IR as a core would create a core larger than the maximum size

(10 bp), the search is terminated and the largest IR, if its length is sufficiently long (at least 6 bp),

is reported. In either case the algorithm continues its search until the anchor passes through all

bases in the DNA region.
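
A simplified variant of this center-anchored search can be sketched as follows. This is not the two-phase implementation described above: as an assumption of the sketch, every allowed spacer length is tried at each center of symmetry and the IR with the longest arms is kept, which should report at most one IR per center. The function name, the 0-based half-open coordinates, and the tie-breaking rule (smallest spacer wins) are likewise hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_inverted_repeats(seq, min_arm=3, max_spacer=10):
    """Return one (start, end, arm, spacer) tuple per center of symmetry.

    start/end are 0-based with end exclusive. arm is the length of each
    half (palindromic length = 2 * arm); spacer is the core length.
    """
    n = len(seq)
    found = []
    # center2 = (innermost index of left arm) + (innermost index of right arm).
    # Odd values lie between two bases, even values lie on a base.
    for center2 in range(1, 2 * n - 2):
        best = None
        first_spacer = (center2 + 1) % 2  # even spacers for odd center2
        for spacer in range(first_spacer, max_spacer + 1, 2):
            i = (center2 - spacer - 1) // 2  # innermost base of left arm
            j = i + spacer + 1               # innermost base of right arm
            if i < 0 or j >= n:
                break
            arm = 0
            # Extend the arms outward while the paired bases are complementary.
            while i >= 0 and j < n and COMPLEMENT.get(seq[i]) == seq[j]:
                arm += 1
                i -= 1
                j += 1
            if arm >= min_arm and (best is None or arm > best[2]):
                best = (i + 1, j, arm, spacer)
        if best is not None:
            found.append(best)
    return found


print(find_inverted_repeats("GAATTC"))
print(find_inverted_repeats("ACGTTTACGT"))
```

Because the spacer is capped at a small constant, the work per center is bounded by the arm length, so the cost grows roughly linearly with the number of anchor positions for typical sequences.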

Inverted Repeat Conservation

         Conserved IRs were identified by considering only those genes determined by Kellis et

al.{{86 Kellis,M. 2003}} to have one-to-one correspondence between S. cerevisiae, S.

paradoxus, S. mikatae and S. bayanus. We considered an IR as conserved if the identical

palindromic base pairs occurred at the same position in the DNA alignment of the four species

created by Kellis et al. {{86 Kellis,M. 2003}}, irrespective of conservation in the species' core

regions. We used previously determined conservation rates for promoters and coding regions

{{86 Kellis,M. 2003}} to determine how many of the IRs we detected are expected to be

conserved. Expected conservation rates for IRs of each palindromic length were determined

using the following formula, where R is the expected conservation rate, C is the conservation

rate for individual bases in either the promoter or coding region and L is the palindromic length:

                                               $R = C^{L}$

That is, the conservation rate for individual bases in either the promoter or coding region is raised to the

power of the palindromic length, giving the likelihood that all bases of the IR are

conserved. The number of IRs found in S. cerevisiae genes that have one-to-one correspondence

with the other three species, N, was multiplied by this expected conservation rate to determine

the expected number of conserved IRs. The expected number of conserved IRs, E, was

determined as follows:

                                              $E = N R$
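
As a worked illustration of the two formulas (R equals C raised to the power L, and E equals N times R), the sketch below computes an expected conservation rate and an expected count. The input values are made up for the example and are not the conservation rates reported by Kellis et al. or the counts from this study; the function name is also hypothetical.

```python
def expected_conserved(n_irs, base_rate, pal_length):
    """Return (R, E): expected conservation rate and expected count.

    R = C ** L assumes each base is conserved independently with rate C.
    E = N * R scales that rate by the number of IRs examined.
    """
    rate = base_rate ** pal_length
    return rate, n_irs * rate


# Hypothetical inputs: C = 0.5, L = 6, N = 1000.
rate, expected = expected_conserved(n_irs=1000, base_rate=0.5, pal_length=6)
print(rate, expected)
```

Because R falls geometrically with L, even a moderate per-base rate implies that very few long IRs would be conserved by chance.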




Synthetic Genome Generation




          We generated synthetic S. cerevisiae genomes to test the statistical significance of long

IRs in the original genome. Synthetic genomes were generated using a classic second order

Markov chain method whose base pair distributions were sampled from the promoter and coding

regions of all genes in the original S. cerevisiae genome that were reported in {{86 Kellis,M.

2003}}.
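
One way to build such a generator is sketched below. It is a minimal illustration rather than the code used here: the function names, the handling of dinucleotide contexts never seen in training (a uniform draw), and the way the first two bases are supplied are all assumptions of the sketch. In a second-order chain, each new base is drawn from the distribution of bases observed after the preceding two bases in the training sequences.

```python
import random
from collections import Counter, defaultdict


def train_second_order(sequences):
    """Count which base follows each dinucleotide in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for k in range(len(seq) - 2):
            counts[seq[k:k + 2]][seq[k + 2]] += 1
    return counts


def generate(counts, length, seed_pair, rng):
    """Generate a synthetic sequence of the requested length."""
    out = list(seed_pair)
    while len(out) < length:
        context = "".join(out[-2:])
        followers = counts.get(context)
        if not followers:
            # Unseen context: fall back to a uniform draw (sketch assumption).
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*followers.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


model = train_second_order(["ACGTACGTACGT", "TTGACCA"])
print(generate(model, 30, "AC", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps individual synthetic genomes reproducible, which is convenient when many are generated.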

          To control for variance in average IR lengths between individual simulations, we

measured the standard deviation in average IR lengths of 10, 100, and 1000 synthetic genomes.

The deviation between 100 and 1000 runs was nearly identical suggesting that a higher number

of iterations would not significantly reduce variance.

          As described above, the cutoff of long core lengths makes this algorithm's run time

proportional to the number of anchor bases in the DNA region. In practice,

runtimes were reasonable; the generation and IR detection for 1000 synthetic genomes executed

in under two hours on an Intel Core 2 2.66 GHz workstation with 2 GB of RAM. We assessed the

significance of IR length by directly observing what percentage of synthetic

genomes had inverted repeats longer than the observed value in the original genome; this

percentage is an empirical p-value.
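
The empirical p-value described here amounts to a simple proportion. The sketch below uses invented simulation values and a hypothetical function name; it assumes one summary statistic (for example, the longest IR) per synthetic genome.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated values strictly greater than the observed one."""
    exceed = sum(1 for value in simulated if value > observed)
    return exceed / len(simulated)


# Invented example: longest IR in four synthetic genomes versus an observed 44.
print(empirical_p_value(44, [20, 30, 46, 25]))
```

With a finite number of simulations, the smallest nonzero p-value that can be resolved is one over the number of synthetic genomes.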

Cluster Detection


We define an IR cluster as a set of IRs that overlap on at least one base with one other IR in the

set. After the IR detection phase is complete, we detect clusters by sorting IRs by increasing

start position and then by increasing end position if start positions are identical. We detect IR

clusters by inspecting IRs by increasing start position and testing if the start position is less than

the previous IR’s ending position. If so, it is added to the current cluster; an IR whose start

position is greater than the previous IR’s ending position indicates the end of a cluster.
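
The sorting and sweep steps above can be sketched as follows. Several details are assumptions of the sketch and not statements about our implementation: IRs are given as half-open (start, end) intervals, the comparison uses the furthest end seen so far in the current cluster instead of only the previous IR's end, and an IR that overlaps nothing is returned as a cluster of one.

```python
def find_clusters(irs):
    """Group overlapping (start, end) intervals; end is exclusive."""
    clusters = []
    current, current_end = [], None
    # Tuples sort by start position, then by end position.
    for start, end in sorted(irs):
        if current and start < current_end:
            # Shares at least one base with the current cluster.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current, current_end = [(start, end)], end
    if current:
        clusters.append(current)
    return clusters


print(find_clusters([(4, 10), (0, 6), (30, 36), (10, 16)]))
```

Tracking the furthest end guards against a short IR nested inside a longer one ending the cluster prematurely.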

RESULTS

Inverted repeat content of the Saccharomyces cerevisiae genome

       We found that S. cerevisiae has 1,330,320 IRs with at least 1 IR found in each of the

5289 genes analyzed. We found 949,234 IRs in coding regions and 381,086 IRs in promoters.

As expected, fewer IRs are found as the palindromic length increases. The longest IR we found

was in the promoter of YDL131W (LYS21). It has a palindromic length of 44 bp and no non-

palindromic core. In promoters, IRs with palindromic lengths of 10 or less occur less frequently

than expected, and those with palindromic lengths of 14 or greater occur more frequently than

would be expected. In coding regions, only IRs with palindromic lengths of 6, 8, and 20 or

greater occur more often than expected (Table 2). While it appears that for some palindromic

lengths there are many more IRs in the coding regions than in the

promoters, this is because the total length of S. cerevisiae promoters that we analyzed is

approximately half the total length of coding regions that we analyzed. We found a similar

number of IRs per base analyzed in promoters and coding regions for palindromic lengths ≤ 10,

but more IRs per base analyzed in the promoters for palindromic lengths of 12 and greater. A

complete list of the IRs we found in S. cerevisiae promoters and coding regions can be found in

Supplementary Tables 1 and 2, respectively.



Inverted Repeat Conservation

       The large number of IRs and their distribution among all S. cerevisiae genes make it

difficult to determine which IRs are most likely to be functional, what function they may have,

and what characteristics might define functional IRs. Because of this, we decided to look for IRs



that are conserved between 4 closely related yeast species. We found that 122,147 of the

1,035,415 IRs found in S. cerevisiae in the 4179 genes that can be unambiguously aligned were

conserved in all four species. IRs of all palindromic lengths except 20 found in promoters and

coding regions are more likely to be conserved than would be expected based on conservation

rates determined by Kellis et al. {{86 Kellis,M. 2003}} (Table 3). This suggests that in both

promoters and coding regions, IRs are more likely to be functional than the surrounding

sequence.

       We also looked at the presence of IRs with different spacer lengths in S. cerevisiae and

the likelihood of IRs with different spacer lengths being conserved. To control for the different

likelihoods of conservation of IRs with different palindromic lengths, we looked at rates of

conservation within each palindromic length. We found that no spacer length is found more

frequently than expected, and we saw no relationship between spacer length and IR conservation

(data not shown).

       We next determined whether base composition altered the likelihood of IRs being

conserved. While most of the IRs we found are composed of all four bases, we found many A/T-

only IRs, and a smaller number of G/C-only IRs. When base composition was not taken into

account, we found that 105,597 of the 762,183 S. cerevisiae coding region IRs were conserved

(13.8%) and 16,550 of the 273,232 S. cerevisiae promoter IRs were conserved (6.1%). In coding

regions, S. cerevisiae has 187,036 A/T-only IRs, and 29,270 are conserved (15.6%). In

promoters, we found 88,544 A/T-only IRs in S. cerevisiae, and 6,676 of them were conserved

(7.5%). In coding regions, S. cerevisiae has 13,781 G/C-only IRs and 1,129 are conserved

(8.2%). In promoters, S. cerevisiae has 6,214 G/C-only IRs, and 1,121 are conserved (18.0%).

This higher level of conservation suggests that A/T-only IRs in both promoters and coding



regions and G/C-only IRs in promoters may be more likely to be functional than IRs in general.

A complete list of the conserved IRs we found in promoters and coding regions can be found in

Supplementary Tables 3 and 4, respectively.


Genes with conserved inverted repeats are more likely to be transcriptionally active

       To determine whether conserved IRs may function to control transcription, we looked at

the percent of genes with IRs that are among the top 20% most transcriptionally active genes

using mRNA abundance, expressed as absolute cellular mRNA levels, {{114 Bernstein,B.E.

2002}} as a measure of transcriptional activity. While mRNA abundance is determined by the

rates of both transcription and mRNA degradation, it is a good approximation of transcriptional

activity for many genes. We grouped the data by region (promoter or coding region) and by

palindromic length (Fig. 2). We found that genes with conserved IRs in their promoters or

coding regions tend to be more transcriptionally active, based on mRNA levels.

This tendency increases with IR length. Genes with IRs of

palindromic length 10 or greater in their promoters are significantly more likely to be

transcriptionally active, but in coding regions genes must have an IR of at least palindromic

length 12 to be more likely to be transcriptionally active. This suggests that IRs in both

promoters and coding regions may be associated with transcriptional activity; however, the

mechanism by which IRs play a role in transcription may be different in the two regions.
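
The top-20% comparison can be illustrated with a small sketch. The gene names and abundance values below are invented, and the function name and the rule for sizing the top group (truncating to a whole number of genes) are assumptions of the sketch, not a description of our analysis pipeline.

```python
def fraction_in_top(abundance, genes_of_interest, top_fraction=0.2):
    """Fraction of the listed genes that fall in the top share by abundance."""
    ranked = sorted(abundance, key=abundance.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    hits = [g for g in genes_of_interest if g in abundance]
    return sum(g in top for g in hits) / len(hits)


# Invented mRNA abundances for ten hypothetical genes.
abundance = {f"g{k}": float(11 - k) for k in range(1, 11)}  # g1 highest
with_conserved_ir = ["g1", "g2", "g5", "g9"]
print(fraction_in_top(abundance, with_conserved_ir))
```

If IRs were unrelated to mRNA abundance, this fraction would be expected to sit near 0.2; values well above that would point to an association of the kind examined here.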


A/T inverted repeats are common in the S. cerevisiae genome, but are not more likely than
other inverted repeats to make a gene more likely to be transcriptionally active

       A/T-rich IRs are thought to be more likely to form cruciform structures than those with

high G/C content {{63 Zheng,G.X. 1988; 232 Dayn,A. 1992; 235 Courey,A.J. 1988}}, and thus

may play a stronger role in transcriptional regulation than IRs with other base compositions. We

found a large number of conserved IRs that are composed entirely of A’s and T’s. We found

6,676 A/T-only conserved IRs in 2,454 promoters and 29,270 A/T-only conserved IRs in 3,576

coding regions. In total, we found that 35,946 of the 122,147 conserved IRs are A/T-only

(29.4%). We looked at all A/T-only IRs and asked whether genes that contain them are more

likely to be transcriptionally active. Genes that had conserved A/T-only IRs in their promoters

had an average transcriptional activity of 1.32 (M=0.014), compared to an average of 1.31

(M=0.012) for genes with at least one conserved IR in their promoter and 1.28 (M=0.010) for

all genes present in the four species. Genes with conserved A/T-only IRs in their coding regions

had an average transcriptional activity of 1.27 (M=0.010) compared to an average of 1.28

(M=0.010) for genes with at least one conserved IR in their coding region and 1.28 (M=0.010)

for all genes present in the four species. Thus, we found that having an A/T-only IR in the

promoter gave genes about the same increase in the chance that they would be transcriptionally

active as did having an IR of any base composition, while having any conserved IR or an A/T-

only IR in the coding region does not alter the chances of a gene being transcriptionally active.



G/C-only inverted repeats are rare, and increase the chances of a gene being
transcriptionally active

       G/C-only IRs were not as common in the data as A/T-only IRs (2,250 of the 122,147

conserved IRs, 1.8%). The longest palindromic length we observed for conserved IRs of this

type was 10 bp in both promoters and coding regions and only one conserved 10 bp G/C-only IR

was found in coding regions. However, we wanted to determine whether the presence of these

IRs made genes more likely to be transcriptionally active. Genes that had conserved G/C-only

IRs in their promoters had an average transcriptional activity of 1.34 (M=0.027), compared to

an average of 1.31 (M=0.012) for genes with at least one conserved IR in their promoters and

1.28 (M=0.010) for all genes that are present in the four species. Genes that had conserved

G/C-only IRs in their coding regions had an average transcriptional activity of 1.40 (M=0.025),

compared to an average of 1.28 (M=0.010) for genes with any conserved IR in their coding

region and 1.28 (M=0.010) for all genes that are present in the four species. This suggests that

having a G/C-only IR in the coding region increases the chances of a gene being transcriptionally

active.



Inverted repeat position helps determine conservation and transcriptional activity

          To determine whether the location of inverted repeats within promoters or coding regions

has an effect on the likelihood of the IR being functional, we examined the locations of

conserved IRs and compared them to the locations of all S. cerevisiae IRs. We found that IRs

are distributed across the entire coding regions of genes. In S. cerevisiae, IRs are most abundant

in the first 10% of the coding region, and the number of IRs decreases slightly from the

beginning to the end of the coding region. Conserved IRs are most commonly found in the first

10% of coding regions, with a slight peak in frequency just over half way through the coding

region (Fig. 3). In S. cerevisiae promoters, IRs are most frequently found near the start site.

Their frequency decreases with distance from the start site, but the distribution has a long tail.

The frequency of conserved IRs peaks about 150 bases from the start site and decreases sharply

on both sides of the distribution (Fig. 3).

          Conservation rates in yeast promoters vary slightly across the promoter and range from

just below 0.4 to about 0.55 {{242 Chin,C.S. 2005}}. In the region from 125-150 bases before

the start codon, where we found the most conserved IRs, the conservation rate is near its highest.

We found 10,037 IRs in this region in S. cerevisiae and 1,088 of them were conserved. This is a


higher level of conservation than would be expected by chance even with a 0.55 conservation

rate in this region (p<0.001). This suggests that the increased frequency of conserved IRs that

we observe in this region is not a result of increased conservation rates.
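
A rough sense of the size of this excess can be had from a back-of-the-envelope bound. The sketch below is not the statistical test behind the reported p-value; it is an illustration that assumes every IR in the region has the minimum palindromic length of 6, which makes the chance expectation as large as it can be under a per-base conservation rate of 0.55. The function name is hypothetical.

```python
def expected_conserved_upper_bound(n_irs, base_rate, min_pal_length=6):
    """Upper bound on the chance expectation: treat every IR as the shortest."""
    return n_irs * base_rate ** min_pal_length


bound = expected_conserved_upper_bound(10_037, 0.55)
observed = 1_088
print(round(bound, 1), round(observed / bound, 1))
```

Longer IRs have an even lower chance of full conservation, so the true chance expectation would be smaller than this bound and the observed excess correspondingly larger.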

       We found a very high density of IRs, particularly in some portions of the promoters and

coding regions. While we only count IRs as distinct if they have different centers of symmetry,

many IRs do overlap. We defined a cluster of IRs as a consecutive sequence of bases, all of

which are involved in an IR. In S. cerevisiae coding regions we found an average of 33.1

clusters, while in promoters we found an average of 14.4 clusters. We found on average 5.3 IRs

per cluster in coding regions and 5.4 IRs per cluster in promoters.

       Because promoters are directly involved in determining gene transcription and the

portion of the promoters close to the start codon is most likely to alter the transcription of

genes, we wanted to determine whether there is a link between the location of conserved

promoter IRs and transcriptional activity. We determined the average locations of conserved IRs

for genes in the top and bottom 20% of mRNA abundance. In this analysis, we excluded IRs

shorter than 10 bp because unlike longer IRs, by both measures of transcriptional activity they do

not tend to be associated with transcriptionally active genes (Fig. 2). We found that in

promoters, genes in the top 20% of mRNA abundance have an average IR location of

approximately 434 (M=7.4) bases before the start site, while genes in the bottom 20% of mRNA

abundance have an average promoter IR location of approximately 499 (M=9.1) bases before

the start site. Genes with both high and low mRNA abundance have approximately the same

number of clusters and IRs per cluster in their promoters.

       We found that the average transcriptional activity of genes with at least one conserved

promoter IR at least 10 bp long in the 500 bases before the start codon is 1.48 (M=0.032),


compared to an average transcriptional activity of 1.31 (M=0.012) for all genes with conserved

IRs.

       Taken together, this suggests that transcriptionally active genes tend to have promoter

IRs closer to the beginning of the gene than less active genes do.


DISCUSSION

       Our findings with regard to IRs in S. cerevisiae are consistent with those of prior studies

{{87 Lisnic,B. 2005; 50 Schroth,G.P. 1995}}. Because we looked only at IRs with no

mismatches and relatively small non-palindromic spacer sequences, we did not detect many long

IRs. It would be expected that longer IRs would be less sensitive to mismatches and non-

palindromic spacer length in forming cruciform structures. However, these features would be

less likely to be tolerated in shorter IRs forming cruciforms. Because our focus was on these

shorter IRs, we believe that our parameters for IR searching were appropriate. However, we

recognize that there are likely to be additional long imperfect IRs that are functional.

       None of the conserved IRs we found (maximum palindromic length of 20) were long

enough to cause genomic instability. This finding is not surprising given that deleterious

sequences are less likely to be conserved between species. However, it is possible that there are

long IRs that do cause genomic instability present in all four genomes in the same location,

though with different sequences, mismatches, and/or longer non-palindromic spacer sequences.

       While conserved DNA elements have often been found to be more likely to be functional,

in the case of IRs it is possible that conservation is a result of these sequences being more

resistant to point mutations as a result of their structure. We think this is unlikely given the

relationships between IR conservation and transcriptional activity; however, if conservation is a




result of these sequences taking on a different physical structure then these data provide evidence

that the short IRs we found are able to take on alternative structures in vivo.

       Shorter conserved IRs like the ones we found in promoters could play multiple roles in

DNA metabolism. First, we know that some of these IRs are specific transcription factor binding

sites. For example, several sequence-specific dimeric transcription factors, such as leucine

zipper transcription factors, are known to bind to particular short IR sequences. We know that

this accounts for some of the conserved IRs we found, though these represent a small fraction of

the conserved IRs we detected. Second, they could be involved in DNA replication. One

inverted repeat with a palindromic length of 10 and a longer non-palindromic core than we

allowed has been shown to form a cruciform at ARS307 {{43 Callejo,M. 2002}}. In addition,

14-3-3 proteins, which are known to bind to cruciform structures, have been found to regulate

the G1/S transition in yeast {{228 Lottersberger,F. 2006}}. Third, conserved IRs found in

promoters could be involved in transcription regulation. Because many of the conserved IRs we

found are located in transcriptionally active genes, our analysis suggests that some of the IRs we

found in promoters may function in this way. It is possible, however, that we found conserved

IRs in the promoters of active genes because the promoters of these genes are more highly

conserved than the promoters of less active genes. While the neutral mutation rate does not vary

across the yeast genome, suggesting that in the yeasts we examined conserved elements are

likely to be functional, some promoters do have higher levels of conservation{{242 Chin,C.S.

2005}}. This is thought to be a result of more functional promoter elements being required for

the more complex regulation of these genes. It is possible that the IRs we found in these genes

are some of these regulatory elements.




       Our analysis suggests that conserved IRs in coding regions may also play a role in

transcriptional regulation. While our results suggest that IRs in coding regions are conserved at a

similar rate to other coding region sequences, this does not necessarily suggest that they are non-

functional. Because all coding region sequences are functional and highly conserved, coding

region IRs may be functional without being conserved at a higher rate than surrounding

sequences. Coding region IRs, like promoter IRs, may function at the DNA level. However,

unlike IRs found in promoters, it is also possible that these coding region IRs are functional at

the RNA level. When the coding regions are transcribed into RNA, the IRs are still present and

can lead to the formation of hairpin secondary structures. These structures may be important for

the function of the RNA.

       There are three possible models for how transcription might be altered by IRs and/or

cruciform formation. First, transcription leads to the formation of positive supercoils ahead of

the polymerase and negative supercoils behind the polymerase {{233 Liu,L.F. 1987}}, and

cruciform formation has been shown to accompany transcription {{232 Dayn,A. 1992}}. If

these supercoils become too extensive, polymerase is not able to continue transcribing the DNA.

The formation of cruciform structures relieves negative supercoiling at a rate of one negative

supercoil per 10.5 DNA bases that adopt a cruciform structure. Cruciform structures that relieve

negative supercoiling may help transcription occur more rapidly. Second, there may be a

balance between cruciform formation and nucleosome occupancy, as both cruciform structures

and nucleosomes relieve negative supercoiling. The formation of a cruciform may stabilize the

loss of a nucleosome. If a transcription factor binding site is covered by a nucleosome, loss of

the nucleosome could allow transcription factor binding and gene activation. Third, there are

proteins that bind to cruciform structures in a sequence-independent manner that may be



involved in transcription regulation or may recruit other proteins that are involved in regulating

transcription. Such proteins are known to exist {{43 Callejo,M. 2002}}, and their deletion

results in alterations in the transcriptional program of yeast {{143 Ichimura,T. 2004}}.

However, the ways in which they alter transcription and the precise types of cruciforms they bind

to are not yet known.

       IRs that are A/T-rich are thought to form cruciform structures more readily

{{232 Dayn,A. 1992; 63 Zheng,G.X. 1988}}. Although we found a large number of A/T-only IRs,

and these A/T-only IRs were conserved at high levels, we found that they do not make genes more

likely to be transcriptionally active. While they may not be involved in transcriptional

regulation, A/T-only or A/T-rich IRs may be important for other DNA-mediated processes. We

counted overlapping IRs separately as long as they did not share a center of symmetry. Because

we found long (up to 40 bp) stretches of AT or TA repeats, and these sequences consist of many

shorter IRs with different centers of symmetry, this may have led to our finding a misleadingly

large number of A/T-only IRs. However, overlapping IRs appear throughout the genome at

approximately the same rate, and it is unlikely that this low-complexity sequence is the

only reason for the overrepresentation of IRs in general. Even when A/T-only and G/C-only IRs

are excluded, we still observe an overrepresentation of IRs.
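
The effect of counting overlapping IRs by center of symmetry can be illustrated with a minimal Python sketch. It is not the detection algorithm used in this study: the function name `longest_ir_per_center` is hypothetical, spacers are ignored (only spacer-free IRs are considered), and the minimum palindromic length of 6 is assumed from the shortest length reported in the tables. For each center, the sketch keeps only the longest IR.

```python
# Watson-Crick complements used to test reverse-complement symmetry.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def longest_ir_per_center(seq, min_palindromic_length=6):
    """Map each center of symmetry to the longest spacer-free IR around it.

    A center is the boundary between seq[center - 1] and seq[center].
    Assumption: spacers are ignored, so only perfect palindromes are found.
    """
    hits = {}
    n = len(seq)
    for center in range(1, n):
        arm = 0
        # Extend outward while the left base pairs with the right base.
        while (center - 1 - arm >= 0 and center + arm < n
               and COMPLEMENT.get(seq[center - 1 - arm]) == seq[center + arm]):
            arm += 1
        if 2 * arm >= min_palindromic_length:
            hits[center] = 2 * arm
    return hits

at_repeat = "AT" * 20  # a 40 bp stretch of AT repeats
hits = longest_ir_per_center(at_repeat)
print(len(hits), max(hits.values()))  # 35 40
```

Under these assumptions a single 40 bp AT repeat contributes 35 IRs with distinct centers, which shows how low-complexity sequence can inflate the number of A/T-only IRs.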

       We found a very small number of G/C-only IRs in the promoters and coding regions of

genes, and found that G/C-only IRs in coding regions tend to be associated with more

transcriptionally active genes. Genes in G/C-rich isochores tend to be more transcriptionally

active {{240 Dekker,J. 2007}}, so the presence of G/C-only IRs in active genes could be a result

of higher G/C content in many active genes.




       In determining the locations of IRs that are conserved, and the locations of IRs in

transcriptionally active genes, we found that both conserved promoter IRs and promoter IRs in

transcriptionally active genes tend to be closer to the transcription start site. However, conserved

IRs are found in large numbers across the entire coding region with two peak locations for

conserved IRs. While IRs in promoters are likely most functional when they are closer to the

beginning of the coding region, in coding regions there are likely multiple functions of IRs that

are different depending on IR location.




REFERENCES




Table 1: Examples of inverted repeats and conserved inverted repeats that meet our definitions.
Sample IR        Palindromic length   Spacer length   Sample conserved IR found in another species
GTACAaagTGTAC    10                   3               GTACAtctTGTAC
AATTAATT         8                    0               AATTAATT
Palindromic portions are in capital letters, non-palindromic spacers are in lower case letters.

Table 2. Sum of inverted repeats of each palindromic length found in the promoters and coding
regions of the S. cerevisiae genome.
Palindromic length             6           8          10          12          14          16          18          20
Promoter                  258828       83143       26886        8078        2625         837         338         351
                         p=1.000     p=1.000     p=0.998     p=0.111     p=0.000     p=0.000     p=0.000     p=0.000
                      (280529.3)   (90771.1)   (28435.2)    (8632.5)    (2544.1)     (736.6)     (209.6)      (81.6)
Coding                    675888      195791       56483       15286        4187        1148         310         141
                         p=0.000     p=0.000     p=0.003     p=0.745     p=0.021     p=0.033     p=0.009     p=0.000
                      (594870.0)  (181736.1)   (53333.7)   (15106.7)    (4156.8)    (1120.5)     (297.9)     (107.0)
Significance was determined by directly calculating the percentage of 1000 randomly generated
genomes (using a Monte Carlo Markov chain method; see Methods) which contained values
larger or smaller than those shown. Observed values that are significantly greater than those of
random genomes are shown in bold italics, while observed values that are significantly less than
those of random genomes are shown in bold. The p-value represents the fraction of random
genomes whose counts are greater than the observed value. The expected number of inverted
repeats based on 1000 random genomes is shown in parentheses.
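
The p-value described above is an empirical fraction, which can be sketched in Python as follows. This is an illustration only: the function name `empirical_p_greater` is hypothetical, and the toy counts are made up and are not data from this study. Generating the random genomes themselves is described in Methods and is not reproduced here.

```python
def empirical_p_greater(observed, simulated_counts):
    """Fraction of simulated counts that are strictly greater than the observed count."""
    if not simulated_counts:
        raise ValueError("need at least one simulated count")
    greater = sum(1 for count in simulated_counts if count > observed)
    return greater / len(simulated_counts)

# Made-up counts standing in for random genomes (not data from the paper).
toy_random_counts = [100, 105, 98, 110, 95]
print(empirical_p_greater(104, toy_random_counts))  # 2 of 5 counts exceed 104 -> 0.4
```

With this definition, a value near 0 indicates that the observed count is higher than in nearly all random genomes, and a value near 1 indicates that it is lower.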




Table 3. Conservation of inverted repeats.
Palindromic length              6            8           10           12           14           16           18           20
Promoter                    12906         2688          732          158           42           20            4            0
                         (1002.8)       (56.5)        (3.2)       (0.18)  (9.8x10^-3)  (5.3x10^-4)  (4.1x10^-5)  (2.9x10^-6)
Coding                      87028        15371         2740          376           62           14            5            1
                        (64305.0)     (9203.2)     (1301.7)      (173.6)       (23.2)        (3.2)       (0.43)      (0.029)
Observed number of conserved inverted repeats. Expected number of conserved inverted repeats
based on conservation rates of 0.419 for promoters and 0.701 for coding regions {{86 Kellis,M.
2003}} is in parentheses. No conserved inverted repeats with palindromic lengths longer than 20
bases were found. Observed values that are significantly greater than the corresponding
expected values are shown in bold italics. No observed values were found to be significantly less
than those of random genomes. Significance (p<0.001) was determined using a binomial
distribution.
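
A binomial test of this kind can be sketched in Python as below. The sketch rests on assumptions that are not stated in the table: each IR is treated as an independent trial, and the per-IR conservation probability is taken as the expected count divided by the total number of IRs of that length. The function names are hypothetical, and the tail is summed in log space to avoid underflow.

```python
import math

def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return math.exp(peak) * sum(math.exp(v - peak) for v in logs)

# Coding-region IRs with palindromic length 12: 15286 IRs in total,
# 376 observed conserved, 173.6 expected conserved (values from the tables).
n_irs, observed, expected = 15286, 376, 173.6
p_conserved = expected / n_irs  # assumed per-IR conservation probability
print(binom_upper_tail(observed, n_irs, p_conserved) < 0.001)  # True
```

Under these assumptions the upper-tail probability for this column falls far below the 0.001 threshold; the sketch does not attempt to reproduce how the expected counts themselves were derived.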




Figure Legends
Figure 1. Algorithm for inverted repeat detection. (A) Phase 1 determines whether an IR is

present and (B) Phase 2 detects the longest IR.


Fig. 1: Examples of inverted repeats. Palindromic portions are in capital letters, non-palindromic
cores are in lower case letters. Examples of inverted repeats that meet our definition (A) and an
example of a conserved inverted repeat (B).

Figure 2. Promoters (A) and coding regions (B) were grouped by their inverted repeat with the
longest palindromic length. The fraction of each of these groups of genes that overlaps with
genes in the top 20% of mRNA abundance was determined. For each of these comparisons,
hypergeometric p-values were computed. Palindromic lengths for which the fraction of genes in
the top 20% of transcriptional activity is significantly higher (p<0.01) than the expected 0.2 are
marked with an *.
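
The hypergeometric comparison described in this legend can be sketched in Python as follows. The function name `hypergeom_upper_tail` and the gene counts in the example are hypothetical and are not values from this study; the sketch only shows the upper-tail probability of seeing at least a given overlap with the top 20% of genes.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the top set (here, the top 20% of mRNA abundance)."""
    total = comb(N, n)
    upper = min(K, n)
    # comb() returns 0 when more non-top genes are requested than exist.
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(max(k, 0), upper + 1))
    return favorable / total

# Hypothetical example: 6000 genes, 1200 in the top 20%, and a group of
# 50 genes of which 20 fall in the top set (expected overlap would be 10).
p_value = hypergeom_upper_tail(20, 6000, 1200, 50)
print(p_value < 0.01)
```

Exact integer arithmetic is used so that the large binomial coefficients do not lose precision before the final division.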

Figure 3. Histograms of IR location in coding regions (A and B) and promoters (C and D) are
shown for S. cerevisiae IRs (A and C) and conserved IRs (B and D). Promoter IRs are only
shown for the first 2000 bases before the start codon.



