
ANALYSIS OF DNA SEQUENCE IN THE FRACTAL PERSPECTIVE: THE CHAOS GAME AND FRACTIONAL BROWNIAN MOTION

Yoon-jung Choi
MAT335H Term Project
Professor: Randall Pyke
Submission: May 20, 2003
University of Toronto, Department of Mathematics

ABSTRACT

Various mathematical methods have been applied to investigate the nature of DNA sequences. The chaos game representation of DNA sequences has been reported to produce a unique pattern consistently over different parts of the genome of an organism. From the image generated by the chaos game, characteristics of a DNA sequence can be studied, such as associations between pairs of letters. The concept of fractional Brownian motion has also been applied to DNA sequences, leading to the discovery of long-range correlation in DNA sequences. However, in order to fully understand the pattern in a DNA sequence, application of more than one method is desirable. In this report, the upstream region (31,375bp) of the human serotonin receptor 2A gene (HTR2A) was analyzed by using both the chaos game and fractional Brownian motion. The two methods complement each other, and here I suggest a novel perspective for interpreting the data obtained from the chaos game and fractional Brownian motion.

INTRODUCTION

DNA, deoxyribonucleic acid, is composed of an extremely long array of nucleotides. Each nucleotide contains one of four bases: adenine (A) and guanine (G), which are purines (double-ring structure), and cytosine (C) and thymine (T), which are pyrimidines (single ring). An example of a DNA sequence is …GTGATAGGGTCTCACTCTGT… This sequence of letters can be converted to a quaternary number sequence by changing T into 1, A into 2, C into 3, and G into 4, giving …41421244413132313141…. The length of the DNA sequence in a genome varies depending on species. For example, the genomic DNA sequence of a fruit fly is approximately 120Mbp long (120 x 10^6 letters), that of a mouse is about 2900Mbp, and that of a human is about 3213Mbp.
This genomic sequence is what is contained in the whole set of chromosomes in the nucleus of a single cell. It is remarkable that the DNA sequence contained in a cell dictates the development of a complete, mature organism from one single cell. Scientists have attempted to decipher the structure and meaning of DNA sequences; however, no consensus has been reached and opinions diverge. A prevalent method for DNA analysis is based on the random walk, or Brownian motion, which led to the discovery of long-range correlation in DNA sequences (Peng et al., 1992; Voss, 1992; Chatzidimitriou-Dreismann et al., 1994). Another, more recent approach applies the chaos game (Jeffrey, 1990; Deschavanne et al., 1999; Almeida et al., 2001). However, these two methods appear to run in parallel without a point of contact.

In the first part of this report, the application of the chaos game to a selected DNA sequence is described. The chaos game representation of the DNA sequence led to the application of fractional Brownian motion, which is explained in the second part. In the third part, the procedure for coordinating the data from the two independent methods is elaborated, and the relevance of this coordination to DNA analysis is emphasized in the discussion section.

The sequence studied in this report was the 5' region (31,375bp) of the human serotonin receptor 2A gene (HTR2A), extracted from the genomic sequence available at GenBank (see References). Various medical studies have proposed that HTR2A is associated with schizophrenia, bipolar disorder, seasonal affective disorder, and suicidal behaviors (see References, Data source).

I. APPLICATION OF CHAOS GAME TO ANALYSIS OF A DNA SEQUENCE

Drawing Sierpinski Triangle by Chaos Game

The Sierpinski triangle is a fractal structure; that is, a part of the structure resembles the entire structure, so it is self-similar (Fig.1). One way to draw a fractal structure is by playing the chaos game.
First, an equilateral triangle with the numbers 1, 2, and 3 written at its corners is used as a game board, and an arbitrary point inside or outside the triangle is used as the initial game point $z_0$ (Fig.1). Then a number 1, 2, or 3 is chosen at random, for example by rolling a die with each number written twice. Suppose 1 was chosen. Then we move $z_0$ to the midpoint between corner 1 and $z_0$, generating the next game point $z_1$. Similarly, $z_2$ is obtained by moving $z_1$ to the midpoint between $z_1$ and the corner for the next number chosen, say 3. Repeating this procedure eventually produces the image of the Sierpinski triangle.

The principle of the chaos game is that the current point is determined from the previous point by a function $w_i$:

$$z_{k+1} = w_i(z_k),$$

where $z_k$ denotes the game point at step $k$, and $w_i$ is the iterated function that describes the movement of the game points. Each game point $z_k = w_k(w_{k-1}(\cdots w_2(w_1(z_0))\cdots))$ is assigned an address $s_k s_{k-1} \cdots s_2 s_1$, where $s_n$ is the number chosen from the die at the $n$th step. Thus, playing $k$ steps of the chaos game generates an address $k$ numbers long, which is a ternary number sequence. As mentioned in the introduction, a DNA sequence can be represented by a quaternary number sequence. The chaos game can be modified so that four numbers are used to generate an image for a DNA sequence.

Fig.1. The chaos game for the Sierpinski triangle (the third image). Panel labels: $z_{k+1} = w_i(z_k)$; $z_k$ has an address of length $k$; example ternary sequences 113213 and 21321323312….213213213.

Chaos Game Representation of a DNA Sequence

Although various methods can be developed for the chaos game representation of DNA sequences, the following is the most prevalent method reported in the literature. The corners of a 1-by-1 square are labelled T at (0,0), A at (0,1), C at (1,1), and G at (1,0), and the initial point $z_0$ is placed at (0.5, 0.5) (Almeida et al., 2001).
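Before turning to the DNA version, the random three-corner game for the Sierpinski triangle described above can be sketched in Python. This is only an illustration: the corner coordinates, the starting point, and the function names `chaos_game` and `address` are assumptions made for the example and do not come from the applets used in this report.

```python
import random

# Corners of an equilateral triangle labelled 1, 2, 3 as in Fig.1.
# The coordinates are an assumption; any non-degenerate triangle works.
CORNERS = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.5, 3 ** 0.5 / 2)}


def chaos_game(n_steps, z0=(0.3, 0.3), seed=0):
    """Play n_steps of the chaos game; return the points z1..zn and choices s1..sn."""
    rng = random.Random(seed)
    x, y = z0
    points, choices = [], []
    for _ in range(n_steps):
        s = rng.choice((1, 2, 3))            # fair three-way "die"
        cx, cy = CORNERS[s]
        x, y = (x + cx) / 2, (y + cy) / 2    # move to the midpoint
        points.append((x, y))
        choices.append(s)
    return points, choices


def address(choices):
    """Address s_k s_{k-1} ... s_2 s_1 of the last point (most recent choice first)."""
    return "".join(str(s) for s in reversed(choices))


points, choices = chaos_game(10000)
```

Plotting `points` as a scatter plot should reproduce the familiar triangle; `address(choices[:k])` gives the ternary address of the $k$th game point.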
Unlike the chaos game for the Sierpinski triangle, where the number is selected at random, the chaos game for a DNA sequence is played according to the given DNA sequence. For example, for the sequence ATGCGAGT…, the first game point $z_1$ is obtained by moving $z_0$ = (0.5, 0.5) to the midpoint between $z_0$ and A(0,1) (Fig.2). Likewise, $z_2$ is drawn at the midpoint between T(0,0) and $z_1$. In general,

$$z_{k+1} = w_i(z_k) = z_k + \tfrac{1}{2}\,[\,q_{k+1} - z_k\,],$$

where $q_{k+1}$ is the position of one of the four letters at step $k+1$. For example, $q_5$ would be G(1,0) in this case. As described earlier, $z_k$ is assigned an address $s_k s_{k-1} \cdots s_2 s_1$; thus, for the sequence above, $z_8$ has the address TGAGCGTA.

Fig.2. The chaos game for a DNA sequence (Almeida et al., 2001). Labels: $z_k$, game point at step $k$; $q_k$, fixed point corresponding to $s_k$ for the sequence $s_1 s_2 \ldots s_{k-1} s_k$ (here ATGCGAGT…).

Fig.3a shows the result of the chaos game for 31,375bp of the serotonin receptor 2A gene, obtained with the Dnacgr (Chaos Game Representation of DNA sequence) program (see References). This image is remarkably similar to the ones reported in the literature. The chaos game images of the human globin region (73,357bp) (Jeffrey, 1990), human intron sequences (Solovyev, 1993), and a randomly selected human DNA sequence 100Kbp long (Deschavanne et al., 1999) resemble the image shown in Fig.3a. Interestingly, Deschavanne et al. (1999) found in other species that the images obtained from parts of a genome presented the same structure as that of the whole genome. They also showed that different organisms exhibit different patterns in the images.

As mentioned by Deschavanne et al. (1999), a closer look at Fig.3a reveals two major features of the DNA sequence. First, the empty patches indicate that the areas which have GC in their addresses have notably low density. Because an address lists the letters in reverse order, this means that the probability of CG occurring along the sequence is very low. The particular shape of the empty patches is repeated at different scales.
In particular, the sub-quadrants T, A, and C resemble the entire structure. Yet the image as a whole is not a fractal in the precise sense, because sub-quadrant G is not self-similar. For an image to be a fractal, any part of the image should regenerate the entire structure when rescaled. Second, the diagonal lines imply the prevalence of AA, AG, GA, GG, TT, CT, TC, and CC (Fig.3a). Yet these features do not directly tell us whether the sequence is random, even though they indirectly suggest non-randomness. To answer this question more clearly, and to provide explicit evidence, further experiments were performed.

Interpretation of the Image from the Chaos Game for the DNA sequence

The probability of each letter can also be calculated with the Dnacgr program. The input sequence had probability 0.3127 for A, 0.1887 for C, 0.2925 for T, and 0.2061 for G. In order to test the role of these different probabilities in generating the distinct pattern, the sequence was shuffled and used as input for the program. Note that shuffling does not alter the probability of each letter; it only breaks the order of letters in the sequence. Fig.3b shows the image generated from the shuffled sequence, which is very different from the original image: the empty patches and diagonal lines have disappeared. The gradient that remains is due to the different probabilities. Since disrupting the order of letters destroys the original image, the sequence must have certain patterns in its order and is therefore not random. If the original sequence were random, the image from the shuffled sequence would have been the same as that from the original sequence. In fact, when the shuffled sequence was shuffled again, the resulting image was similar to the image obtained from the shuffled sequence, which implies that the order of letters in the shuffled sequence was indeed random (image not shown).
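The shuffling experiment can be mimicked in a few lines of Python. The sketch below is not the Dnacgr program; it only follows the midpoint rule and corner layout given above (T at (0,0), A at (0,1), C at (1,1), G at (1,0)). The helper names `cgr_points`, `shuffled`, and `dinucleotide_counts` are hypothetical, and the short sequence used is the example from the introduction rather than the HTR2A data.

```python
import random
from collections import Counter

# Corner layout of the unit square, as in Almeida et al. (2001).
CORNERS = {"T": (0, 0), "A": (0, 1), "C": (1, 1), "G": (1, 0)}


def cgr_points(seq, z0=(0.5, 0.5)):
    """Game points z1..zn: each step moves halfway towards the letter's corner."""
    x, y = z0
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def shuffled(seq, seed=0):
    """Random permutation of seq: same letter counts, order destroyed."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def dinucleotide_counts(seq):
    """Counts of overlapping two-letter words such as 'CG' (sequence order)."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


original = "GTGATAGGGTCTCACTCTGT"
control = shuffled(original)
assert Counter(control) == Counter(original)   # letter probabilities unchanged
```

Comparing `dinucleotide_counts(original)` with `dinucleotide_counts(control)` on a long sequence gives a numerical counterpart to comparing Fig.3a with Fig.3b: shuffling keeps the single-letter counts but not the pair counts.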
To further ensure that the shuffled sequence had a random order of letters, the known probabilities were entered into the full square chaos game applet (see References), which uses random numbers to generate game points (Fig.4). The image was similar to the previous two images obtained from the shuffled and reshuffled sequences. Since a random order of letters with the fixed probabilities abolishes the original pattern, the factor primarily responsible for the original image must be the order rather than the probability of each letter.

Fig.3. The chaos game image of the human serotonin receptor 2A region (31,375bp). a: original sequence, b: shuffled sequence. Note A(0,1), C(1,1), T(0,0), G(1,0).

Fig.4. Full square chaos game with random numbers, with probabilities fixed as in the original sequence. Probability of T: 0.2998, A: 0.3004, C: 0.1968, and G: 0.2030. Corner labels: 1(T), 2(A), 3(C), 4(G).

Then how can we test (instead of assuming) that the empty patches and diagonal lines result from biased association between two letters? Using a modified chaos game applet (see References), '34', representing 'CG', was entered as the substring so that CG was eliminated from the chaos game. This applet uses random numbers from 1 to 4, with the probabilities adjusted as before. As a result, the modified chaos game imitated the pattern of empty patches in the original image. Thus, the empty patches can indeed be attributed to CG depletion in the sequence. However, the diagonal lines were still not present in the image from this simulation. Demonstrating the reason for the diagonal lines is more difficult with the chaos game, and an alternative method is introduced in the next section.

The chaos game can provide a quick overview of the characteristics of a given sequence; however, it has a limitation in interpreting a sequence. Essentially, the image from the chaos game does not tell us the order of the game points.
For example, when the sequence from the second half of the serotonin receptor 2A gene (31376bp) was tested with the chaos game, the image was indistinguishable from the original image, which used the first half of 5HT2R (data not shown). The two sequences were found to be completely different when tested with Clustal W (see References), a program that aligns different sequences in parallel and matches the common letters. Thus, having the same image does not mean that the sequences are similar, although the sequences must share certain characteristics. Therefore, a more rigorous approach was required to trace the order of letters in the sequence, such as plotting the DNA sequence in a time series format (Peng et al., 1992).

Fig.5. Modified chaos game (full square) with the fixed probabilities and substring 34. Corner labels: 1(T), 2(A), 3(C), 4(G).

II. APPLICATION OF FRACTIONAL BROWNIAN MOTION TO ANALYSIS OF A DNA SEQUENCE

DNA walk

The motion of a Brownian particle consists of steps of a characteristic length in a random direction; thus, it is also called a random walk (Feder, 1988). Suppose the particle moves on the x-axis by jumping $+\xi$ or $-\xi$ every $\tau$ seconds; then its movement can be plotted as time proceeds. Likewise, a DNA sequence can be plotted in the form of a time series, but with the x-axis representing position along the DNA sequence instead of time (Peng et al., 1992). In this way the profile of letters is preserved along the sequence, unlike in the chaos game. The DNA Walker program (see References) was used to generate the plot of the 'random' walk of the given sequence (Fig.6). For the one-dimensional DNA walk, the purine-pyrimidine skew scale was used, in which G and A (purines) move +1, and C and T (pyrimidines) move -1. The original sequence was plotted first, followed by the shuffled sequence, and the two plots were then placed together for easier comparison. The shuffled sequence tended to oscillate closer to the x-axis than the original sequence, which drifted further down and then moved back towards the x-axis.
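The one-dimensional walk described above is simply a running sum of +1 and -1 steps. A minimal sketch follows; it is not the DNA Walker program, and the names `STEP` and `dna_walk` are made up for illustration. The input is the short example sequence from the introduction.

```python
from itertools import accumulate

# Purine-pyrimidine skew scale: purines (A, G) step +1, pyrimidines (C, T) step -1.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(seq):
    """Cumulative position after each letter of seq."""
    return list(accumulate(STEP[base] for base in seq))


walk = dna_walk("GTGATAGGGTCTCACTCTGT")
# The extremes of the walk are what is later read off the plot as max and min.
walk_range = max(walk) - min(walk)
```

For a real sequence, plotting `walk` against the letter index gives a curve of the kind shown in Fig.6.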
The tendency to move constantly up or constantly down suggests that purines tend to be followed by purines, and pyrimidines by pyrimidines. This agrees with the game points that are concentrated along the diagonal lines in the chaos game. The regions of the diagonal lines contain the addresses AA, GG, AG, GA (purine pairs) or CC, TT, TC, CT (pyrimidine pairs). Then, how can this difference between the original and shuffled sequences be represented numerically? Borovik et al. (1994) showed the presence of long-range correlation in DNA sequences by R/S analysis of the DNA walk.

Fig.6. The plots of the DNA walk of the original sequence (bottom, red) and the shuffled sequence (top, blue) (31,375bp) show cumulative movements as the sequence proceeds: +1 for A, G (purines) and -1 for T, C (pyrimidines). Axes: X(l) (x10^2) against l (x10^4, nucleotide distance). For the original sequence, the maximum is at 225 and the minimum is at -520 along the y-axis. For the shuffled sequence, max: 250, min: -50.

R/S Analysis and Hurst Exponent

R/S analysis, or rescaled range analysis, was introduced by Hurst, who spent his lifetime studying the Nile River and water storage (Feder, 1988). Let $\xi(t)$ be the annual discharge of water from a dam in year $t$. Let $X(t,\tau)$ be the accumulated departure of $\xi(t)$ from the mean $\langle\xi\rangle_\tau$ over the period $\tau$ ($\tau = t_2 - t_1$):

$$X(t,\tau) = \sum_{u=1}^{t} \{\xi(u) - \langle\xi\rangle_\tau\}. \tag{1}$$

Then the range $R$ is the difference between the maximum and minimum amounts of water contained in a sufficiently large dam that never empties or overflows (Fig.7). This can be written as

$$R(\tau) = \max_{1 \le t \le \tau} X(t,\tau) - \min_{1 \le t \le \tau} X(t,\tau). \tag{2}$$

On the other hand, $S$, the standard deviation, is written as

$$S = \left( \frac{1}{\tau} \sum_{u=1}^{\tau} \{\xi(u) - \langle\xi\rangle_\tau\}^2 \right)^{1/2}. \tag{3}$$

From data on natural phenomena such as river discharges, lake levels, and rainfall, Hurst found empirically that there is a relation between the rescaled range, R/S, and an exponent K, now called the Hurst exponent, H, such that

$$R/S = (\tau/2)^{H}. \tag{4}$$

The data that Hurst collected from natural phenomena gave a mean H of about 0.73. On the other hand, data generated by a statistically independent process gave H = 0.5.

Fig.7. a: Sketch of a reservoir with an influx of $\xi(t)$. The range, R, is the difference between the maximum and the minimum contents of the reservoir. b: Lake Albert annual discharge $\xi(t)$ (dotted line), and accumulated departure from the mean discharge, X(t) (solid line). The range is indicated by R (Feder, 1988).

Fractional Brownian Motion

Introduced by Mandelbrot, fractional Brownian motion is a generalization of X(t) obtained by replacing H = 1/2 with 0 < H < 1, where X(t), the position of a Brownian particle, is a random function of time t (Feder, 1988). For $X(t) - X(t_0) \sim |t - t_0|^{H}$, H = 1/2 corresponds to ordinary Brownian motion, in which the displacement of the particle is independent of previous displacements and is thus a random process. On the other hand, when H > 1/2, the displacement of the particle is positively influenced by the displacement in the past. That is, if the particle moved $+\xi$ at step i, it tends to move $+\xi$ as well at step i+1. If the particle moved $-\xi$ previously, it is likely to move $-\xi$ in the next step. This type of behavior is called persistence. For H < 1/2 we have antipersistence, where the particle tends to move in the direction opposite to the previous displacement (Table 1).

Hurst exponent   Character of particle   Displacement   Displacement
                 movement (average)      at step i      at step i+1
H > 1/2          persistence             positive       positive
                                         negative       negative
H = 1/2          independence            no correlation (Brownian motion)
H < 1/2          antipersistence         positive       negative
                                         negative       positive

Table 1. Properties of particle movement according to the value of the Hurst exponent.

R/S Analysis of the DNA Sequence: Estimation of Hurst Exponent

Applying this concept to a DNA sequence, the Hurst exponent can be calculated from the DNA walk. The same principle introduced earlier is applied to the DNA sequence.
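As a concrete reading of definitions (1) to (3), the rescaled range of a single window of steps can be computed as below. This is a sketch under the stated definitions only; the function name `rescaled_range` is an assumption, and a window whose steps are all identical (S = 0) is rejected because R/S is then undefined.

```python
from math import sqrt


def rescaled_range(xi):
    """R/S for one window xi(1), ..., xi(tau), following (1)-(3)."""
    tau = len(xi)
    mean = sum(xi) / tau

    # (1): accumulated departure from the mean, X(t, tau), for t = 1..tau
    departures = []
    total = 0.0
    for value in xi:
        total += value - mean
        departures.append(total)

    # (2): range of the accumulated departures
    r = max(departures) - min(departures)

    # (3): standard deviation of the steps
    s = sqrt(sum((value - mean) ** 2 for value in xi) / tau)
    if s == 0:
        raise ValueError("all steps are identical, so R/S is undefined")
    return r / s
```

For a DNA walk, `xi` would be the list of +1 and -1 steps, so that the mean is close to zero and S is close to 1 whenever purines and pyrimidines are nearly balanced.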
From (1),

$$X(s,l) = \sum_{u=1}^{s} \{\xi(u) - \langle\xi\rangle_l\},$$

where $s$ is the position of a letter on the sequence of $l$ letters, and

$$\langle\xi\rangle_l = \frac{1}{l} \sum_{s=1}^{l} \xi(s).$$

Calculated from Table 2, the sum of movements for the entire sequence of length $l$ is

$$\sum_{s=1}^{l} \xi(s) = 9424 + 6371 - 9405 - 6175 = 215.$$

Nucleotide   Probability   Number of occurrences (bp)   Movement
A            0.3004        9424                         +1
G            0.2030        6371                         +1
T            0.2998        9405                         -1
C            0.1968        6175                         -1
Total        1             31375

Table 2. Probability, number of occurrences (bp), and movement of each nucleotide.

Then,

$$\langle\xi\rangle_l = \frac{1}{l} \sum_{s=1}^{l} \xi(s) = \frac{215}{31375} = 0.0068526 \approx 0, \tag{5}$$

thus,

$$X(s,l) = \sum_{u=1}^{s} \{\xi(u) - \langle\xi\rangle_l\} \approx \sum_{u=1}^{s} \{\xi(u) - 0\} = \sum_{u=1}^{s} \xi(u). \tag{6}$$

From (2) and (3), with the corresponding change of letters,

$$R(l) = \max_{1 \le s \le l} X(s,l) - \min_{1 \le s \le l} X(s,l), \qquad S = \left( \frac{1}{l} \sum_{s=1}^{l} \{\xi(s) - \langle\xi\rangle_l\}^2 \right)^{1/2}. \tag{7}$$

From (6),

$$R(l) = \max_{1 \le s \le l} \sum_{u=1}^{s} \xi(u) - \min_{1 \le s \le l} \sum_{u=1}^{s} \xi(u).$$

Since $\sum_{u=1}^{s} \xi(u)$ is the position of letter $s$ along the y-axis, R(l) is equivalent to the difference between the maximum point and the minimum point on the DNA walk; thus, from Fig.6, R(l) = 225 - (-520) = 745. From (5) and (7), and because $\xi(s)^2 = 1$ for every letter,

$$S \approx \left( \frac{1}{l} \sum_{s=1}^{l} \{\xi(s) - 0\}^2 \right)^{1/2} = \left( \frac{1}{l} \sum_{s=1}^{l} \xi(s)^2 \right)^{1/2} = \left( \frac{1}{31375} \cdot 31375 \right)^{1/2} = 1. \tag{8}$$

From (4) and (8),

$$R/S = R(l) = (l/2)^{H}.$$

Consequently,

$$H = \frac{\log R(l)}{\log (l/2)} = \frac{\log 745}{\log (31375/2)} = 0.685.$$

For the shuffled sequence (Fig.6), R(l) = 250 - (-50) = 300. Thus,

$$H = \frac{\log R(l)}{\log (l/2)} = \frac{\log 300}{\log (31375/2)} = 0.590.$$

In fact, estimation of the Hurst exponent by R/S analysis is more laborious than this naïve estimation. The R/S value is calculated for window lengths $l$, $l/2$, $l/4$, …, $l/2^n$, and for each division of $l$ the average R/S over the windows is calculated. A linear regression line is then obtained by plotting log(R/S) versus log $l$, and the slope of this line is the estimated Hurst exponent (Kaplan, 2003). Instead of calculating every step manually, the Hurst exponent was estimated automatically with the SELFIS (SELF-similarity analysis) program (see References). The input data was converted from a letter sequence to a number sequence in which purines (A, G) were converted to 1, and pyrimidines (T, C) were converted to -1.
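Both the naïve single-scale estimate and the multi-scale regression just outlined can be sketched in Python. This is an illustration of the procedure as described in the text, not the algorithm implemented in SELFIS; the function names, the halving scheme with non-overlapping windows, and the `min_window` cut-off are assumptions of this sketch.

```python
from math import log, sqrt


def naive_hurst(r_over_s, length):
    """Single-scale estimate from R/S = (l/2)^H."""
    return log(r_over_s) / log(length / 2)


def rescaled_range(xi):
    """R/S of one window, or None when S = 0 (all steps identical)."""
    n = len(xi)
    mean = sum(xi) / n
    s = sqrt(sum((v - mean) ** 2 for v in xi) / n)
    if s == 0:
        return None
    total, low, high = 0.0, float("inf"), float("-inf")
    for v in xi:
        total += v - mean                 # accumulated departure from the mean
        low, high = min(low, total), max(high, total)
    return (high - low) / s


def hurst_by_regression(xi, min_window=8):
    """Slope of log(mean R/S) against log(window length) for l, l/2, l/4, ..."""
    xs, ys = [], []
    n = len(xi)
    while n >= min_window:
        windows = [xi[i:i + n] for i in range(0, len(xi) - n + 1, n)]
        values = [v for v in map(rescaled_range, windows) if v is not None]
        if values:
            xs.append(log(n))
            ys.append(log(sum(values) / len(values)))
        n //= 2
    if len(xs) < 2:
        raise ValueError("need at least two window sizes for a regression")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope_num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope_den = sum((x - mx) ** 2 for x in xs)
    return slope_num / slope_den


# Values read from Fig.6: R = 745 (original) and R = 300 (shuffled), with S = 1.
h_original = naive_hurst(745, 31375)   # about 0.685
h_shuffled = naive_hurst(300, 31375)   # about 0.590
```

A strictly alternating walk such as `[1, -1] * 512` has R/S = 1 at every scale, so the regression slope is zero, which is the extreme antipersistent case.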
As a result, SELFIS gave H = 0.639 for the original sequence and H = 0.553 for the shuffled sequence, which shows that the naïve estimates were somewhat high, yet of similar magnitude (Fig.8).

Fig.8. Estimation of the Hurst exponent for the 31375bp DNA sequence of the serotonin receptor 2A region by SELFIS. a: H = 0.639 for the original sequence, b: H = 0.553 for the shuffled sequence.

These values suggest that persistence exists in the original sequence at a greater level than in the shuffled sequence. The H of the shuffled sequence is closer to the theoretical value H = 1/2 for ordinary Brownian motion. This observation is relevant to both the DNA walk and the chaos game. Constant downward or upward displacement over a long range of the sequence in the DNA walk and the diagonal lines in the chaos game can be explained by persistence, which is strongly supported by the numerical value H > 1/2. Also, these values are comparable with the published H values for DNA sequences (Table 3). Sequences of random characters show H closer to 1/2.

Sequence                                              H       Sequence in comparison                           H       Reference
human beta-cardiac myosin heavy chain gene            0.67    human beta-cardiac myosin heavy chain cDNA       0.49    Peng et al., 1992
human beta globin, purine-pyrimidine representation   0.708   human beta globin, (A,C)-(G,T) representation    0.515   Borovik et al., 1994
synthetic model sequence                              0.655   random noncorrelated sequence                    0.517   Borovik et al., 1994
human serotonin receptor 2 gene                       0.639   human serotonin receptor 2 gene, shuffled        0.553   this report

Table 3. Comparison between H values published for various sequences and the H value measured in this report.

III. COORDINATING THE DATA FROM CHAOS GAME AND DNA WALK

Expression of Game Points in the Chaos Game

The iterated functions for the DNA chaos game can be written separately for each nucleotide.
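$$T:\ w_1\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad A:\ w_2\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 0 \\ 1/2 \end{pmatrix},$$

$$C:\ w_3\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}, \qquad G:\ w_4\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 1/2 \\ 0 \end{pmatrix}.$$

In general,

$$w_i\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_i \\ b_i \end{pmatrix}, \tag{9}$$

where $(a_i, b_i) = (0,0)$ for $i = 1$ (T), $(0,1)$ for $i = 2$ (A), $(1,1)$ for $i = 3$ (C), and $(1,0)$ for $i = 4$ (G).

At the $k$th step, the game point $z_k$ can be expressed as

$$\begin{pmatrix} x_k \\ y_k \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_{k-1} \\ y_{k-1} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_k} \\ b_{i_k} \end{pmatrix}.$$

Hence, $z_k = w_k(w_{k-1}(\cdots w_2(w_1(z_0))\cdots))$.

For $k = 1$, with $(x_0, y_0) = (1/2, 1/2)$,

$$\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_0 \\ y_0 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix} = \begin{pmatrix} 1/2^2 \\ 1/2^2 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix}.$$

For $k = 2$,

$$\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}\begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix} = \begin{pmatrix} 1/2^3 \\ 1/2^3 \end{pmatrix} + \frac{1}{2^2}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix}.$$

For $k = n$,

$$\begin{pmatrix} x_n \\ y_n \end{pmatrix} = \begin{pmatrix} 1/2^{n+1} \\ 1/2^{n+1} \end{pmatrix} + \frac{1}{2^n}\begin{pmatrix} a_{i_1} \\ b_{i_1} \end{pmatrix} + \frac{1}{2^{n-1}}\begin{pmatrix} a_{i_2} \\ b_{i_2} \end{pmatrix} + \cdots + \frac{1}{2^2}\begin{pmatrix} a_{i_{n-1}} \\ b_{i_{n-1}} \end{pmatrix} + \frac{1}{2}\begin{pmatrix} a_{i_n} \\ b_{i_n} \end{pmatrix} = \begin{pmatrix} 0.a_{i_n} a_{i_{n-1}} \ldots a_{i_2} a_{i_1} 1 \\ 0.b_{i_n} b_{i_{n-1}} \ldots b_{i_2} b_{i_1} 1 \end{pmatrix}_2, \tag{10}$$

where $a_{i_k}, b_{i_k} \in \{0, 1\}$ and $i = 1, 2, 3, 4$.

Therefore, a game point $z_k$ can be represented by a binary expansion. This means that the (x,y) coordinate of a game point can be converted into a binary expansion, which is useful in finding the corresponding location of the game point on the DNA walk (see below). Expression (10) is also compatible with $s_k s_{k-1} \cdots s_2 s_1$, the address assigned to $z_k$. Each $s_n$ can be represented by the pair $(a_{i_n}, b_{i_n})$, which parallels the expression in (10) when read from $n$ down to 1.

Expression of Position of DNA Walk

Recalling (9), $\xi(s)$, the displacement at $s$, can be written as

$$\xi(s) = a_i * b_i = \begin{cases} +1, & \text{if } a_i \neq b_i \\ -1, & \text{if } a_i = b_i \end{cases}. \tag{11}$$

Thus, the position at the $k$th letter $s_k$ is equivalent to

$$X(s_k) = \sum_{u=1}^{k} \xi(u) = (a_i * b_i)_1 + (a_i * b_i)_2 + \cdots + (a_i * b_i)_k = n(a_i \neq b_i) - n(a_i = b_i),$$

where n(Y) denotes the number of cases of Y.

Implications

Since a game point can be represented by its x,y coordinate in binary expansion, each point gives a value for $X(s_k)$ by counting $n(a_i \neq b_i)$ and $n(a_i = b_i)$. This integer value can then be looked up on the plot of the DNA walk to find the corresponding $s$.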
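The correspondence between a game point, its binary expansion (10), its address, and the walk position from (11) can be checked numerically. The sketch below uses hypothetical helper names and ordinary floating-point numbers, which represent these dyadic coordinates exactly only for short sequences (roughly up to 50 letters); longer sequences would need exact rational arithmetic.

```python
# (a_i, b_i) for each letter, as in (9), and the inverse lookup.
BITS = {"T": (0, 0), "A": (0, 1), "C": (1, 1), "G": (1, 0)}
LETTER = {bits: base for base, bits in BITS.items()}


def game_point(seq):
    """Final game point z_k of seq, starting from z_0 = (1/2, 1/2)."""
    x = y = 0.5
    for base in seq:
        a, b = BITS[base]
        x, y = x / 2 + a / 2, y / 2 + b / 2
    return x, y


def binary_digits(value, n):
    """First n binary digits after the point of a number in [0, 1)."""
    digits = []
    for _ in range(n):
        value *= 2
        digit = int(value)
        digits.append(digit)
        value -= digit
    return digits


def address_from_point(x, y, k):
    """Address s_k ... s_1: the k leading digit pairs of (x, y), read as letters."""
    pairs = zip(binary_digits(x, k), binary_digits(y, k))
    return "".join(LETTER[pair] for pair in pairs)


def walk_position_from_point(x, y, k):
    """X(s_k) = n(a != b) - n(a == b) over the k leading digit pairs, as in (11)."""
    pairs = zip(binary_digits(x, k), binary_digits(y, k))
    return sum(1 if a != b else -1 for a, b in pairs)


x, y = game_point("ATGCGAGT")
```

For this example, `address_from_point(x, y, 8)` recovers TGAGCGTA, the address quoted earlier for $z_8$, and `walk_position_from_point(x, y, 8)` gives the height of the DNA walk after the eighth letter, which is the value that would be looked up on the walk.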
Unfortunately, this process is not straightforward because there can be more than one, in fact many, positions $s$ that have the same $X(s_k)$. One-to-one projection from a single point on the chaos game to a single $s$ on the DNA walk is difficult because the chaos game is two-dimensional while the DNA walk is one-dimensional.

The reasoning that the constant downward or upward displacement of letters on the DNA walk is associated with the diagonal lines from the chaos game can be shown explicitly by the following approach, even though it seems obvious. First, the points concentrated on the diagonal lines are converted to binary expansions, and $X(s_k)$ is computed accordingly. Intuitively, the values should be largely negative or largely positive, since the points are near the lines y = x or y = -x + 1. For example, a game point positioned at (0.1010…, 0.1010…) in binary, for which y = x, has a large $n(a_i = b_i)$ while $n(a_i \neq b_i) = 0$, so $X(s_k)$ is largely negative by (11). For a game point at (0.1010…, 0.0101…) in binary, y ≈ -x + 1 and $n(a_i \neq b_i)$ is much larger than $n(a_i = b_i)$, so $X(s_k)$ is largely positive by (11).

The farther $X(s_k)$ is from zero, the fewer corresponding values of $s$ will be found on the DNA walk, simply because a position $X(s_k)$ farther from the x-axis is less likely to be reached at other values of $s$. For instance, from Fig.6, there is only one $s$ each for the maximum $X(s_k)$ and the minimum $X(s_k)$. Consequently, the average magnitude of $X(s_k)$ obtained from points on the chaos game for the shuffled sequence would be lower than that for the original sequence. Although this is already implied by the R/S analysis, relating the binary expansion and $X(s_k)$ provides another perspective from which to view the different methods as a whole.

DISCUSSION

The 1x1 square is divided N times, resulting in 4^N sub-squares, which we call quadrants $q_{ij}$. Suppose the square is divided 5 times. There are then 4^5 squares and 2^5 subsections each on the x-axis and the y-axis. We choose $q_{ij}$ with i = 20, j = 9 as an example (Fig.9). This quadrant has address GATGG, which is $s_k s_{k-1} s_{k-2} s_{k-3} s_{k-4}$.
Thus, the game points located within $q_{ij}$ (i = 20, j = 9) have addresses GATGG$s_{k-5} \cdots s_2 s_1$, where 5 ≤ k ≤ 31375 in the case of this report. In an alternative view, the points can be located along the sequence wherever a fragment of the sequence ends with GGTAG, the reverse of the address (Fig.9). Therefore, if there are n points in the quadrant $q_{ij}$ (i = 20, j = 9), there will also be n segments of the sequence which end with GGTAG.

Fig.9. The square is subdivided 5 times, resulting in 4^5 sub-squares. The x- and y-axes can each be divided into 32 sections: i = 1~32 for the x-axis, and j = 1~32 for the y-axis. $q_{ij}$ with i = 20, j = 9 has address GATGG…, which is GGTAG along the sequence. The n game points positioned in this $q_{ij}$ correspond to n locations along the sequence of length l. The length of the address for each game point in this $q_{ij}$ varies.

Accordingly, GGTAG gives +1, +1, -1, +1, +1 displacement on the DNA walk. However, a major confounding problem is that there are 2^5 different combinations of letters which result in the same displacement. This might be solved by applying a two-dimensional DNA walk, which is a reasonable candidate for a future study. If a two-dimensional DNA walk were used, the position of GGTAG could presumably be located along the sequence without the confounding factor. The positions may reveal periodicity of certain fragments in the DNA sequence. The higher the density of $q_{ij}$, the higher the frequency of the sequence fragment specific to that $q_{ij}$.

It is worth recalling that the value of the chaos game is that it enables easy comparison of the frequency of every combination of the four letters. For instance, for N = 5, where N is the number of divisions, the frequencies of 4^5 different fragments 5 letters long can be obtained simultaneously. Furthermore, the smaller the quadrant, the longer the fragments of DNA sequence that can be located along the given sequence.
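The link between a quadrant and a sequence fragment can be made concrete with a short sketch. It assumes the corner coding of (9), numbers the sections from 1 as in Fig.9, and uses hypothetical function names; the toy input sequence is invented purely to show the lookup.

```python
# (a_i, b_i) for each letter, as in (9).
BITS = {"T": (0, 0), "A": (0, 1), "C": (1, 1), "G": (1, 0)}


def quadrant(fragment):
    """(i, j), numbered from 1, of the sub-square containing every game point
    whose sequence ends with `fragment` (given in sequence order)."""
    i = j = 0
    for base in reversed(fragment):    # the last letter is the leading binary digit
        a, b = BITS[base]
        i, j = 2 * i + a, 2 * j + b
    return i + 1, j + 1


def end_positions(seq, fragment):
    """1-based positions s at which the first s letters of seq end with fragment."""
    n = len(fragment)
    return [s for s in range(n, len(seq) + 1) if seq[s - n:s] == fragment]


print(quadrant("GGTAG"))                          # (20, 9), as in Fig.9
print(end_positions("TTGGTAGACGGTAG", "GGTAG"))   # [7, 14]
```

The number of positions returned by `end_positions` equals the number of game points in the corresponding quadrant, which is the counting argument made above.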
Tracing the letters on the plot of the DNA walk corresponding to $q_{ij}$ on the chaos game can be repeated for every ij, and for every N = 1, 2, …, k. This can provide not only a gross but also a detailed look into the profile of a DNA sequence, such as answering what fragments occur where and how frequently. This approach, which comes from merging the chaos game and the DNA walk, offers an insight into developing an algorithm for software that can detect unknown nucleotide repeats in an input sequence (locating known nucleotide repeats is, of course, easy). The ability to search for nucleotide repeats of many different lengths might be relevant in biological and medical studies, such as finding new transposable elements.

CONCLUSIONS

The serotonin receptor 2A gene (31375bp) was analyzed by using the chaos game and fractional Brownian motion, or DNA walk. As a result, CG depletion and purine-purine and pyrimidine-pyrimidine association were observed and explained through computer simulation or mathematical reasoning. Explanation of the patterns observed in the chaos game and the DNA walk was facilitated by understanding both methods together. The one-dimensional DNA walk created a time series plot of the DNA sequence and enabled estimation of the Hurst exponent, which led to the finding of persistence in the sequence. However, coordinating the data from the chaos game and the one-dimensional DNA walk is limited, mainly because of the different dimension of each method. Application of a two-dimensional DNA walk might solve this problem, and is recommended for a future study.

ACKNOWLEDGEMENTS

I would like to thank Dr. Randall Pyke for encouragement and helpful discussions, Joseph Mocanu for solving technical problems regarding computer programs, and Thomas Karagiannis for guidance on his SELFIS program.
REFERENCES

Software

Chaos game, modified chaos game applets: http://www.math.toronto.edu/courses/335/
Clustal W: http://clustalw.genome.ad.jp/
Dnacgr 2.0 (Chaos Game Representation of DNA): Indraneel Majumdar, 2000. bioinformatics.org/cgi-bin/cvsweb.cgi/dnacgr/
DNA Walker: Department of Biochemistry and Microbiology, University of Victoria, 2003. http://athena.bioc.uvic.ca/pbr/walk
SELFIS (SELF similarity analysIS): Thomas Karagiannis, University of California at Riverside, 2001. http://www.google.ca/search?q=cache:kcabiNQzLLkC:www.cs.ucr.edu/~tkarag/Selfis/Selfis.html+hurst+exponent+download+download+-benoit+-order+-filetype:pdf&hl=en&ie=UTF-8
Shuffle DNA: http://www.gchelpdesk.ualberta.ca/downloads/shuffle_dna.html

Data source

Serotonin receptor 2A gene (HTR2A) sequence: GenBank. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NT_024524.12&from=15982005&to=16044755&txt=on&view=fasta
Sequence information: http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=3356

Literature Cited

Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., and Fletcher, M. (2001). Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17: 429-437
Borovik, A.S., Grosberg, A.Y., and Frank-Kamenetskii, M.D. (1994). Fractality of DNA texts. J. Biomol. Structure & Dynamics, 12: 655-669
Chatzidimitriou-Dreismann, C.A., Streffer, R.M.F., and Larhammar, D. (1994). Variations in base pair composition and associated long-range correlations in DNA sequences: computer simulation results. Biochimica et Biophysica Acta, 1217: 181-187
Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., and Fertil, B. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol., 16(10): 1391-1399
Feder, J. (1988). Fractals. Plenum Press, NY & London
Jeffrey, H.J. (1990). Chaos game representation of gene structure. Nuc. Acids Res., 18: 2163-2170
Kaplan, I. (2003). http://www.bearcave.com/misl/misl_tech/wavelets/hurst/
Mandelbrot, B.B. (1982). The Fractal Geometry of Nature. Freeman & Co., New York
Peng, C.K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., and Stanley, H.E. (1992). Long-range correlations in nucleotide sequences. Nature, 356: 168-170
Solovyev, V.V. (1993). Fractal graphical representation and analysis of DNA and protein sequences. BioSystems, 30: 137-160
Voss, R.F. (1992). Evolution of long-range fractal correlation and 1/f noise in DNA sequences. Phys. Rev. Lett., 68: 3805-3808
