Application of Fuzzy Composition Relation For DNA Sequence Classification
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Amrita Priyam
Dept. of Computer Science and Engineering
Birla Institute of Technology
Ranchi, India
email@example.com

B. M. Karan+, G. Sahoo++
+ Dept. of Electrical and Electronics Engineering
++ Dept. of Information Technology
Birla Institute of Technology
Ranchi, India

Abstract: This paper presents a probabilistic approach for DNA sequence analysis. A DNA sequence consists of an arrangement of the four nucleotides A, C, T and G. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Further, the paper combines the different types of classification classes using a fuzzy composition relation. Finally, a log-odds ratio is used to decide to which class the given sequence belongs.

Keywords: Transition Probability, Fuzzy Composition Relation, Log-Odd ratio

I. INTRODUCTION

A DNA sequence is a succession of the letters A, C, T and G, and a sequence can be any combination of these letters. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process; that is, it is a mathematical model for a biological system which is governed by a set of probability measures. The occurrence of the letters can lead us to the further study of genetic disorders. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol and not on any other symbol before that. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Each class has a set of transition probabilities associated with it, and each transition probability is the measure of going from one state to another. We further group the similar classes, and their respective transition probabilities are merged using a fuzzy composition relation. Finally, a log-odds ratio is used to decide to which class the given sequence belongs.

II. DNA SEQUENCES

A DNA sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology. There are 4 nucleotide bases: A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). DNA sequencing is the process of determining the exact order of the bases A, T, C and G in a piece of DNA. In essence, the DNA is used as a template to generate a set of fragments that differ in length from each other by a single base. The fragments are then separated by size, and the bases at the end are identified, recreating the original sequence of the DNA. The most commonly used method of sequencing DNA, the dideoxy or chain termination method, was developed by Fred Sanger in 1977 (for which he won his second Nobel Prize). The key to the method is the use of modified bases called dideoxy bases; when a piece of DNA is being replicated and a dideoxy base is incorporated into the new chain, it stops the replication reaction.

Most DNA sequencing is carried out using the chain termination method. This involves the synthesis of new DNA strands on a single-stranded template and the random incorporation of chain-terminating nucleotide analogues. The chain termination method produces a set of DNA molecules differing in length by one nucleotide. The last base in each molecule can be identified by way of a unique label. Separation of these DNA molecules according to size places them in the correct order to read off the sequence.

III. A PROBABILISTIC APPROACH FOR SEQUENCE REPRESENTATION

A DNA sequence is essentially represented as a string of the four characters A, C, T, G and looks something like ACCTGACCTTACG. These strings can also be represented in terms of probability measures, and using these measures a sequence can be depicted graphically as well. This graphical representation matches the Hidden Markov Model. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process. There are different ways to use probabilities for depicting DNA sequences. The diagrammatical representation can be shown as follows:

FIG 1: [The states of A, C, G and T.]

For example, the transition probability from state G to state T is 0.08, i.e., $P(x_i = T \mid x_{i-1} = G) = 0.08$.

In a given sequence x of length L, $x_1, x_2, \ldots, x_L$ represent the nucleotides. The sequence starts at the first state $x_1$ and makes successive transitions to $x_2$, $x_3$ and so on, until $x_L$. The probability of $x_L$ depends only on the value of the previous state, $x_{L-1}$, not on the entire previous sequence. This characteristic is known as the Markov property and can be written as:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \tag{1}$$

In Equation (1) we need to specify $P(x_1)$, the probability of the starting state. For simplicity, we would like to model this as a transition too. This can be done by adding a begin state, denoted by 0, so that the starting state becomes $x_0 = 0$. Now, writing $a_{x_{i-1} x_i}$ for the transition probability, we can rewrite (1) as

$$P(x) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \tag{2}$$

If there are n classes, the probability of a sequence x has to be calculated for every one of the classes. To overcome this drawback we use the fuzzy composition relation. That is, we divide the n classes into different groups based on their similarities. So, if out of n classes m are similar, then they are treated as one group and their individual transition probability tables are merged using the fuzzy composition relation. The remaining (n - m) classes are similarly grouped. For two classes $R_1$ and $R_2$, the fuzzy (max-min) composition relation between $R_1$ and $R_2$ can be written as follows:

$$(R_1 \circ R_2)(x, z) = \max_{y} \min\big(R_1(x, y),\, R_2(y, z)\big) \tag{3}$$

Fig 2: Grouping of similar classes [panels: Different class representation; Grouping of similar classes]

A table is then constructed representing all of the (n - m) similar classes. From this table we compute the probability that a sequence x belongs to a given group using the following equation:

$$\log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} \tag{4}$$

Here "+" denotes the transition probabilities of one of the classes, obtained using the fuzzy composition relation, and "-" denotes the corresponding transition probabilities of another class.

If this ratio is greater than zero, we can say that the sequence x is from the first class; otherwise it is from the other one.

An Example:

Let us consider an example for applying this classification method. We have taken into consideration the Swine flu data. The different categories of the Swine flu data are shown as R1, R2 and R3.

R1, R2 and R3 show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Avian Flu.

R1 =
         A     C     T     G
    A  0.13  0.06  0.09  0.08
    C  0.09  0.04  0.05  0.02
    T  0.06  0.05  0.05  0.07
    G  0.08  0.04  0.05  0.06

R2 =
         A     C     T     G
    A  0.13  0.06  0.09  0.08
    C  0.08  0.04  0.04  0.02
    T  0.06  0.05  0.06  0.07
    G  0.08  0.04  0.04  0.06

R3 =
         A     C     T     G
    A  0.08  0.05  0.08  0.08
    C  0.08  0.03  0.06  0.02
    T  0.05  0.06  0.06  0.09
    G  0.08  0.06  0.04  0.07

Applying the fuzzy composition relation to R1 and R2, and then composing the result with R3, we get the final table for the Swine Flu class as:

         A     C     T     G
    A  0.08  0.06  0.08  0.09
    C  0.08  0.06  0.08  0.09
    T  0.07  0.06  0.07  0.07
    G  0.08  0.06  0.08  0.08

Similarly, we can repeat the same procedure for another class, Staphylococcus. X1, X2 and X3 show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Staphylococcus.

X1 =
         A     C     T     G
    A  0.08  0.05  0.07  0.06
    C  0.06  0.05  0.06  0.08
    T  0.04  0.06  0.08  0.07
    G  0.08  0.09  0.03  0.05

X2 =
         A     C     T     G
    A  0.09  0.06  0.06  0.06
    C  0.07  0.05  0.06  0.07
    T  0.04  0.05  0.07  0.07
    G  0.07  0.09  0.04  0.05

X3 =
         A     C     T     G
    A  0.07  0.05  0.07  0.04
    C  0.05  0.06  0.06  0.08
    T  0.03  0.06  0.04  0.09
    G  0.06  0.09  0.06  0.07

Applying the fuzzy composition relation to these tables we get the final table as

         A     C     T     G
    A  0.07  0.06  0.05  0.06
    C  0.08  0.09  0.07  0.06
    T  0.04  0.03  0.07  0.06
    G  0.05  0.08  0.04  0.02

Suppose we are given a sequence x which is to be classified as falling into one of the given classes, say x = CGCG. From the final fuzzy composition tables, the log-odds ratio of this sequence is:

$$\log \frac{0.09}{0.08} + \log \frac{0.06}{0.07} + \log \frac{0.09}{0.08} = 0.032270$$

Now, since this ratio is greater than 0, we can conclude that the input sequence x belongs to the class Avian Flu. If further classification of the data is needed, we will then consult the individual transition probabilities for all three types.

CONCLUSION

In this paper we have used a probabilistic function for the Markov property. We have applied this for probabilistic determination in the case of the Avian flu virus and Staphylococcus. The paper also presented a way of identifying particular classes of genes or proteins. A given input sequence can belong to either of the given classes. By using a transition probability measure alone, one had to determine a value for each class even though the classes were similar. The paper presented a scheme in which the similar classes are merged using the fuzzy composition relation, so that instead of calculating each individual probability measure, one measure is sufficient to depict all the similar classes. This measure is then used in the log-odds ratio to finally predict the class of the input sequence.

REFERENCES

[1] Anna Loanova, "Introduction to (log) odds ratio statistics and Methodology", University of Groningen, 2008.
[2] C. E. Shannon, "A Mathematical Theory of Communication", The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656.
[3] D. W. Mount, "Bioinformatics, Sequence and Genome Analysis", 2nd edition, CSHL Press, 2004.
[4] Durbin, Eddy, Krogh, Mitchison, "Biological Sequence Analysis", Cambridge University Press, 1998.
[5] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected Application in speech recognition", Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.
[6] Lee, Kwang Hyung, "First course on Fuzzy Theory and Applications", Advances in Soft Computing, Vol. 27, Springer, 2005.
[7] Michael Hanss, "Applied Fuzzy Arithmetic: An Introduction with Engineering Applications", Springer, 2005.
[8] T. Dewey and M. Herzel, "Application of Information Theory to Biology", Pacific Symposium on Biocomputing, 5:597-598 (2000).
[9] W. J. Ewens, G. R. Grant, "Statistical Methods in Bioinformatics: An Introduction", Vol. 13, 2nd edition, Springer.
[10] Y. Ephraim, L. R. Rabiner, "On the Relations Between Modeling Approaches for Speech Recognition", IEEE Transactions on Information Theory, Vol. 36, No. 2, March 1990.
[11] A. Priyam, B. M. Karan, G. Sahoo, "A Probabilistic Model For Sequence Analysis", International Journal of Computer Science and Information Security, Vol. 7, No. 1, (2010) 244-247.
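
As an illustration of the procedure described in Section III, the following is a minimal Python sketch of the max-min fuzzy composition and the log-odds score, applied to the transition tables of the worked example. It is not the authors' implementation. The function names `max_min_compose` and `log_odds`, the A, C, T, G index order, and the use of the natural logarithm are assumptions made here (the paper does not state the base of the logarithm). As in the paper's worked expression, the begin-state transition is left out of the sum.

```python
import math

BASES = "ACTG"  # row/column order of the transition tables in the example


def max_min_compose(r1, r2):
    """Max-min composition: result[i][k] = max over j of min(r1[i][j], r2[j][k])."""
    n = len(r1)
    return [
        [max(min(r1[i][j], r2[j][k]) for j in range(n)) for k in range(n)]
        for i in range(n)
    ]


def log_odds(seq, plus, minus):
    """Sum of log(a+ / a-) over consecutive symbol pairs of seq.

    Assumption: natural log, and no begin-state term (as in the worked example).
    """
    idx = {b: i for i, b in enumerate(BASES)}
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        i, j = idx[prev], idx[cur]
        total += math.log(plus[i][j] / minus[i][j])
    return total


# Transition tables copied from the example (rows and columns in A, C, T, G order).
R1 = [[0.13, 0.06, 0.09, 0.08], [0.09, 0.04, 0.05, 0.02],
      [0.06, 0.05, 0.05, 0.07], [0.08, 0.04, 0.05, 0.06]]
R2 = [[0.13, 0.06, 0.09, 0.08], [0.08, 0.04, 0.04, 0.02],
      [0.06, 0.05, 0.06, 0.07], [0.08, 0.04, 0.04, 0.06]]
R3 = [[0.08, 0.05, 0.08, 0.08], [0.08, 0.03, 0.06, 0.02],
      [0.05, 0.06, 0.06, 0.09], [0.08, 0.06, 0.04, 0.07]]
X1 = [[0.08, 0.05, 0.07, 0.06], [0.06, 0.05, 0.06, 0.08],
      [0.04, 0.06, 0.08, 0.07], [0.08, 0.09, 0.03, 0.05]]
X2 = [[0.09, 0.06, 0.06, 0.06], [0.07, 0.05, 0.06, 0.07],
      [0.04, 0.05, 0.07, 0.07], [0.07, 0.09, 0.04, 0.05]]
X3 = [[0.07, 0.05, 0.07, 0.04], [0.05, 0.06, 0.06, 0.08],
      [0.03, 0.06, 0.04, 0.09], [0.06, 0.09, 0.06, 0.07]]

# Merge the three varieties of each class: (T1 o T2) o T3.
avian = max_min_compose(max_min_compose(R1, R2), R3)
staph = max_min_compose(max_min_compose(X1, X2), X3)

score = log_odds("CGCG", avian, staph)
print("merged table for the first class:", avian)
print("log-odds for CGCG:", score)
```

Under these assumptions, composing R1, R2 and R3 reproduces the merged table printed for the first class, and the three transitions of CGCG (C to G, G to C, C to G) give the ratios 0.09/0.08, 0.06/0.07 and 0.09/0.08 that appear in the example's log-odds expression. A positive score assigns the sequence to the "+" class.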