Application of Fuzzy Composition Relation For DNA Sequence Classification
IJCSIS is an open access publishing venue for research in general computer science and information security. Target Audience: IT academics, university IT faculties; industry IT departments; government departments; the mobile industry and computing industry. Coverage includes: security infrastructures, network security: Internet security, content protection, cryptography, steganography and formal methods in information security; computer science, computer applications, multimedia systems, software, information systems, intelligent systems, web services, data mining, wireless communication, networking and technologies, innovation technology and management. The average paper acceptance rate for IJCSIS issues is kept at 25-30% with an aim to provide selective research work of quality in the areas of computer science and engineering. Thanks for your contributions in September 2010 issue and we are grateful to the experienced team of reviewers for providing valuable comments.
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 6, 2010
Application of Fuzzy Composition Relation For DNA
Sequence Classification
Amrita Priyam, B. M. Karan+, G. Sahoo++
Dept. of Computer Science and Engineering, Birla Institute of Technology, Ranchi, India. amrita.priyam@gmail.com
+ Dept. of Electrical and Electronics Engineering, Birla Institute of Technology, Ranchi, India
++ Dept. of Information Technology, Birla Institute of Technology, Ranchi, India
Abstract: This paper presents a probabilistic approach for DNA sequence analysis. A DNA sequence consists of an arrangement of the four nucleotides A, C, T and G. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Further, the paper combines the different types of classification classes using a fuzzy composition relation. Finally, a log-odds ratio is used to decide which class a given sequence belongs to.

Keywords: Transition Probability, Fuzzy Composition Relation, Log-Odds Ratio

I. INTRODUCTION
A DNA sequence is a succession of the letters A, C, T and G. The sequences are any combination of these letters. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process, that is, a mathematical model for a biological system which is governed by a set of probability measures. The occurrence of the letters can lead us to the further study of genetic disorders. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol and not on any other symbol before that. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Each class has a set of transition probabilities associated with it. A transition probability is the measure of going from one state to another. We further group the similar classes and merge their respective transition probabilities using a fuzzy composition relation. Finally, a log-odds ratio is used to decide which class a given sequence belongs to.

II. DNA SEQUENCES
A DNA sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology. There are 4 nucleotide bases: A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). DNA sequencing is the process of determining the exact order of the bases A, T, C and G in a piece of DNA [3]. In essence, the DNA is used as a template to generate a set of fragments that differ in length from each other by a single base. The fragments are then separated by size, and the bases at the end are identified, recreating the original sequence of the DNA [8][9]. The most commonly used method of sequencing DNA, the dideoxy or chain termination method, was developed by Fred Sanger in 1977 (for which he won his second Nobel Prize). The key to the method is the use of modified bases called dideoxy bases; when a piece of DNA is being replicated and a dideoxy base is incorporated into the new chain, it stops the replication reaction.

Most DNA sequencing is carried out using the chain termination method [4]. This involves the synthesis of new DNA strands on a single-stranded template and the random incorporation of chain-terminating nucleotide analogues. The chain termination method produces a set of DNA molecules differing in length by one nucleotide. The last base in each molecule can be identified by way of a unique label. Separation of these DNA molecules according to size places them in the correct order to read off the sequence.
http://sites.google.com/site/ijcsis/
ISSN 1947-5500
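As a minimal sketch of the first-order model described in the Introduction, the following Python snippet builds a table of transition probabilities $a_{ij}$ from a single sequence by counting adjacent symbol pairs. The counting estimate and the function name `estimate_transitions` are assumptions made for illustration; the paper does not state how its transition tables were obtained.

```python
from collections import Counter

ALPHABET = "ACTG"  # the four states, in the column order used by the paper's tables


def estimate_transitions(sequence):
    """Estimate a[i][j] = P(next symbol is j | current symbol is i).

    Assumption: probabilities are estimated by counting adjacent pairs.
    """
    pair_counts = Counter(zip(sequence, sequence[1:]))
    table = {}
    for i in ALPHABET:
        row_total = sum(pair_counts[(i, j)] for j in ALPHABET)
        # A state that never appears as a predecessor gets an all-zero row.
        table[i] = {
            j: (pair_counts[(i, j)] / row_total if row_total else 0.0)
            for j in ALPHABET
        }
    return table


# Example string quoted in the paper; any string over A, C, T, G works.
table = estimate_transitions("ACCTGACCTTACG")
for state, row in table.items():
    print(state, row)
```

Each row of the resulting table is a conditional distribution over the next symbol, so a row sums to 1 whenever its state occurs as a predecessor in the input.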
III. A PROBABILISTIC APPROACH FOR SEQUENCE REPRESENTATION
A DNA sequence is essentially represented as a string of the four characters A, C, T and G, and looks something like ACCTGACCTTACG. These strings can also be represented in terms of probability measures and, using these measures, can be depicted graphically as well. This graphical representation matches the Hidden Markov Model. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process [2]. There are different ways to use probabilities for depicting DNA sequences. The diagrammatic representation can be shown as follows:

Fig 1: The states of A, C, G and T.

For example, the transition probability from state G to state T is 0.08, i.e.,

$$P(x_i = T \mid x_{i-1} = G) = 0.08$$

In a given sequence x of length L, $x_1, x_2, \ldots, x_L$ represent the nucleotides. The sequence starts at the first state $x_1$ and makes successive transitions to $x_2$, $x_3$ and so on, up to $x_L$. The probability of $x_L$ depends only on the value of the previous state, $x_{L-1}$, not on the entire preceding sequence [6]. This characteristic is known as the Markov property [5] and can be written as:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \tag{1}$$

In Equation (1) we need to specify $P(x_1)$, the probability of the starting state. For simplicity, we would like to model this as a transition too. This can be done by adding a begin state, denoted by 0, so that the starting state becomes $x_0 = 0$. Now, writing the transition probability as $a_{x_{i-1} x_i}$, we can rewrite (1) as

$$P(x) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \tag{2}$$

If there are n classes, then we would have to calculate the probability of the sequence x for every one of the classes. To overcome this drawback we use the fuzzy composition relation. That is, we divide the n classes into different groups based on their similarities. So, if m out of the n classes are similar, they are treated as one group and their individual transition probability tables are merged using the fuzzy composition relation. The remaining (n – m) classes are similarly grouped. For example, if there are two classes $R_1$ and $R_2$, the fuzzy composition relation between $R_1$ and $R_2$ [6][7] can be written as follows:

$$(R_1 \circ R_2)(x, z) = \max_{y} \min\big(R_1(x, y),\, R_2(y, z)\big) \tag{3}$$

Fig 2: Grouping of similar classes

A table is then constructed representing the entire (n – m) similar classes. From this table we compute the log-odds ratio that a sequence x belongs to a given group using the following equation:

$$\log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} \tag{4}$$

Here "+" denotes the transition probabilities of one of the classes obtained using the fuzzy composition relation, and "-" denotes the transition probabilities of another class [1].

If this ratio is greater than zero, then we can say that the sequence x is from the first class; otherwise it is from the other one.

An Example:
Let us consider an example of applying this classification method. We have taken into consideration the Swine flu data [11]. The different categories of the Swine flu data are shown as R1, R2 and R3.

R1, R2 and R3 show the transition probabilities of Type 1, Type 2 and Type 3 varieties of Avian Flu.
$$R_1 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.13 & 0.06 & 0.09 & 0.08 \\ C & 0.09 & 0.04 & 0.05 & 0.02 \\ T & 0.06 & 0.05 & 0.05 & 0.07 \\ G & 0.08 & 0.04 & 0.05 & 0.06 \end{array}$$

$$R_2 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.13 & 0.06 & 0.09 & 0.08 \\ C & 0.08 & 0.04 & 0.04 & 0.02 \\ T & 0.06 & 0.05 & 0.06 & 0.07 \\ G & 0.08 & 0.04 & 0.04 & 0.06 \end{array}$$

$$R_3 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.08 & 0.05 & 0.08 & 0.08 \\ C & 0.08 & 0.03 & 0.06 & 0.02 \\ T & 0.05 & 0.06 & 0.06 & 0.09 \\ G & 0.08 & 0.06 & 0.04 & 0.07 \end{array}$$

Applying the fuzzy composition relation to R1 and R2, and then composing the result with relation R3, we get the final table for the Swine Flu class:

$$\begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.08 & 0.06 & 0.08 & 0.09 \\ C & 0.08 & 0.06 & 0.08 & 0.09 \\ T & 0.07 & 0.06 & 0.07 & 0.07 \\ G & 0.08 & 0.06 & 0.08 & 0.08 \end{array}$$

Similarly, we can repeat the same procedure for another class, Staphylococcus. X1, X2 and X3 show the transition probabilities of Type 1, Type 2 and Type 3 varieties of Staphylococcus.

$$X_1 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.08 & 0.05 & 0.07 & 0.06 \\ C & 0.06 & 0.05 & 0.06 & 0.08 \\ T & 0.04 & 0.06 & 0.08 & 0.07 \\ G & 0.08 & 0.09 & 0.03 & 0.05 \end{array}$$

$$X_2 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.09 & 0.06 & 0.06 & 0.06 \\ C & 0.07 & 0.05 & 0.06 & 0.07 \\ T & 0.04 & 0.05 & 0.07 & 0.07 \\ G & 0.07 & 0.09 & 0.04 & 0.05 \end{array}$$

$$X_3 = \begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.07 & 0.05 & 0.07 & 0.04 \\ C & 0.05 & 0.06 & 0.06 & 0.08 \\ T & 0.03 & 0.06 & 0.04 & 0.09 \\ G & 0.06 & 0.09 & 0.06 & 0.07 \end{array}$$

Applying the fuzzy composition relation to these tables, we get the final table:

$$\begin{array}{c|cccc} & A & C & T & G \\ \hline A & 0.07 & 0.06 & 0.05 & 0.06 \\ C & 0.08 & 0.09 & 0.07 & 0.06 \\ T & 0.04 & 0.03 & 0.07 & 0.06 \\ G & 0.05 & 0.08 & 0.04 & 0.02 \end{array}$$

Suppose we are given a sequence x that is to be classified into one of the given classes, say x = CGCG. From the final fuzzy composition tables, the log-odds ratio of this sequence is:

$$\log \frac{0.09}{0.08} + \log \frac{0.06}{0.07} + \log \frac{0.09}{0.08} = 0.032270$$

Now, since this ratio is greater than 0, we can conclude that the input sequence x belongs to the class Avian Flu. If further classification of the data is needed, we then consult the individual transition probabilities for all three types.

CONCLUSION
In this paper we have used a probabilistic function based on the Markov property. We have applied it to probabilistic determination in the case of the Avian flu virus and Staphylococcus. The paper also presented a way of identifying particular classes of genes or proteins. A given input sequence can belong to either of the given classes. When using a transition probability measure alone, one had to determine a value for each class even though the classes were similar. The paper presented a scheme in which similar classes are merged using the fuzzy composition relation, so that instead of calculating each individual probability measure, one measure is sufficient to represent all the similar classes. This measure is then used in the log-odds ratio to predict the class of the input sequence.
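The grouping and classification steps of Section III can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function names `compose` and `log_odds` are hypothetical, the natural logarithm is an assumption (the paper does not state the base), and, as in the worked example, only transitions between consecutive symbols are summed, with no begin state. The tables are copied as printed in the paper. Composing R1, R2 and R3 with the max-min rule reproduces the printed Swine Flu table; the numeric log-odds value depends on the logarithm base and on the table entries used, so it need not equal the figure quoted in the worked example, and only its sign is used for the decision.

```python
import math

STATES = "ACTG"  # row and column order of the paper's tables


def compose(r1, r2):
    """Max-min composition: result[x][z] = max over y of min(r1[x][y], r2[y][z])."""
    n = len(r1)
    return [
        [max(min(r1[x][y], r2[y][z]) for y in range(n)) for z in range(n)]
        for x in range(n)
    ]


def log_odds(sequence, plus, minus):
    """Sum of log(a+ / a-) over consecutive symbol pairs of the sequence."""
    index = {s: k for k, s in enumerate(STATES)}
    total = 0.0
    for prev, cur in zip(sequence, sequence[1:]):
        i, j = index[prev], index[cur]
        total += math.log(plus[i][j] / minus[i][j])
    return total


# Transition tables R1, R2, R3 as printed in the paper.
R1 = [[0.13, 0.06, 0.09, 0.08],
      [0.09, 0.04, 0.05, 0.02],
      [0.06, 0.05, 0.05, 0.07],
      [0.08, 0.04, 0.05, 0.06]]
R2 = [[0.13, 0.06, 0.09, 0.08],
      [0.08, 0.04, 0.04, 0.02],
      [0.06, 0.05, 0.06, 0.07],
      [0.08, 0.04, 0.04, 0.06]]
R3 = [[0.08, 0.05, 0.08, 0.08],
      [0.08, 0.03, 0.06, 0.02],
      [0.05, 0.06, 0.06, 0.09],
      [0.08, 0.06, 0.04, 0.07]]

# Final Staphylococcus table, copied as printed (not recomputed here).
STAPH = [[0.07, 0.06, 0.05, 0.06],
         [0.08, 0.09, 0.07, 0.06],
         [0.04, 0.03, 0.07, 0.06],
         [0.05, 0.08, 0.04, 0.02]]

# Merge the three similar classes into one group table.
group_table = compose(compose(R1, R2), R3)

score = log_odds("CGCG", group_table, STAPH)
label = "first class (+)" if score > 0 else "second class (-)"
print(group_table)
print(label)
```

Because the composition only takes minima and maxima of existing entries, every value in the merged table is one of the input values, which is why the result can be compared entry by entry with the printed table.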
REFERENCES
[1] Anna Loanova, “Introduction to (log) odds ratio statistics and Methodology”, University of Groningen, 2008.
[2] C. E. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656.
[3] D. W. Mount, “Bioinformatics: Sequence and Genome Analysis”, 2nd edition, CSHL Press, 2004.
[4] Durbin, Eddy, Krogh, Mitchison, “Biological Sequence Analysis”, Cambridge University Press, 1998.
[5] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.
[6] Lee, Kwang Hyung, “First Course on Fuzzy Theory and Applications”, Advances in Soft Computing, Vol. 27, Springer, 2005.
[7] Michael Hanss, “Applied Fuzzy Arithmetic: An Introduction with Engineering Applications”, Springer, 2005.
[8] T. Dewey and M. Herzel, “Application of Information Theory to Biology”, Pacific Symposium on Biocomputing, 5:597–598, 2000.
[9] W. J. Ewens, G. R. Grant, “Statistical Methods in Bioinformatics: An Introduction”, Vol. 13, 2nd edition, Springer.
[10] Y. Ephraim, L. R. Rabiner, “On the Relations Between Modeling Approaches for Speech Recognition”, IEEE Transactions on Information Theory, Vol. 36, No. 2, March 1990.
[11] A. Priyam, B. M. Karan, G. Sahoo, “A Probabilistic Model For Sequence Analysis”, International Journal of Computer Science and Information Security, Vol. 7, No. 1, 2010, pp. 244-247.