(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010


Application of Fuzzy Composition Relation For DNA Sequence Classification

Amrita Priyam
Dept. of Computer Science and Engineering
Birla Institute of Technology
Ranchi, India.
amrita.priyam@gmail.com

B. M. Karan+, G. Sahoo++
+Dept. of Electrical and Electronics Engineering
++Dept. of Information Technology
Birla Institute of Technology
Ranchi, India

Abstract: This paper presents a probabilistic approach for DNA sequence analysis. A DNA sequence consists of an arrangement of the four nucleotides A, C, T and G. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Further, the paper combines the different types of classification classes using a fuzzy composition relation. Finally, a log-odds ratio is used for deciding to which class the given sequence belongs.

Keywords: Transition Probability, Fuzzy Composition Relation, Log-Odds Ratio

I.    INTRODUCTION

A DNA sequence is a succession of the letters A, C, T and G, and a sequence may be any combination of these letters. A physical or mathematical model of a system produces a sequence of symbols according to certain probabilities associated with them. This is known as a stochastic process; here it serves as a mathematical model for a biological system that is governed by a set of probability measures. The occurrence of the letters can lead us to the further study of genetic disorders. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol and not on any other symbol before that. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability, $a_{ij}$, which represents the conditional probability of going to state j in the next step, given that the current state is i. Each class has a set of transition probabilities associated with it, where a transition probability is the measure of going from one state to another. We further group the similar classes and merge their respective transition probabilities using a fuzzy composition relation. Finally, a log-odds ratio is used for deciding to which class the given sequence belongs.

II.   DNA SEQUENCES

A DNA sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology. There are 4 nucleotide bases (A: Adenine, C: Cytosine, G: Guanine, T: Thymine). DNA sequencing is the process of determining the exact order of the bases A, T, C and G in a piece of DNA [3]. In essence, the DNA is used as a template to generate a set of fragments that differ in length from each other by a single base. The fragments are then separated by size, and the bases at the end are identified, recreating the original sequence of the DNA [8][9]. The most commonly used method of sequencing DNA, the dideoxy or chain termination method, was developed by Fred Sanger in 1977 (for which he won his second Nobel Prize). The key to the method is the use of modified bases called dideoxy bases; when a piece of DNA is being replicated and a dideoxy base is incorporated into the new chain, it stops the replication reaction.

Most DNA sequencing is carried out using the chain termination method [4]. This involves the synthesis of new DNA strands on a single-stranded template and the random incorporation of chain-terminating nucleotide analogues. The chain termination method produces a set of DNA molecules differing in length by one nucleotide. The last base in each molecule can be identified by way of a unique label. Separation of these DNA molecules according to size places them in the correct order to read off the sequence.



http://sites.google.com/site/ijcsis/, ISSN 1947-5500
III.       A PROBABILISTIC APPROACH FOR SEQUENCE REPRESENTATION

A DNA sequence is essentially represented as a string of the four characters A, C, T, G and looks something like ACCTGACCTTACG. These strings can also be represented in terms of probability measures, and using these measures they can be depicted graphically as well. This graphical representation matches the hidden Markov model. A physical or mathematical model of a system produces a sequence of symbols according to certain probabilities associated with them. This is known as a stochastic process [2]. There are different ways to use probabilities for depicting DNA sequences. The diagrammatical representation can be shown as follows:

FIG 1: [The states of A, C, G and T.]

For example, the transition probability from state G to state T is 0.08, i.e.,

$$P(x_i = T \mid x_{i-1} = G) = 0.08$$

In a given sequence x of length L, $x_1, x_2, \ldots, x_L$ represent the nucleotides. The sequence starts at the first state $x_1$ and makes successive transitions to $x_2$, $x_3$ and so on, till $x_L$. Using the Markov property [6], the probability of $x_L$ depends only on the value of the previous state, $x_{L-1}$, not on the entire previous sequence. This characteristic is known as the Markov property [5] and can be written as:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \qquad (1)$$

In Equation (1) we need to specify $P(x_1)$, the probability of the starting state. For simplicity, we would like to model this as a transition too. This can be done by adding a begin state, denoted by 0, so that the starting state becomes $x_0 = 0$.

Now, writing $a_{x_{i-1} x_i}$ for the transition probability, we can rewrite (1) as

$$P(x) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \qquad (2)$$

If there are n classes, the probability of a sequence x has to be calculated for every one of the n classes. To overcome this drawback we use the fuzzy composition relation. That is, we divide the n classes into different groups based on their similarities. So, if m out of the n classes are similar, they are treated as one group and their individual transition probability tables are merged using the fuzzy composition relation. The remaining (n – m) classes are similarly grouped. For two classes with tables $R_1$ and $R_2$, the (max-min) fuzzy composition relation between $R_1$ and $R_2$ [6][7] can be written as follows:

$$(R_1 \circ R_2)(x, z) = \max_{y} \min\big(R_1(x, y),\, R_2(y, z)\big) \qquad (3)$$

Fig 2: Grouping of similar classes [panels: Different class representation; Grouping of similar classes]

A table is then constructed representing the entire (n – m) similar classes. From this table we compute the probability that a sequence x belongs to a given group using the following equation:

$$\log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} \qquad (4)$$

Here "+" denotes the transition probabilities of one of the classes, obtained using the fuzzy composition relation, and "-" denotes the corresponding transition probabilities for another class [1].

If this ratio is greater than zero, then we can say that the sequence x is from the first class; otherwise it is from the other one.

An Example:

Let us consider an example for applying this classification method. We have taken into consideration the Swine flu data [11]. The different categories of the Swine flu data are shown as R1, R2 and R3.

R1, R2 and R3 show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Avian Flu.
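Equation (2) amounts to a product of table lookups along the sequence. The following is a minimal Python sketch of that product, not the authors' implementation. It assumes the table is stored as a nested dictionary with rows indexed by the previous symbol and columns by the next symbol, and it treats the begin-state transition $a_{0 x_1}$ as 1, a simplification the text does not state. The function name `sequence_probability` is hypothetical, and the Type 1 table R1 of the example is used only as sample input.

```python
import math

# Type 1 table R1 from the example.
# Assumed layout: outer key = previous symbol, inner key = next symbol.
R1 = {
    "A": {"A": 0.13, "C": 0.06, "T": 0.09, "G": 0.08},
    "C": {"A": 0.09, "C": 0.04, "T": 0.05, "G": 0.02},
    "T": {"A": 0.06, "C": 0.05, "T": 0.05, "G": 0.07},
    "G": {"A": 0.08, "C": 0.04, "T": 0.05, "G": 0.06},
}


def sequence_probability(x, a):
    """Multiply a[x_{i-1}][x_i] over consecutive symbols of x, as in Equation (2).

    Assumption: the begin-state transition a_{0 x_1} is taken to be 1,
    so only transitions between consecutive symbols contribute.
    """
    p = 1.0
    for prev, cur in zip(x, x[1:]):
        p *= a[prev][cur]
    return p


# CGCG uses the transitions C->G, G->C and C->G.
p = sequence_probability("CGCG", R1)
print(p)
```

For long sequences the product underflows quickly, so in practice one would usually sum logarithms of the table entries instead, which is also the form used in Equation (4).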




R1 =
          A      C      T      G
    A   0.13   0.06   0.09   0.08
    C   0.09   0.04   0.05   0.02
    T   0.06   0.05   0.05   0.07
    G   0.08   0.04   0.05   0.06

R2 =
          A      C      T      G
    A   0.13   0.06   0.09   0.08
    C   0.08   0.04   0.04   0.02
    T   0.06   0.05   0.06   0.07
    G   0.08   0.04   0.04   0.06

R3 =
          A      C      T      G
    A   0.08   0.05   0.08   0.08
    C   0.08   0.03   0.06   0.02
    T   0.05   0.06   0.06   0.09
    G   0.08   0.06   0.04   0.07

Applying the fuzzy composition relation to R1 and R2, and then composing the result with R3, we get the final table for the Swine Flu class:

          A      C      T      G
    A   0.08   0.06   0.08   0.09
    C   0.08   0.06   0.08   0.09
    T   0.07   0.06   0.07   0.07
    G   0.08   0.06   0.08   0.08

Similarly, we can repeat the same procedure for another class, Staphylococcus. X1, X2 and X3 show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Staphylococcus.

X1 =
          A      C      T      G
    A   0.08   0.05   0.07   0.06
    C   0.06   0.05   0.06   0.08
    T   0.04   0.06   0.08   0.07
    G   0.08   0.09   0.03   0.05

X2 =
          A      C      T      G
    A   0.09   0.06   0.06   0.06
    C   0.07   0.05   0.06   0.07
    T   0.04   0.05   0.07   0.07
    G   0.07   0.09   0.04   0.05

X3 =
          A      C      T      G
    A   0.07   0.05   0.07   0.04
    C   0.05   0.06   0.06   0.08
    T   0.03   0.06   0.04   0.09
    G   0.06   0.09   0.06   0.07

Applying the fuzzy composition relation to these tables, we get the final table as

          A      C      T      G
    A   0.07   0.06   0.05   0.06
    C   0.08   0.09   0.07   0.06
    T   0.04   0.03   0.07   0.06
    G   0.05   0.08   0.04   0.02

Suppose we are given a sequence x = CGCG, which is to be classified into one of the given classes.

From the final fuzzy composition tables, the log-odds ratio of this sequence is:

$$\log \frac{0.09}{0.08} + \log \frac{0.06}{0.07} + \log \frac{0.09}{0.08} = 0.032270$$

Now, since this ratio is greater than 0, we can conclude that the input sequence x belongs to the class Avian Flu. If further classification of the data is needed, we then consult the individual transition probabilities for all three types.

CONCLUSION

In this paper we have used a probabilistic function based on the Markov property. We have applied this for probabilistic determination in the case of Avian flu virus and Staphylococcus. The paper also presented a way of identifying particular classes of genes or proteins. A given input sequence can belong to any one of the given classes. Using only a transition probability measure, one had to determine a value for each class even though the classes were similar. The paper presented a scheme in which the similar classes are merged by using the fuzzy composition relation, so that, instead of calculating each individual probability measure, one measure




is sufficient to depict all the similar classes. This measure is
further used in the log odds ratio to finally predict the class of
the input sequence.
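The merging step of the example can be sketched in a few lines of Python. This is a minimal illustration that reads the fuzzy composition relation as the standard max-min composition, where each entry is the maximum over the intermediate symbol of the minimum of the two table values; it is not the authors' code. The function name `max_min_compose`, the list-of-lists layout and the A, C, T, G ordering are assumptions made for the sketch. With the grouping (R1 o R2) o R3, it reproduces the final Swine Flu table given in the example.

```python
SYMBOLS = "ACTG"  # row and column order used in the tables


def max_min_compose(r, s):
    """Return the max-min composition of two square tables.

    Entry [i][j] is the maximum over k of min(r[i][k], s[k][j]).
    """
    n = len(SYMBOLS)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Type 1, Type 2 and Type 3 tables from the example.
R1 = [
    [0.13, 0.06, 0.09, 0.08],
    [0.09, 0.04, 0.05, 0.02],
    [0.06, 0.05, 0.05, 0.07],
    [0.08, 0.04, 0.05, 0.06],
]
R2 = [
    [0.13, 0.06, 0.09, 0.08],
    [0.08, 0.04, 0.04, 0.02],
    [0.06, 0.05, 0.06, 0.07],
    [0.08, 0.04, 0.04, 0.06],
]
R3 = [
    [0.08, 0.05, 0.08, 0.08],
    [0.08, 0.03, 0.06, 0.02],
    [0.05, 0.06, 0.06, 0.09],
    [0.08, 0.06, 0.04, 0.07],
]

# Compose R1 with R2 first, then compose the result with R3.
merged = max_min_compose(max_min_compose(R1, R2), R3)
for symbol, row in zip(SYMBOLS, merged):
    print(symbol, row)
```

Because `min` and `max` only select existing entries and never do arithmetic on them, the merged table contains values taken directly from the input tables.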

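The decision rule of Equation (4) can be sketched in the same way. The snippet below is an illustrative sketch under stated assumptions: the function name `log_odds` is hypothetical, the logarithm base is taken as 10 because the paper does not state it (the sign of the score, and hence the decision, does not depend on the base), and the begin-state term is omitted so that only consecutive symbol pairs contribute. Each term is looked up in the two final tables as printed in the example.

```python
import math

# Final merged tables as printed in the example.
# Assumed layout: outer key = previous symbol, inner key = next symbol.
PLUS = {  # first class (Swine Flu final table)
    "A": {"A": 0.08, "C": 0.06, "T": 0.08, "G": 0.09},
    "C": {"A": 0.08, "C": 0.06, "T": 0.08, "G": 0.09},
    "T": {"A": 0.07, "C": 0.06, "T": 0.07, "G": 0.07},
    "G": {"A": 0.08, "C": 0.06, "T": 0.08, "G": 0.08},
}
MINUS = {  # other class (Staphylococcus final table)
    "A": {"A": 0.07, "C": 0.06, "T": 0.05, "G": 0.06},
    "C": {"A": 0.08, "C": 0.09, "T": 0.07, "G": 0.06},
    "T": {"A": 0.04, "C": 0.03, "T": 0.07, "G": 0.06},
    "G": {"A": 0.05, "C": 0.08, "T": 0.04, "G": 0.02},
}


def log_odds(x, plus, minus):
    """Sum log10(plus / minus) over the consecutive symbol pairs of x.

    Assumptions: base-10 logarithm, and no begin-state term.
    """
    return sum(
        math.log10(plus[prev][cur] / minus[prev][cur])
        for prev, cur in zip(x, x[1:])
    )


score = log_odds("CGCG", PLUS, MINUS)
label = "+" if score > 0 else "-"
print(score, label)
```

A positive score assigns the sequence to the "+" class and a non-positive score to the "-" class, matching the rule stated after Equation (4).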
REFERENCES

    [1]      Anna Loanova, “Introduction to (log) odds ratio statistics and Methodology”, University of Groningen, 2008.
    [2]      C. E. Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, 1948.
    [3]      D. W. Mount, “Bioinformatics: Sequence and Genome Analysis”, 2nd edition, CSHL Press, 2004.
    [4]      Durbin, Eddy, Krogh, Mitchison, “Biological Sequence Analysis”, Cambridge University Press, 1998.
    [5]      L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.
    [6]      Lee, Kwang Hyung, “First Course on Fuzzy Theory and Applications”, Advances in Soft Computing, Vol. 27, Springer, 2005.
    [7]      Michael Hanss, “Applied Fuzzy Arithmetic: An Introduction with Engineering Applications”, Springer, 2005.
    [8]      T. Dewey and M. Herzel, “Application of Information Theory to Biology”, Pacific Symposium on Biocomputing, 5:597-598 (2000).
    [9]      W. J. Ewens, G. R. Grant, “Statistical Methods in Bioinformatics: An Introduction”, Vol. 13, 2nd edition, Springer.
    [10]     Y. Ephraim, L. R. Rabiner, “On the Relations Between Modeling Approaches for Speech Recognition”, IEEE Transactions on Information Theory, Vol. 36, No. 2, March 1990.
    [11]     A. Priyam, B. M. Karan, G. Sahoo, “A Probabilistic Model For Sequence Analysis”, International Journal of Computer Science and Information Security, Vol. 7, No. 1, (2010) 244-247.





				
DOCUMENT INFO
Description: IJCSIS is an open access publishing venue for research in general computer science and information security. Target Audience: IT academics, university IT faculties; industry IT departments; government departments; the mobile industry and computing industry. Coverage includes: security infrastructures, network security: Internet security, content protection, cryptography, steganography and formal methods in information security; computer science, computer applications, multimedia systems, software, information systems, intelligent systems, web services, data mining, wireless communication, networking and technologies, innovation technology and management. The average paper acceptance rate for IJCSIS issues is kept at 25-30% with an aim to provide selective research work of quality in the areas of computer science and engineering. Thanks for your contributions in September 2010 issue and we are grateful to the experienced team of reviewers for providing valuable comments.