Encoding Information for DNA Computing
Shinnosuke Seki
Purpose
 What is the purpose of encoding?

 To make a “good”, i.e. tractable, code set for DNA
  computing.

 To develop polynomial-time algorithms
  that decide whether a given code set is “good”
  or “bad”.
Claude Elwood Shannon
 The father of information
  theory (Shannon's entropy)
 Showed that Boolean algebra with binary
  arithmetic makes it possible to
  simplify circuits of electromechanical
  relays
 In “A mathematical theory of
  communication” [Sha48], he
  showed that information can be sent
  with an arbitrarily small error rate even over
  a noisy channel.
 Chess program using a minimax
  evaluation procedure
 etc.
 Shannon's information channel
 [Figure: sender → encoder → channel (capacity C, information flow R) → decoder → receiver; positive noise and negative noise act on the channel.]

  R > C: overflow.
  R ≤ C: the error rate can be made arbitrarily small.

  To attain R = C in the noisy channel, we need to find a
   “good” code.
Biological perspective
 Every biological reaction can be viewed as an information
  channel.
 Example: the case of heredity
 [Figure: parent → DNA → heredity → DNA → child, with natural selection and mutation drawn in the positions of positive and negative noise.]

 Over billions of years, has Mother Nature
  developed a wonderful code system?
 Biology -> Computer Science
Review: in vitro DNA computing
1.   Encode a given problem into single- or double-stranded
     DNAs (ssDNAs, dsDNAs).
2.   Compute by a succession of bio-operations.
3.   Decode the resulting solution and extract its output.
Review: Watson-Crick (WK) complementarity

 Hydrogen bonds form between A and T and between C and G.
 Two strands which are
    1. complementary to each other, and
    2. oppositely oriented
      can form a (complete) dsDNA.

    Example

       5' - A T C G G T C A A C T G C C C T A A T G - 3'
       3' - T A G C C A G T T G A C G G G A T T A C - 5'
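The two conditions above can be checked mechanically. The following is a minimal Python sketch; the function names are illustrative and not part of the original slides. It tests whether two strands, both written 5' to 3', form a complete dsDNA, using the example strands above.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Base-wise complement, keeping the reading direction."""
    return "".join(WK_PAIR[base] for base in strand)

def reverse_complement(strand):
    """Complement read in the opposite direction."""
    return complement(strand)[::-1]

def forms_complete_duplex(top_5to3, bottom_5to3):
    """True iff the strands are complementary and oppositely oriented."""
    return bottom_5to3 == reverse_complement(top_5to3)

top = "ATCGGTCAACTGCCCTAATG"          # upper strand of the example, 5' -> 3'
bottom_3to5 = "TAGCCAGTTGACGGGATTAC"  # lower strand as drawn, 3' -> 5'

# The lower strand is drawn 3' -> 5', so reverse it before the check.
print(forms_complete_duplex(top, bottom_3to5[::-1]))
```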
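Adleman's first trial
 Find a solution to the Hamiltonian path problem in a (chemical) solution,
  in time polynomial in the size of the input graph.
 The solution is filled with encoding oligonucleotides.

 [Figure: a directed graph on the vertices 1, 2, 3, 4 together with its encoding.]

   Vertex codes:   1 = ACG CTT    2 = ATA GAT    3 = CGG TTA    4 = ACT TAA
   Edge codes:     1 -> 2 = GAA TAT    2 -> 3 = CTA GCC    3 -> 4 = AAT TGA

   Each edge code is complementary to the second half of its source vertex
   code followed by the first half of its target vertex code.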
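As a small illustration of this encoding, the sketch below rebuilds the three edge codes from the four vertex codes shown on the slide. The helper name `edge_code` is hypothetical; only the sequences come from the slide.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    return "".join(WK_PAIR[base] for base in strand)

# Vertex codes from the slide (spaces removed).
VERTEX = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}

def edge_code(u, v):
    """Complement of (second half of u's code + first half of v's code)."""
    half = len(VERTEX[u]) // 2
    return complement(VERTEX[u][half:] + VERTEX[v][:half])

for u, v in [(1, 2), (2, 3), (3, 4)]:
    print(u, "->", v, edge_code(u, v))
```

An edge code therefore bridges the two vertex codes it connects, so that vertex codes of a path can be joined one after another.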
What's a good code set?
 No code word (oligonucleotide) should form an
  undesirable structure.
 [Figure: the code word ATA GAT of vertex 2 folded back on itself, with its first two and last two bases paired.]

 Such a structure may make the code word inert.
 Code words should not interact with each other in an
  undesirable way.
 Structure formation is governed by
    WK-complementarity
    Gibbs free energy
What's a good code set? (cont.)
 Uniform melting temperature
 Prevention of undesirable hybridizations
 Other constraints
    Avoiding repeated bases
    Forbidden subsequences
      When a restriction enzyme is used, its
       recognition site should appear only at intended sites
    Using only 3 types of nucleotides: A, C, T
Melting temperature

 The melting temperature Tm of a dsDNA is
    the temperature at which half of the dsDNA molecules are
     denatured.
    The higher Tm is, the more stable the dsDNA is.

      $$T_m = \frac{\Delta H}{R \ln(C_t / \alpha) + \Delta S}$$

        •   R: gas constant,
        •   Ct: total oligo concentration,
        •   ΔH and ΔS: enthalpy and entropy,
        •   α: 1 for self-complementary and 4 for non-self-complementary strands
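The formula can be evaluated directly once ΔH and ΔS are known. The sketch below is only an illustration: the unit choices (ΔH in cal/mol, ΔS in cal/(K·mol), Ct in mol/L, result in kelvin) and the numeric inputs are assumptions made for this example, not values from the slides or from measured data.

```python
import math

R = 1.987  # gas constant in cal/(K*mol); assumed unit system for this sketch

def melting_temperature(delta_h, delta_s, total_conc, self_complementary=False):
    """Tm in kelvin from Tm = dH / (R * ln(Ct / alpha) + dS)."""
    alpha = 1.0 if self_complementary else 4.0
    return delta_h / (R * math.log(total_conc / alpha) + delta_s)

# Hypothetical duplex: dH = -80 kcal/mol, dS = -220 cal/(K*mol), Ct = 1 uM.
tm = melting_temperature(-80000.0, -220.0, 1e-6)
print(round(tm - 273.15, 1), "degrees Celsius")
```

In practice ΔH and ΔS would themselves be estimated from the sequence, for example with a nearest-neighbor model.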
Nearest-neighbor method
Refer to [AlSa97], [TKY04] ([8], [9] in the table on this slide)
Melting temperature (cont.)
 Uniform melting temperature
    Making Tm uniform across the code words eliminates hybridization bias.


 GC content
    The ratio of the number of G's and C's to the total number of
     nucleotides in a sequence
    A G-C pair is more stable than an A-T pair.
    Higher GC content implies higher Tm.
    Sequences are typically designed with 50% GC content.
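GC content is simple to compute. A minimal sketch follows; the function name is illustrative, and the sample sequences are the codes of vertices 1 and 2 from Adleman's example earlier in these slides.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in sequence) / len(sequence)

print(gc_content("ACGCTT"))  # code of vertex 1
print(gc_content("ATAGAT"))  # code of vertex 2
```

The two codes differ in GC content, so under the rule of thumb above they would be expected to differ in Tm as well.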
Gibbs free energy (ΔG)
 A well-known indicator of the stability of DNA structures
    A structure with lower ΔG is more stable.
    The ΔG of an entire structure is the sum of the ΔG of its
     substructures [ZuSt81].
Secondary structures look like…
Template method [ArKo02]
 Prepare two bit sequences (a template and a map), each of which has some
  desirable property
    (e.g., 50% GC content, error correction).
 Using a conversion rule, construct one DNA sequence from
  these two bit sequences.
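A conversion of this kind can be sketched as follows. The concrete rule used here is an assumption for illustration (the slides do not spell it out): the template bit chooses between the classes {A, T} and {G, C}, and the map bit chooses the base within the class. With this rule the GC content of the result equals the fraction of 1's in the template.

```python
# Assumed conversion rule: (template bit, map bit) -> base.
RULE = {
    ("0", "0"): "A", ("0", "1"): "T",  # template 0: A/T position
    ("1", "0"): "C", ("1", "1"): "G",  # template 1: G/C position
}

def convert(template, map_word):
    """Combine a template and a map word of equal length into a DNA word."""
    if len(template) != len(map_word):
        raise ValueError("template and map word must have equal length")
    return "".join(RULE[(t, m)] for t, m in zip(template, map_word))

print(convert("110100", "101010"))
```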
Template method (cont.)

 Design criteria
    Template
      A template x should have at least d mismatches
       with each of x^R, xx, x^R x^R, x x^R, x^R x (x^R: the reversal of x).
      Good templates are found by exhaustive search.
    Map (error-correcting code)
      A code whose words pairwise have at least k mismatches.
      e.g. a BCH code
 Drawback
    It cannot prevent sequences from forming secondary
     structures.
AG-templates, GC-templates [KKA03]
 GC-template
    The template contains the
     same number of 0's and 1's
     (50% GC content).
    The map is an error-correcting
     code.
 AG-template
    The map is a constant-weight
     code (50% GC content).
    Results in a larger set of
     sequences.
Other approaches
 DNASequenceGenerator [FBR00]
   Software with a GUI
   Creates sequences with a prescribed melting temperature and
    GC content, and without palindromes, start codons, or restriction
    sites.
Other approaches (cont.)
 Suyama's approach [YoSu00]
    Generate sequences randomly and add each one to the
     sequence set iff it satisfies all of the following
     constraints:
      Uniform melting temperature
      No mis-hybridization
      No formation of stable secondary structure
    Drawback: it easily falls into local optima.
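The generate-and-test loop can be sketched as below. This is not the procedure of [YoSu00]: the three constraints are replaced by crude stand-ins chosen for this example (exactly 50% GC content in place of uniform Tm, and a minimum Hamming distance to every accepted word and to reverse complements in place of the hybridization and structure checks). All names and parameters are hypothetical.

```python
import random

WK = str.maketrans("ACGT", "TGCA")

def revcomp(word):
    return word.translate(WK)[::-1]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def acceptable(cand, code_set, min_dist):
    """Stand-in constraints; a real design would use Tm and free energy."""
    if 2 * sum(b in "GC" for b in cand) != len(cand):  # 50% GC content
        return False
    if hamming(cand, revcomp(cand)) < min_dist:        # self-hybridization
        return False
    return all(hamming(cand, w) >= min_dist and
               hamming(cand, revcomp(w)) >= min_dist for w in code_set)

def generate_code_set(n_words, length, min_dist, max_tries=100000, seed=0):
    rng = random.Random(seed)
    code_set = []
    for _ in range(max_tries):
        if len(code_set) == n_words:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if acceptable(cand, code_set, min_dist):       # add iff all hold
            code_set.append(cand)
    return code_set

words = generate_code_set(n_words=4, length=8, min_dist=3)
print(words)
```

Because accepted words are never revised, an unlucky early choice can block later candidates, which is one way to read the drawback noted above.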
Other approaches (cont.)
 Hybrid randomized neighborhoods [TuHo03]
    A stochastic local search (SLS) algorithm
    With probability ε, it moves to a neighbor obtained by randomly
     mutating the current best sequences.
    With probability 1-ε, it moves in the direction in which the number of
     constraint conflicts decreases the most.
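The ε-mixture of random and greedy moves can be sketched as follows. This is a simplified illustration, not the algorithm of [TuHo03]: the only constraint counted is a minimum pairwise Hamming distance, a neighbor is a code set with one base changed, and the best set seen so far is returned. All names and parameters are assumptions of this sketch.

```python
import random
from itertools import combinations

BASES = "ACGT"

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def conflicts(words, min_dist):
    """Number of word pairs closer than min_dist (stand-in constraint)."""
    return sum(hamming(a, b) < min_dist for a, b in combinations(words, 2))

def neighbors(words, i):
    """All code sets obtained by changing one base of word i."""
    for pos in range(len(words[i])):
        for base in BASES:
            if base != words[i][pos]:
                mutated = words[i][:pos] + base + words[i][pos + 1:]
                yield words[:i] + [mutated] + words[i + 1:]

def local_search(words, min_dist, eps=0.2, steps=200, seed=0):
    rng = random.Random(seed)
    current, best = list(words), list(words)
    for _ in range(steps):
        if conflicts(best, min_dist) == 0:
            break
        i = rng.randrange(len(current))
        candidates = list(neighbors(current, i))
        if rng.random() < eps:
            current = rng.choice(candidates)  # random move, probability eps
        else:                                 # greedy move, probability 1 - eps
            current = min(candidates, key=lambda c: conflicts(c, min_dist))
        if conflicts(current, min_dist) < conflicts(best, min_dist):
            best = list(current)
    return best

start = ["AAAAAA", "AAAAAT", "AAAATT", "CCCCCC"]
result = local_search(start, min_dist=3)
print(conflicts(start, 3), "->", conflicts(result, 3))
```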
Other approaches (cont.)
 GA (genetic algorithm)-based approach [ANH00]
    Uses a GA in which the fitness of candidate solutions is evaluated
    Fitness criteria
      Restriction sites
      GC content
      Hamming distance
      Repetition of the same base
Other approaches (cont.)
 Gibbs free energy-based approach
    Takes thermodynamics into consideration
    Uses Gibbs free energy as a stability measure
    Advantage
      Greater accuracy because it takes into account the
       stability of loops and the stacking between base pairs
    Disadvantage
      More computation time is needed to calculate the free energy
    How can this computational cost be decreased?
    See [TKY05], [KNO08]
A formal language approach
 Design a set of structure-free codes in terms of
  WK-complementarity.
 Advantage
   More reliable codes than with the free-energy approach
   More efficient algorithms for decision problems
 Disadvantage
   Each structure needs to be considered separately.
A formal language approach (cont.)
 Abstraction of concepts
    {A, C, G, T} → an alphabet Σ,
    WK-complementarity → an antimorphic involution θ
      Involution
       • A mapping θ s.t. θ² is the identity (symmetry).
      Antimorphism
       • θ(xy) = θ(y)θ(x) (opposite direction).


    e.g. θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA

                  5' - TCATCCGATTTCGGG - 3'
                  3' - AGTAGGCTAAAGCCC - 5'
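For the DNA alphabet, θ is the reverse complement. The short sketch below (the function name `theta` simply mirrors the symbol on the slide) checks the example above together with the two defining properties on sample words.

```python
WK = str.maketrans("ACGT", "TGCA")

def theta(word):
    """WK-complementarity as an antimorphic involution: reverse complement."""
    return word.translate(WK)[::-1]

x, y = "TCATCC", "GATTTCGGG"

print(theta(x + y))                         # the example from the slide
print(theta(theta(x + y)) == x + y)         # involution: theta^2 is identity
print(theta(x + y) == theta(y) + theta(x))  # antimorphism
```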
Bond-free properties [KKS05]

 θ-non-overlapping: $L \cap \theta(L) = \emptyset$

 θ-compliant: $\forall w \in L,\ x, y \in \Sigma^*:\ (w,\ x\theta(w)y \in L) \Rightarrow xy = \lambda$
    (λ denotes the empty word)

    Strictly (a): a property (a) together with θ-non-overlapping
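For a finite code set, these two properties can be tested by brute force. The sketch below is an illustration for finite sets of DNA words only, with θ taken to be the reverse complement; it is not the automaton-based decision procedure of [KKS05], and the function names are hypothetical.

```python
WK = str.maketrans("ACGT", "TGCA")

def theta(word):
    return word.translate(WK)[::-1]

def is_nonoverlapping(lang):
    """L and theta(L) share no word."""
    return all(theta(w) not in lang for w in lang)

def is_compliant(lang):
    """No theta(w), w in L, occurs inside a strictly longer word of L."""
    for w in lang:
        tw = theta(w)
        if any(tw in u and u != tw for u in lang):
            return False
    return True

def is_strictly_compliant(lang):
    return is_nonoverlapping(lang) and is_compliant(lang)

print(is_strictly_compliant({"AAC", "ACC"}))
print(is_compliant({"AACC", "GGTT"}), is_nonoverlapping({"AACC", "GGTT"}))
print(is_compliant({"AACC", "TTTGGTTA"}))
```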
Bond-free properties [KKS05] (cont.)
 θ-p-compliant: $\forall w \in L,\ y \in \Sigma^*:\ (w,\ \theta(w)y \in L) \Rightarrow y = \lambda$

 θ-s-compliant: $\forall w \in L,\ x \in \Sigma^*:\ (w,\ x\theta(w) \in L) \Rightarrow x = \lambda$
Bond-free properties [KKS05] (cont.)
 θ-free: $L^2 \cap \Sigma^+ \theta(L) \Sigma^+ = \emptyset$

 θ-sticky-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ (wx,\ y\theta(w) \in L) \Rightarrow xy = \lambda$
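Bond-free properties [KKS05] (cont.)

 θ-3'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ (wx,\ \theta(w)y \in L) \Rightarrow xy = \lambda$

 θ-5'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ (xw,\ y\theta(w) \in L) \Rightarrow xy = \lambda$

 θ-overhang-free: both of these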
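The sticky-end and overhang properties can likewise be tested by brute force on a finite code set. The sketch assumes that w ranges over all nonempty words, that θ is the reverse complement, and that the language is a finite set of DNA words; the function names are hypothetical.

```python
WK = str.maketrans("ACGT", "TGCA")

def theta(word):
    return word.translate(WK)[::-1]

def prefixes(u):
    return [u[:i] for i in range(1, len(u) + 1)]   # nonempty prefixes

def suffixes(u):
    return [u[i:] for i in range(len(u))]          # nonempty suffixes

def _free(lang, parts_of, matches):
    """No nonempty w with u1 = (w with x) and u2 = (theta(w) with y), xy != empty."""
    for u1 in lang:
        for w in parts_of(u1):
            tw = theta(w)
            for u2 in lang:
                if matches(u2, tw) and not (u1 == w and u2 == tw):
                    return False
    return True

def is_sticky_free(lang):        # wx, y theta(w) in L  =>  xy empty
    return _free(lang, prefixes, str.endswith)

def is_3_overhang_free(lang):    # wx, theta(w) y in L  =>  xy empty
    return _free(lang, prefixes, str.startswith)

def is_5_overhang_free(lang):    # xw, y theta(w) in L  =>  xy empty
    return _free(lang, suffixes, str.endswith)

def is_overhang_free(lang):
    return is_3_overhang_free(lang) and is_5_overhang_free(lang)

print(is_sticky_free({"AAC"}), is_sticky_free({"AAC", "CGTT"}))
print(is_overhang_free({"AAC"}), is_3_overhang_free({"AAC", "GTTA"}))
```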
Decidability [KKS05]
 Theorem
    The following problem is decidable in time quadratic
     in |A|:
      Input: an NFA A,
      Output: Yes/No depending on whether L(A) satisfies
       a given one of the following properties (or its strict version):
       • θ-compliant, θ-p-compliant, θ-s-compliant,
       • θ-sticky-free,
       • θ-3'-overhang-free, θ-5'-overhang-free, θ-overhang-free.
Decidability and maximality [KKS05]
 Theorem
   Let M be a regular language and let L be a regular subset
    of M with a property ρ, where
     ρ is one of the following:
      •   θ-compliant,
      •   θ-p-compliant,
      •   θ-s-compliant, or
      •   θ-sticky-free.
   Then it is decidable whether L is a maximal subset of
    M satisfying ρ.
Secondary structure prevention
 Secondary structures:
   Hairpin-loop (or simply hairpin)
   Internal loop
   Multiple-branch loop
   Pseudoknot
 They can be undesirable
    e.g. for Adleman's encoding technique for the Hamiltonian
     Path Problem (HPP).
Secondary Structures
 [Figure: three secondary structures drawn on strands with marked 5' and 3' ends: a hairpin, a hairpin frame (multiple loop), and an internal loop.]
Hairpin-free language
 A formal model of a hairpin: x v y θ(v) z.


          TAA---ACG---CGTTA---CGT---CGGT
           x     v      y     θ(v)    z

 Hairpin freeness
    Intuitively, it is almost impossible to prevent hairpins of
     short stack length (say 2 or 3).
    Our aim is to prevent any hairpin of stack length at least
     a given parameter k.
Hairpin-free language [KKL06]
 A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff
      $w = x v y \theta(v) z \Rightarrow |v| < k.$

 hpf(θ, k): the set of all hp(θ, k)-free words in Σ*
 hp(θ, k): Σ* - hpf(θ, k).

 A language L is called (θ, k)-hairpin-free iff

      $L \subseteq \mathrm{hpf}(\theta, k).$
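For a single word, or a finite set of words, hairpin-freeness can be tested directly from the definition. The sketch below assumes k ≥ 1 and takes θ to be the reverse complement. It only looks for stems of length exactly k, which suffices because a longer stem v contains a prefix of length k whose image under θ is a suffix of θ(v). Function names are illustrative.

```python
WK = str.maketrans("ACGT", "TGCA")

def theta(word):
    return word.translate(WK)[::-1]

def is_hairpin_free(word, k):
    """True iff word has no factorization x v y theta(v) z with |v| >= k."""
    for i in range(len(word) - k + 1):
        v = word[i:i + k]
        if theta(v) in word[i + k:]:  # theta(v) occurs after v, no overlap
            return False
    return True

def is_hairpin_free_language(lang, k):
    """A finite language is hp(theta, k)-free iff all its words are."""
    return all(is_hairpin_free(w, k) for w in lang)

# Example from the slides: x = TAA, v = ACG, y = CGTTA, theta(v) = CGT, z = CGGT.
w = "TAAACGCGTTACGTCGGT"
print(is_hairpin_free(w, 3), is_hairpin_free(w, 10))
```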
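Regularity of hairpin languages
 $\mathrm{hp}(\theta, k) = \bigcup_{|w| \ge k} \Sigma^* w \Sigma^* \theta(w) \Sigma^*$
    A stem longer than k contains a stem of length exactly k, so the
     union can be restricted to the finitely many w with |w| = k.
 [Figure: a word of the form Σ* w Σ* θ(w) Σ*.]

 hp(θ, k) and hpf(θ, k) are regular.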

 Hence there exists a finite
  automaton M s.t. L(M) = hpf(θ, k).
Hairpin Freedom Problems
 Hairpin-Freedom problem
     Input: a nondeterministic finite automaton (NFA) M,
     Output: Y/N depending on whether L(M) is hp(θ, k)-free.


 Maximal Hairpin-Freedom problem
     Input: a deterministic finite automaton M1 and an NFA M2,
     Output: Y/N depending on whether there is a word
     $w \in L(M_2) \setminus L(M_1)$ s.t. $L(M_1) \cup \{w\}$ is hp(θ, k)-free.
Decidability
 The hairpin-freedom problem for regular languages is
  decidable in $O(|M|)$ time.

 The maximal hairpin-freedom problem for regular
  languages is decidable in $O(|M_1| \cdot |M_2|)$ time.
Hairpin Frames
 Also known as multiple loops
 hp-frame of degree n:
       $x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$

 [Figure: an example of an hp-frame of degree 3.]
 A word u is an hp(fr, j)-word if it
  contains an hp-frame of
  degree j.
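Whether a word contains an hp-frame of degree j can be tested greedily: cut the word at the earliest position where a hairpin is complete, then repeat on the remainder. The sketch below assumes that every stem must have length at least k (the slide does not state the stem length for frames) and that θ is the reverse complement. Function names are hypothetical.

```python
WK = str.maketrans("ACGT", "TGCA")

def theta(word):
    return word.translate(WK)[::-1]

def has_hairpin(word, k):
    """True iff word = x v y theta(v) z for some v with |v| >= k."""
    return any(theta(word[i:i + k]) in word[i + k:]
               for i in range(len(word) - k + 1))

def frame_degree(word, k):
    """Largest n such that word contains an hp-frame of degree n."""
    count, start = 0, 0
    for end in range(1, len(word) + 1):
        if has_hairpin(word[start:end], k):  # earliest complete hairpin
            count += 1
            start = end                      # next hairpin starts after it
    return count

def is_hp_fr_word(word, j, k):
    return frame_degree(word, k) >= j

unit = "ACGTTTCGT"  # v = ACG, y = TTT, theta(v) = CGT
print(frame_degree(unit, 3), frame_degree(unit * 3, 3))
```

Cutting at the earliest possible position leaves the longest remainder for the following hairpins, which is why the greedy count is maximal.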
Regularity & decidability
 hp(θ, fr, j): the set of all hp(fr, j)-words in Σ*
 hpf(θ, fr, j): its complement in Σ*

 The languages hp(θ, fr, j) & hpf(θ, fr, j) are regular.

 The hp(fr, j)-freedom problem is decidable in linear
  time.
 The maximal hp(fr, j)-freedom problem is decidable
  in $O(|M_1| \cdot |M_2|)$ time.
Application: DNA-HRAMs
 [Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- folded into a hairpin (state 0) and opened into a linear strand (state 1); "opening" and "closing" switch between the two states.]
 An n-bit DNA-HRAM consists of n hairpins.
 Each hairpin stores 1 bit of information by forming or
  opening a hairpin as shown above.
n-bit DNA-HRAM
 A concatenation of n 1-bit RAMs, which is equivalent to an
  hp-frame of degree n:
             $x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$
 In order for this word to work as an n-bit RAM, the following
  word should be hp(θ, 20)-free:

             $x_1 v_1 y_1 z_1 \cdots x_n v_n y_n z_n$
 A DNA memory with 4 hairpins was proposed in [KYO08].
References

 [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal
  G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
 [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates.
  New Generation Computing 20 (2002) 263-277
 [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H.,
  Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic
  and Evolutionary Computation Conference (2000) 875-882
 [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler.
  Proc. DNA6 (2000)
 [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds
  between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191
 [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A
  formal language analysis of DNA hairpin structures. Fundamenta
  Informaticae 71 (2006) 453-475
 [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA
  sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
References (cont.)

 [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.:
  Speeding up local-search type algorithms for designing DNA sequences
  under thermodynamical constraints. Proc. DNA14 (2008) 152-161
 [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.:
  Unravel four hairpins! Natural Computing 7 (2008) 287-298
 [RFL01] Ruben, A. J., Freeland, S. J., Landweber, L. F.: PUNCH: An
  evolutionary algorithm for optimizing bit set selection. DNA7 (2001) 150-160
 [Sha48] Shannon, C.E.: A mathematical theory of communication. Bell
  System Technical Journal 27 (1948) 379-423, 623-656
 [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.:
  Thermodynamic parameters based on a nearest-neighbor model for DNA
  sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
 [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of
  nucleic acid sequences for DNA computing based on a thermodynamic
  approach. Nucleic Acids Res. 33(3) (2005) 903-911
References (cont.)

 [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve
  stochastic local search for DNA code design. In Advances in Artificial
  Intelligence: 16th Conference of the Canadian Society for Computational
  Studies of Intelligence, 2671 (2003) 418-433
 [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search.
  Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
 [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA
  sequences using thermodynamics and auxiliary information. Nucleic Acids
  Res. 9(1) (1981) 133-148

								