Posted on 3/5/2010. Public Domain.
Encoding Information for DNA Computing
Shinnosuke Seki

Purpose
• What is the advantage of encoding? To make a "good", or tractable, code set for DNA computing.
• Develop polynomial-time algorithms that decide whether a given code set is "good" or "bad".

Claude Elwood Shannon
• The father of information theory (Shannon's entropy).
• Boolean algebra with binary arithmetic makes it possible to simplify electromechanical relay circuits.
• In "A mathematical theory of communication" [Sha48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel.
• Chess program using a minimax evaluation procedure, etc.

Shannon's information channel
[Figure: sender -> encoder -> channel of capacity C, subject to positive and negative noise -> decoder -> receiver; information flows at rate R.]
• R > C: overflow.
• R ≤ C: we can make the error rate as small as we like.
• To attain R = C on a noisy channel, we need to find a "good" code.

Biological perspective
• Every biological reaction can be viewed as an information channel.
• Example: the case of heredity.
[Figure: parent DNA -> heredity -> child DNA, subject to natural selection and mutation.]
• Over billions of years, has Mother Nature developed a wonderful code system?
• Biology -> Computer Science

Review: in vitro DNA computing
1. Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs).
2. Compute by a succession of bio-operations.
3. Decode the resulting solution and extract its output.

Review: WK-complementarity
• Hydrogen bonds form between A and T and between C and G.
• Two strands which are
  1. complementary to each other, and
  2. oriented in opposite directions
  can form a (complete) dsDNA.
• Example:
  5' - A T C G G T C A A C T G C C C T A A T G - 3'
  3' - T A G C C A G T T G A C G G G A T T A C - 5'

Adleman's first trial
• Find a solution of the Hamiltonian path problem in a solution (test tube), in time polynomial in the size of the input graph.
• The solution is filled with encoding oligonucleotides.
[Figure: directed graph on the vertices 1, 2, 3, 4.]
  Vertex   Oligonucleotide
  1        ACG CTT
  2        ATA GAT
  3        CGG TTA
  4        ACT TAA

  Edge     Oligonucleotide
  1 -> 2   GAA TAT
  2 -> 3   CTA GCC
  3 -> 4   AAT TGA

What's a good code set?
• Each code word (oligonucleotide) shouldn't form any undesirable structure.
[Figure: code word 2 (ATA GAT) folding back on itself, ATA against GAT.] This may make the code word inert.
• Code words shouldn't interact with each other in an undesirable way.
• Structure formation is governed by
  • WK-complementarity
  • Gibbs free energy

What's a good code set? (cont.)
• Uniform melting temperature
• Preventing undesirable hybridizations
• Other constraints
  • Avoiding repeated bases
  • Forbidden subsequences
  • When a restriction enzyme is used, its recognition site should appear only at intended sites
  • Using only 3 types of nucleotides: A, C, T

Melting temperature
• The melting temperature $T_m$ of a dsDNA is the temperature at which half of the dsDNAs are denatured.
• The higher $T_m$ is, the more stable the dsDNA is.

$$T_m = \frac{\Delta H}{R \ln(C_t / \alpha) + \Delta S}$$

• R: gas constant
• $C_t$: total oligo concentration
• ΔH and ΔS: enthalpy and entropy
• α: 1 for self-complementary strands and 4 for non-self-complementary strands
• Nearest-neighbor method: refer to [AlSa97], [TKY04] ([8], [9] in this table)

Melting temperature (cont.)
• Uniform melting temperature
  • Making $T_m$ uniform across code words can eliminate hybridization bias.
• GC content
  • The ratio of the number of G's and C's to the total number of nucleotides in a sequence.
  • A G-C pair is more stable than an A-T pair.
  • Higher GC content implies higher $T_m$.
  • Sequences are designed with 50% GC content.

Gibbs free energy (ΔG)
• A well-known indicator of stability for DNA structures.
• A structure with lower ΔG is more stable.
• The ΔG of an entire structure is the sum of the ΔG of its substructures [ZuSt81].
• Secondary structures look like…

Template method [ArKo02]
• Prepare two bit sequences, each of which has some desirable property (e.g., 50% GC content, error correction).
• Using a conversion rule, construct a DNA sequence from these two sequences.

Template method (cont.)
• Design criteria
  • Template: an element x should have at least d mismatches with $x^R$, $xx$, $x^R x^R$, $x x^R$, and $x^R x$. A good template is found by exhaustive search.
  • Map (error-correcting code): a code whose words have at least k mismatches with each other, e.g. a BCH code.
• Drawback
  • It cannot prevent sequences from forming secondary structures.
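The GC-content definition and the melting-temperature formula from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the nearest-neighbor method itself: the function names are hypothetical, ΔH and ΔS are assumed to be supplied by the caller (for example from nearest-neighbor tables such as those in [AlSa97]), and the units are assumed to be cal/mol for ΔH, cal/(K·mol) for ΔS, and mol/L for the concentration.

```python
import math

# Gas constant in cal/(K*mol); matches the assumed units of delta_h and delta_s.
R_GAS = 1.987


def gc_content(seq):
    """Fraction of G and C nucleotides in a non-empty DNA sequence."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def melting_temperature(delta_h, delta_s, total_conc, self_complementary=False):
    """Melting temperature in kelvin: T_m = dH / (R ln(C_t / alpha) + dS).

    delta_h: enthalpy in cal/mol (assumed to be supplied by the caller)
    delta_s: entropy in cal/(K*mol) (assumed to be supplied by the caller)
    total_conc: total oligo concentration C_t in mol/L
    """
    alpha = 1 if self_complementary else 4
    return delta_h / (R_GAS * math.log(total_conc / alpha) + delta_s)


if __name__ == "__main__":
    # Vertex 1 of the Adleman example has exactly 50% GC content.
    print(gc_content("ACGCTT"))
    # Hypothetical thermodynamic values, chosen only to exercise the formula.
    print(melting_temperature(-70000.0, -200.0, 1e-6))
```

A code-set filter along these lines would keep only the words whose GC content is 0.5 and whose melting temperatures fall within a narrow common band.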
AG-templates, GC-templates [KKA03]
• GC-template
  • The template contains the same number of 0's and 1's (50% GC content).
  • The map is an error-correcting code.
• AG-template
  • The map is a constant-weight code (50% GC content).
  • Results in a bigger set of sequences.

Other approaches
• DNASequenceGenerator [FBR00]
  • Software with a GUI.
  • Creates sequences with a specified melting temperature and GC content, and without palindromes, start codons, or restriction sites.

Other approaches
• Suyama's approach [YoSu00]
  • Generate sequences randomly; add a sequence to the sequence set iff it satisfies all of the following constraints:
    • Uniform melting temperature
    • No mis-hybridization
    • No formation of stable secondary structure
  • Drawback: it easily falls into local optima.

Other approaches
• Hybrid randomized neighborhoods [TuHo03]
  • A stochastic local search (SLS) algorithm.
  • With probability ε, it searches neighbors by randomly mutating the current best sequences.
  • With probability 1-ε, it moves in the direction in which the number of constraint conflicts decreases the most.

Other approaches
• GA (genetic algorithm)-based approach [ANH00]
  • Uses GAs, evaluating the fitness of solutions by the following criteria:
    • Restriction sites
    • GC content
    • Hamming distance
    • Same-base repetition

Other approaches
• Gibbs free energy based approach
  • Takes thermodynamics into consideration: Gibbs free energy as a stability measure.
  • Advantage: greater accuracy, because it takes into account the stability of loops and of stacking between base pairs.
  • Disadvantage: more computation time to calculate free energy.
  • How can this computational cost be decreased? See [TKY05], [KNO08].

A formal language approach
• Design a set of structure-free codes in terms of WK-complementarity.
• Advantages
  • More reliable codes than the free-energy approach
  • More efficient algorithms for decision problems
• Disadvantage
  • Each structure needs to be considered separately.

A formal language approach (cont.)
• Abstraction of concepts
  • {A, C, G, T} → an alphabet V
  • WK-complementarity → an antimorphic involution
Involution
• A mapping θ s.t. $\theta^2$ is the identity (symmetry).
Antimorphism
• θ(xy) = θ(y)θ(x) (opposite direction).
• e.g. θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA
  5' - TCATCCGATTTCGGG - 3'
  3' - AGTAGGCTAAAGCCC - 5'

Bond-free properties [KKS05]
• θ-non-overlapping: $L \cap \theta(L) = \emptyset$
• θ-compliant: $\forall w \in L,\ x, y \in \Sigma^*:\ w,\ x\theta(w)y \in L \Rightarrow xy = \lambda$
• Strictly (a): property (a) together with θ-non-overlapping

Bond-free properties [KKS05]
• θ-p-compliant: $\forall w \in L,\ y \in \Sigma^*:\ w,\ \theta(w)y \in L \Rightarrow y = \lambda$
• θ-s-compliant: $\forall w \in L,\ x \in \Sigma^*:\ w,\ x\theta(w) \in L \Rightarrow x = \lambda$

Bond-free properties [KKS05]
• θ-free: $L^2 \cap \Sigma^+ \theta(L) \Sigma^+ = \emptyset$
• θ-sticky-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ wx,\ y\theta(w) \in L \Rightarrow xy = \lambda$

Bond-free properties [KKS05]
• θ-3'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ wx,\ \theta(w)y \in L \Rightarrow xy = \lambda$
• θ-5'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*:\ xw,\ y\theta(w) \in L \Rightarrow xy = \lambda$
• θ-overhang-free: both of these

Decidability [KKS05]
Theorem: the following problem is decidable in quadratic time w.r.t. |A|.
• Input: an NFA A.
• Output: Yes/No depending on whether L(A) satisfies any of the following properties (or their strict versions):
  • θ-compliant, θ-p-compliant, θ-s-compliant,
  • θ-sticky-free,
  • θ-3'-overhang-free, θ-5'-overhang-free, θ-overhang-free.

Decidability and maximality [KKS05]
Theorem: let M be a regular language and let L be a regular subset of M with a property ρ, where ρ is one of the following:
• θ-compliant,
• θ-p-compliant,
• θ-s-compliant, or
• θ-sticky-free.
Then it is decidable whether L is a maximal subset of M satisfying ρ.

Secondary structure prevention
• Secondary structures:
  • Hairpin loop (or simply hairpin)
  • Internal loop
  • Multiple-branch loop
  • Pseudoknot
• They can be undesirable, e.g. for Adleman's encoding technique for the Hamiltonian Path Problem (HPP).

Secondary Structures
[Figure: a hairpin, a hairpin frame (multiple loop), and an internal loop, drawn on strands running 5' to 3'.]

Hairpin-free language
• A formal model of a hairpin: x v y θ(v) z.
  TAA---ACG---CGTTA---CGT---CGGT
   x     v      y     θ(v)    z

Hairpin freeness
• Intuitively, it's almost impossible to prevent hairpins of short stack length (say 2 or 3).
• Our aim is to prevent any hairpin whose stack length is at least some given parameter k.

Hairpin-free language [KKL06]
A word w is (θ, k)-hairpin-free (abbr.
hp(θ, k)-free) iff $w = x v y \theta(v) z$ implies $|v| < k$.
• hpf(θ, k): the set of all hp(θ, k)-free words in Σ*
• hp(θ, k): Σ* - hpf(θ, k)
• A language L is called (θ, k)-hairpin-free iff $L \subseteq hpf(\theta, k)$.

Regularity of hairpin languages

$$hp(\theta, k) = \bigcup_{|w| = k} \Sigma^* w \Sigma^* \theta(w) \Sigma^*$$

[Figure: a word of the form Σ* w Σ* θ(w) Σ*.]
• hp(θ, k) and hpf(θ, k) are regular.
• In particular, there exists a finite automaton M s.t. L(M) = hpf(θ, k).

Hairpin Freedom Problems
• Hairpin-Freedom problem
  • Input: a nondeterministic automaton M.
  • Output: Y/N depending on whether L(M) is hp(θ, k)-free.
• Maximal Hairpin-Freedom problem
  • Input: a deterministic automaton $M_1$ and an NFA $M_2$.
  • Output: Y/N depending on whether there is a word $w \in L(M_2) \setminus L(M_1)$ s.t. $L(M_1) \cup \{w\}$ is hp(θ, k)-free.

Decidability
• The hairpin-freedom problem for regular languages is decidable in $O(|M|)$ time.
• The maximal hairpin-freedom problem for regular languages is decidable in $O(|M_1| \cdot |M_2|)$ time.

Hairpin Frames
• So-called multiple loops.
• hp-frame of degree n: $x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$
[Figure: an example of an hp-frame of degree 3.]
• A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.

Regularity & decidability
• hp(θ, fr, j): the set of all hp(fr, j)-words in Σ*
• hpf(θ, fr, j): its complement in Σ*
• The languages hp(θ, fr, j) and hpf(θ, fr, j) are regular.
• The hp(fr, j)-freedom problem is decidable in linear time.
• The maximal hp(fr, j)-freedom problem is decidable in $O(|M_1| \cdot |M_2|)$ time.

Application: DNA-HRAMs
[Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- in its open form and closed into a hairpin, the two states labelled 0 and 1 and the transitions labelled opening and closing.]
• An n-bit DNA-HRAM consists of n hairpins.
• Each hairpin stores 1 bit of information by forming and opening a hairpin, as shown above.

n-bit DNA-HRAM
• A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n:
  $x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$
• In order for this word to work as an n-bit RAM, the following subword should be hp(θ, 20)-free:
  $x_1 v_1 y_1 z_1 \cdots x_n v_n y_n z_n$
• A DNA memory with 4 hairpins was proposed in [KYO08].
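The antimorphic involution θ and the hp(θ, k)-freeness condition for a single word can be sketched as follows. This is a brute-force check on one word under the assumption that θ is the Watson-Crick reverse complement over {A, C, G, T}; it is not the automaton-based decision procedure of [KKL06], and the names `theta` and `is_hairpin_free` are hypothetical. It relies on the observation that a word contains some x v y θ(v) z with |v| ≥ k exactly when it contains one with |v| = k.

```python
# Watson-Crick complement table: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def theta(word):
    """Reverse complement: an antimorphic involution on {A, C, G, T}*."""
    return word.translate(COMPLEMENT)[::-1]


def is_hairpin_free(word, k):
    """Return True iff word cannot be written as x v y theta(v) z with |v| >= k.

    It suffices to test every factor v of length exactly k and look for
    theta(v) at a position that does not overlap v.
    """
    n = len(word)
    for i in range(n - 2 * k + 1):
        v = word[i:i + k]
        if word.find(theta(v), i + k) != -1:
            return False
    return True


if __name__ == "__main__":
    # The example from the slides: theta(TCATCCGATTTCGGG) = CCCGAAATCGGATGA.
    print(theta("TCATCCGATTTCGGG"))
    # The hairpin model TAA-ACG-CGTTA-CGT-CGGT contains a stack of length 3.
    print(is_hairpin_free("TAAACGCGTTACGTCGGT", 3))
```

Checking a finite code set this way costs roughly quadratic time per word; for regular languages given by automata, the linear-time result quoted above is the relevant one.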
References
[AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
[ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277
[ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882
[FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000)
[KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191
[KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475
[KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
[KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161
[KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298
[RFL01] Ruben, A.J., Freeland, S.J., Landweber, L.F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. DNA7 (2001) 150-160
[Sha48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
[TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
[TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911
[TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433
[YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
[ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148