# Encoding Information for DNA computing by dkw12103

```
Encoding Information for DNA computing
Shinnosuke Seki
Purpose
• What is the advantage of encoding?
• To build a "good" (tractable) code set for DNA computing.
• To develop polynomial-time algorithms that decide whether a given code set is "good".
Claude Elwood Shannon
• The father of information theory (Shannon's entropy)
• Showed that Boolean algebra with binary arithmetic makes it possible to simplify circuits of electromechanical relays
• In "A mathematical theory of communication" [Sha48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel.
• Chess program using a minimax evaluation procedure
• etc. …
Shannon's information channel
[Figure: an information flow R passing through a channel of capacity C that is subject to positive and negative noise]
• R > C: overflow
• R ≤ C: the error rate can be made arbitrarily small.
• To attain R = C in the noisy channel, we need to find a 'good' code.
Biological perspective
• Every biological reaction can be modeled as an information channel.
Example: the case of heredity
[Figure: parent DNA -> heredity -> child DNA, with natural selection and mutation acting on the channel]
• Over billions of years, has Mother Nature developed a wonderful code system?
• Biology -> Computer Science
Review: in vitro DNA computing
1. Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs).
2. Compute by a succession of bio-operations.
3. Decode the resulting solution and extract its output.
Review: WK-complementarity
• Hydrogen bonds: A-T, C-G
• Two strands which are
  1. complementary to each other
  2. oriented in opposite directions
  can form a (complete) dsDNA.
• Example
  5' - A T C G G T C A A C T G C C C T A A T G - 3'
  3' - T A G C C A G T T G A C G G G A T T A C - 5'

• Find a solution of the Hamiltonian path problem in a (test-tube) solution, in time polynomial in the size of the input graph.
• The solution is filled with encoding oligonucleotides.

[Figure: directed graph on the vertices 1, 2, 3, 4]

  Vertex   Code word
  1        ACG CTT
  2        ATA GAT
  3        CGG TTA
  4        ACT TAA

  Edge     Code word
  1 -> 2   GAA TAT
  2 -> 3   CTA GCC
  3 -> 4   AAT TGA

Each edge code word is the base-wise complement of the second half of its source vertex followed by the first half of its target vertex.
What's a good code set?
• No code word (oligonucleotide) should form an undesirable structure by itself.
  [Figure: code word 2 (ATA GAT) folded back on itself]
  Such a structure may make the code word inert.
• Code words should not interact with each other in undesirable ways.
• Structure formation is governed by
  • WK-complementarity
  • Gibbs free energy
What's a good code set? (cont.)
• Uniform melting temperature
• Preventing undesirable hybridizations
• Other constraints
  • Avoiding repeated bases
  • Forbidden subsequences
    • When a restriction enzyme is used, its recognition site should appear only at the intended sites.
  • Using only 3 types of nucleotides: A, C, T
Melting temperature
• The melting temperature Tm of a dsDNA is the temperature at which half of the dsDNAs are denatured.
• The higher Tm is, the more stable the dsDNA is.

  Tm = ΔH / (R ln(Ct / α) + ΔS)

  • R: gas constant
  • Ct: total oligo concentration
  • ΔH, ΔS: enthalpy and entropy
  • α: 1 for self-complementary and 4 for non-self-complementary strands
Nearest-neighbor method
Refer to [AlSa97], [TKY04] ([8], [9] in this table)
Melting temperature (cont.)
• Uniform melting temperature
  • Making Tm uniform across code words eliminates hybridization bias.
• GC content
  • The ratio of the number of G's and C's to the total number of nucleotides in a sequence
  • A G-C pair is more stable than an A-T pair.
  • Higher GC content implies higher Tm.
  • Sequences are designed with 50% GC content.
Gibbs free energy (ΔG)
• A well-known indicator of the stability of DNA structures
• A structure with lower ΔG is more stable.
• The ΔG of an entire structure is the sum of the ΔG of its substructures [ZuSt81].
Secondary structures look like…
Template method [ArKo02]
• Prepare two bit sequences, each of which has some desirable property (e.g., 50% GC content, error correction).
• Combine these two sequences into a single sequence using a conversion rule.
Template method (cont.)
• Design criteria
  • Template
    • An element x should have at least d mismatches with x^R, xx, x^R x^R, x x^R, and x^R x.
    • A good template is found by exhaustive search.
  • Map (error-correcting code)
    • A code whose words have at least k mismatches with one another.
    • e.g. BCH code
• Drawback
  • It cannot prevent sequences from forming secondary structures.
AG-templates, GC-templates [KKA03]
• GC-template
  • The template contains the same number of 0's and 1's (50% GC content).
  • The map is an error-correcting code.
• AG-template
  • The map is a constant-weight code (50% GC content).
  • Results in a larger set of sequences
Other approaches
• DNASequenceGenerator [FBR00]
  • Software with a GUI
  • Creates sequences with a specified melting temperature and GC content that contain no palindromes, start codons, or restriction sites.
Other approaches (cont.)
• Suyama's approach [YoSu00]
  • Generate sequences randomly and add each one to the sequence set iff it satisfies all of the following constraints:
    • Uniform melting temperature
    • No mis-hybridization
    • No formation of stable secondary structures
  • Drawback: it easily falls into local optima.
Other approaches (cont.)
• Hybrid randomized neighborhoods [TuHo03]
  • Stochastic local search (SLS) algorithm
  • With probability ε, it explores neighbors by randomly mutating the current best sequences.
  • With probability 1-ε, it moves in the direction that maximally decreases the number of constraint conflicts.
Other approaches (cont.)
• GA (genetic algorithm)-based approach [ANH00]
  • Uses a GA to evaluate the fitness of candidate solutions
  • Fitness criteria:
    • Restriction sites
    • GC content
    • Hamming distance
    • Same-base repetition
Other approaches (cont.)
• Gibbs free energy based approach
  • Takes thermodynamics into consideration
  • Uses Gibbs free energy as a stability measure
  • Greater accuracy, because it takes into account the stability of loops and the stacking between base pairs
  • More computational time is needed to calculate the free energy.
  • How can this computational cost be decreased?
    • See [TKY05], [KNO08]
A formal language approach
• Design a set of structure-free codes in terms of WK-complementarity.
• Yields more reliable codes than the free-energy approach.
• Admits more efficient algorithms for decision problems.
• Drawback: each structure must be considered separately.
A formal language approach (cont.)
• Abstraction of concepts
  • {A, C, G, T} → an alphabet V
  • WK-complementarity → an antimorphic involution θ
  • Involution
    • A mapping θ s.t. θ² is the identity (symmetry).
  • Antimorphism
    • θ(xy) = θ(y)θ(x) (opposite direction).
• e.g. θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA

  5' - TCATCCGATTTCGGG - 3'
  3' - AGTAGGCTAAAGCCC - 5'
Bond-free properties [KKS05]
(λ denotes the empty word.)
• θ-non-overlapping: L ∩ θ(L) = ∅
• θ-compliant: ∀w ∈ L, x, y ∈ Σ*: w, xθ(w)y ∈ L ⇒ xy = λ
• Strictly (a): property (a) together with θ-non-overlapping
• θ-p-compliant: ∀w ∈ L, y ∈ Σ*: w, θ(w)y ∈ L ⇒ y = λ
• θ-s-compliant: ∀w ∈ L, x ∈ Σ*: w, xθ(w) ∈ L ⇒ x = λ
• θ-free: L² ∩ Σ⁺θ(L)Σ⁺ = ∅
• θ-sticky-free: ∀w ∈ L, x, y ∈ Σ*: wx, yθ(w) ∈ L ⇒ xy = λ
• θ-3'-overhang-free: ∀w ∈ L, x, y ∈ Σ*: wx, θ(w)y ∈ L ⇒ xy = λ
• θ-5'-overhang-free: ∀w ∈ L, x, y ∈ Σ*: xw, yθ(w) ∈ L ⇒ xy = λ
• θ-overhang-free: both θ-3'-overhang-free and θ-5'-overhang-free
Decidability [KKS05]
• Theorem
  • The following problem is decidable in quadratic time w.r.t. |A|:
    • Input: an NFA A
    • Output: Yes/No, depending on whether L(A) satisfies a fixed one of the following properties (or its strict version):
      • θ-compliant, θ-p-compliant, θ-s-compliant,
      • θ-sticky-free,
      • θ-3'-overhang-free, θ-5'-overhang-free, θ-overhang-free.
Decidability and maximality [KKS05]
• Theorem
  • Let M be a regular language and let L be a regular subset of M with a property ρ, where ρ is one of the following:
    • θ-compliant,
    • θ-p-compliant,
    • θ-s-compliant, or
    • θ-sticky-free.
  • Then it is decidable whether L is a maximal subset of M satisfying ρ.
Secondary structure prevention
• Secondary structures:
  • Hairpin loop (or simply hairpin)
  • Internal loop
  • Multiple-branch loop
  • Pseudoknot
• They can be undesirable
  • e.g. for Adleman's encoding technique for the Hamiltonian Path Problem (HPP).
Secondary Structures
[Figure: a hairpin, a hairpin frame (multiple loop), and an internal loop, each drawn on strands running from 5' to 3']
Hairpin-free language
• A formal model of a hairpin: x v y θ(v) z.

  TAA --- ACG --- CGTTA --- CGT --- CGGT
   x       v        y       θ(v)     z

• Hairpin-freeness
  • Intuitively, it is almost impossible to prevent hairpins of short stack length (say 2 or 3).
  • The goal is to prevent every hairpin whose stack length is at least some given parameter k.
Hairpin-free language [KKL06]
• A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff
  w = x v y θ(v) z ⇒ |v| < k.
• hpf(θ, k): the set of all hp(θ, k)-free words in Σ*
• hp(θ, k) = Σ* - hpf(θ, k)
• A language L is called (θ, k)-hairpin-free iff L ⊆ hpf(θ, k).
Regularity of hairpin languages
• hp(θ, k) = Σ* ( ⋃_{|w| = k} w Σ* θ(w) ) Σ*
  [Figure: a word in hp(θ, k), with arbitrary segments before w, between w and θ(w), and after θ(w)]
• hp(θ, k) and hpf(θ, k) are regular.
• Hence there exists a finite automaton M s.t. L(M) = hpf(θ, k).
Hairpin Freedom Problems
• Hairpin-Freedom problem
  Input: a nondeterministic automaton M.
  Output: Y/N depending on whether L(M) is hp(θ, k)-free.
• Maximal Hairpin-Freedom problem
  Input: a deterministic automaton M1 and an NFA M2.
  Output: Y/N depending on whether there is a word w ∈ L(M2) - L(M1) s.t. L(M1) ∪ {w} is hp(θ, k)-free.
Decidability
• The hairpin-freedom problem for regular languages is decidable in O(|M|) time.
• The maximal hairpin-freedom problem for regular languages is decidable in O(|M1| · |M2|) time.
Hairpin Frames
• The so-called multiple loop
• hp-frame of degree n:
  x1 v1 y1 θ(v1) z1 ... xn vn yn θ(vn) zn
• [Figure: an example of an hp-frame of degree 3]
• A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.
Regularity & decidability
• hp(θ, fr, j): the set of all hp(fr, j)-words in Σ*
• hpf(θ, fr, j): its complement in Σ*
• The languages hp(θ, fr, j) and hpf(θ, fr, j) are regular.
• The hp(fr, j)-freedom problem is decidable in linear time.
• The maximal hp(fr, j)-freedom problem is decidable in O(|M1| · |M2|) time.
Application: DNA-HRAMs
[Figure: a closed hairpin (state 0) and the opened strand --A-C-T-G-T-C-G-A-C-A-G-T-- (state 1), connected by "opening" and "closing" transitions]
• An n-bit DNA-HRAM consists of n hairpins.
• Each hairpin stores 1 bit of information by forming and opening a hairpin as shown above.
n-bit DNA-HRAM
• A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n:
  x1 v1 y1 θ(v1) z1 ... xn vn yn θ(vn) zn
• In order for this word to work as an n-bit RAM, the following word, obtained by deleting every θ(vi), should be hp(θ, 20)-free:
  x1 v1 y1 z1 ... xn vn yn zn
• A DNA memory with 4 hairpins was proposed in [KYO08].
References

• [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
• [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277
• [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882
• [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000)
• [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191
• [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475
• [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
• [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161
• [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298
• [RFL01] Ruben, A. J., Freeland, S. J., Landweber, L. F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. DNA7 (2001) 150-160
• [Sha48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
• [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
• [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911
• [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433
• [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
• [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148

```
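The formal-language slides model WK-complementarity as an antimorphic involution θ and call a word hp(θ, k)-free when it has no factorization x v y θ(v) z with |v| ≥ k. The following is a minimal Python sketch of both notions for a single word. The names `theta` and `is_hairpin_free` are hypothetical, and the quadratic-time scan is only an illustration; it is not the automaton-based decision procedure of [KKL06].

```python
# Hypothetical helper names; only the definitions come from the slides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """WK involution: complement every base and reverse the word."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_hairpin_free(word: str, k: int) -> bool:
    """Return True iff word has no factorization x v y theta(v) z with |v| >= k.

    Testing stems of length exactly k suffices, because a longer stem
    always contains a stem of length k.
    """
    n = len(word)
    for i in range(n - 2 * k + 1):
        v = word[i:i + k]
        # theta(v) must occur at or after the end of v (y may be empty).
        if word.find(theta(v), i + k) != -1:
            return False
    return True


# The example from the slides: theta reverses and complements.
print(theta("TCATCCGATTTCGGG"))

# The hairpin example x=TAA, v=ACG, y=CGTTA, theta(v)=CGT, z=CGGT.
print(is_hairpin_free("TAAACGCGTTACGTCGGT", 3))
```

The first call reproduces the slide's example, and the second reports that the example word contains a hairpin of stack length 3.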
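The code-set criteria above (GC content, melting temperature) and the vertex/edge encoding for the Hamiltonian path problem can be illustrated with a short Python sketch. All function names are hypothetical. The melting temperature function only evaluates the formula Tm = ΔH / (R ln(Ct / α) + ΔS); it assumes ΔH in cal/mol and ΔS in cal/(mol·K), returns Kelvin, and expects the caller to supply ΔH and ΔS (for example from a nearest-neighbor table), since no parameter values are given in the slides.

```python
import math

GAS_CONSTANT = 1.987  # cal / (mol * K)
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def melting_temperature(delta_h: float, delta_s: float,
                        total_conc: float, alpha: float = 4.0) -> float:
    """Tm in Kelvin from Tm = dH / (R ln(Ct / alpha) + dS).

    alpha is 1 for self-complementary strands and 4 otherwise.
    """
    return delta_h / (GAS_CONSTANT * math.log(total_conc / alpha) + delta_s)


def edge_code(source: str, target: str) -> str:
    """Edge code word as listed in the slides: the base-wise complement of
    the second half of the source vertex followed by the first half of the
    target vertex (written in the same left-to-right order as the vertices).
    """
    junction = source[len(source) // 2:] + target[:len(target) // 2]
    return "".join(COMPLEMENT[base] for base in junction)


# Vertex code words from the slides.
vertices = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}
for u, v in [(1, 2), (2, 3), (3, 4)]:
    print(f"{u} -> {v}: {edge_code(vertices[u], vertices[v])}")

print(gc_content("ATCGGTCAACTGCCCTAATG"))
```

Run on the slide's vertex code words, `edge_code` reproduces the three listed edge code words, and the 20-base example strand from the WK-complementarity slide has a GC content of 50%.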