

  Design and Optimization of
  Universal DNA Arrays


        Ion Mandoiu

  CSE Department & BME Program
     University of Connecticut
                  DNA Microarrays
• Exploit Watson-Crick complementarity to simultaneously
  perform a large number of substring tests

• Used in a variety of high-throughput genomic analyses
   –   Transcription (gene expression) analysis
   –   Single Nucleotide Polymorphism (SNP) genotyping
   –   Genomic-based microorganism identification
   –   Alternative splicing, ChIP-on-chip, tiling arrays,…

• Common microarray formats involve direct
  hybridization between labeled DNA/RNA sample and
  DNA probes attached to a glass slide
          Universal DNA Arrays
• Limitations of direct hybridization formats:
   – Arrays of cDNAs: inexpensive, but can only be used for
     transcription analysis
   – Oligonucleotide arrays: flexible, but expensive unless
     produced in large quantities
• Universal DNA arrays: “programmable” arrays
   – Array consists of application-independent oligonucleotides
   – Detection carried out by a sequence of reactions involving
     application-specific primers
   – Flexible AND cost-effective
• Universal array architectures: DNA tag arrays, APEX
  arrays, SBE/SBH arrays
                     DNA Tag Arrays

[Figure: DNA tag array assay. Labels: Tag + Primer probe; "Mix tag+primer
 probes with genomic DNA"; Solution phase hybridization; Single-Base
 Extension; Antitag; Solid phase hybridization]
      Tag Hybridization Constraints

   [Figure: hybridization diagram with tags t1 and t2]



(H1) Tags hybridize strongly to complementary antitags
(H2) No tag hybridizes to a non-complementary antitag
(H3) Tags do not cross-hybridize to each other

Tag Set Design Problem: Find a maximum cardinality set
  of tags satisfying (H1)-(H3)
              Hybridization Models
• Hamming distance model, e.g., [Marathe et al. 01]
   – Models rigid DNA strands


• LCS/edit distance model, e.g., [Torney et al. 03]
   – Models infinitely elastic DNA strands


• c-token model [Ben-Dor et al. 00]:
   – Duplex formation requires formation of nucleation complex
     between perfectly complementary substrings
   – Nucleation complex must have weight ≥ c, where
     wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)
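
A minimal Python sketch of the c-token model, assuming strands are plain strings over A, C, G, T. The names wt, reverse_complement and can_nucleate are illustrative and do not come from the cited papers. The check reports whether some substring of one strand with weight at least c is perfectly complementary to a substring of the other strand.

```python
# 2-4 rule weights: weak bases (A, T) count 1, strong bases (C, G) count 2.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def wt(s):
    """Weight of a DNA string under the 2-4 rule."""
    return sum(WEIGHT[b] for b in s)


def reverse_complement(s):
    """Watson-Crick reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def can_nucleate(x, y, c):
    """True if a substring of x with weight >= c is perfectly
    complementary to a substring of y (a possible nucleation complex)."""
    # A substring of x is complementary to part of y exactly when it
    # occurs verbatim in the reverse complement of y.
    target = reverse_complement(y)
    for i in range(len(x)):
        w = 0
        for j in range(i, len(x)):
            w += WEIGHT[x[j]]
            if w >= c:
                if x[i:j + 1] in target:
                    return True
                # Longer substrings starting at i contain this one,
                # so they cannot occur in target either.
                break
    return False
```

For instance, can_nucleate(tag, reverse_complement(tag), c) holds for any tag of weight at least c, which is the behavior (H1) asks for.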
                 c-h Code Problem
• c-token: left-minimal DNA string of weight ≥ c, i.e.,
   – wt(x) ≥ c
   – wt(x’) < c for every proper suffix x’ of x
• A set of tags is a c-h code if
   (C1) Every tag has weight ≥ h
   (C2) Every c-token is used at most once

c-h Code Problem [Ben-Dor et al. 00]
Given c and h, find a maximum cardinality c-h code

[Ben-Dor et al. 00] give an approximation algorithm based on
De Bruijn sequences
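
The definitions above can be checked mechanically. The sketch below is an illustration only: the helper names c_tokens and is_c_h_code are assumptions, and it tests just (C1) and (C2) as stated on this slide. Since all weights are positive, at most one c-token ends at each position of a tag, namely the shortest substring ending there with weight at least c.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}  # 2-4 rule


def wt(s):
    """Weight of a DNA string under the 2-4 rule."""
    return sum(WEIGHT[b] for b in s)


def c_tokens(tag, c):
    """List the c-tokens of tag, one per end position where one exists."""
    tokens = []
    for end in range(len(tag)):
        w = 0
        # Grow leftward until the weight first reaches c; every proper
        # suffix of the resulting string then has weight < c.
        for start in range(end, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end + 1])
                break
    return tokens


def is_c_h_code(tags, c, h):
    """Check (C1) every tag has weight >= h and
    (C2) every c-token is used at most once over the whole set."""
    seen = set()
    for tag in tags:
        if wt(tag) < h:
            return False
        for tok in c_tokens(tag, c):
            if tok in seen:
                return False
            seen.add(tok)
    return True
```

With c=4, the tag AACG contains the c-tokens AAC and CG, so it cannot share a code with GAAC, which also contains AAC.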
          Periodic Tags [MT05]
• Key observation: c-token uniqueness constraint
  in c-h code formulation is too strong
  – A c-token should not appear in two different tags, but
    can be repeated in a tag
  – Periodic tags use fewer c-tokens!


 Tag set design can be cast as a cycle packing
 problem
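
To illustrate the key observation, the sketch below relaxes (C2) so that a c-token may repeat inside one tag but may not appear in two different tags. The function names are illustrative assumptions, and constraints involving complements of tags are ignored here.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}  # 2-4 rule


def wt(s):
    return sum(WEIGHT[b] for b in s)


def c_tokens(tag, c):
    """c-tokens of tag: shortest substring of weight >= c per end position."""
    tokens = []
    for end in range(len(tag)):
        w = 0
        for start in range(end, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end + 1])
                break
    return tokens


def is_relaxed_code(tags, c, h):
    """Every tag has weight >= h, and no c-token occurs in two
    different tags (repeats within a single tag are allowed)."""
    owner = {}
    for idx, tag in enumerate(tags):
        if wt(tag) < h:
            return False
        for tok in set(c_tokens(tag, c)):
            if tok in owner:  # already claimed by an earlier tag
                return False
            owner[tok] = idx
    return True


periodic = "AGAGAGAG"
occurrences = c_tokens(periodic, 4)   # all c-token occurrences
distinct = set(occurrences)           # c-tokens actually consumed
```

Here the periodic tag AGAGAGAG has six c-token occurrences for c=4 but consumes only two distinct c-tokens (AGA and GAG), leaving more c-tokens available for other tags.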
[Figure: c-token factor graph, c=4 (incomplete); nodes shown include
 CC, AAG, AAC, AAAA, AAAT]
           Cycle Packing Algorithm
1.   Construct c-token factor graph G
2.   T ← {}
3.   For all cycles C defining periodic tags, in increasing order of
     cycle length,
       •   Add to T the tag defined by C
       •   Remove C from G
4.   Perform an alphabetic tree search and add to T tags consisting
     of unused c-tokens
5.   Return T

– Gives an increase of over 40% in the number of tags
compared to previous methods
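
The greedy idea behind step 3 can be sketched in Python under strong simplifying assumptions. This stand-in does not build the c-token factor graph: it enumerates candidate periods directly in increasing order of period length, and it omits the tree search of step 4 and any constraints involving complements. The names periodic_tag and greedy_periodic_tags and the parameter max_period are hypothetical.

```python
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}  # 2-4 rule


def c_tokens(tag, c):
    """c-tokens of tag: shortest substring of weight >= c per end position."""
    tokens = []
    for end in range(len(tag)):
        w = 0
        for start in range(end, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end + 1])
                break
    return tokens


def periodic_tag(period, h):
    """Repeat period cyclically until the tag weight reaches h."""
    tag, w, i = [], 0, 0
    while w < h:
        b = period[i % len(period)]
        tag.append(b)
        w += WEIGHT[b]
        i += 1
    return "".join(tag)


def greedy_periodic_tags(c, h, max_period):
    """Greedily keep periodic tags, shortest periods first, whenever
    none of their c-tokens is already used by a kept tag."""
    used, tags = set(), []
    for n in range(1, max_period + 1):
        for letters in product("ACGT", repeat=n):
            tag = periodic_tag("".join(letters), h)
            toks = set(c_tokens(tag, c))
            if used.isdisjoint(toks):
                tags.append(tag)
                used |= toks   # plays the role of "remove C from G"
    return tags


tags = greedy_periodic_tags(c=4, h=12, max_period=2)
```

Rotations of a period (AC and CA) and non-primitive periods (AA) yield tags whose c-tokens are already used, so they are skipped without extra bookkeeping.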
 More Hybridization Constraints…


   [Figure: hybridization diagram involving tags t1 and t2]




• Enforced during tag assignment by
   – Leaving some tags unassigned and distributing primers across
     multiple arrays [Ben-Dor et al. 03]
   – Exploiting availability of multiple primer candidates [MPT05]
Herpes B Gene Expression Assay
GenFlex Tags
                               500 tags             1000 tags            2000 tags
 Tm   # pools   Pool size   # arrays   % Util.   # arrays   % Util.   # arrays   % Util.
 60    1446         1           4       82.26        3       65.35        2       57.05
 60    1446         5           4       88.26        3       70.95        2       63.55
 67    1560         1           4       86.33        3       69.70        2       61.15
 67    1560         5           4       91.86        3       76.00        2       67.20
 70    1522         1           4       88.46        3       73.65        2       65.40
 70    1522         5           4       92.26        2       91.10        2       70.30


Periodic Tags
                               500 tags             1000 tags            2000 tags
 Tm   # pools   Pool size   # arrays   % Util.   # arrays   % Util.   # arrays   % Util.
 60    1446         1           4       94.06        2       97.20        1       72.30
 60    1446         5           4       96.13        2      100.00        1       72.30
 67    1560         1           4       96.53        2       98.70        1       78.00
 67    1560         5           4       98.00        2       99.90        1       78.00
 70    1522         1           4       96.73        2       98.90        1       76.10
 70    1522         5           4       97.80        2       99.80        1       76.10
       New SBE/SBH Assay

[Figure: SBE/SBH assay schematic. Labels: Primer; extension bases T and A;
 probe grid AA AC CC CA / AT AG CG CT / TT TG GG GT / TA TC GC GA;
 example sequences TTGCA and GATAA]
       SBE/SBH Throughput (c=13, r=5)

[Figure: line plot. x-axis: Primer Length (10 to 30); y-axis: # SNPs
 assignable per array (0 to 40000); series: 200k, 100k, 20k, 10k and
 2k SNPs, each with 1 primer and with 2 primers]



                              See poster for more details
  Conclusions and Ongoing Work
• Combinatorial algorithms yield significant increases in
  multiplexing rates of universal DNA arrays
   – New SBE/SBH architecture particularly promising based on
     preliminary simulation results
• Ongoing work:
   – Extend methods to more accurate hybridization models, e.g.,
     use nearest-neighbor (NN) melting temperature models
   – More complex (e.g., temperature-dependent) DNA tag set
     non-interaction requirements for DNA self/mediated
     assembly
   – Probabilistic decoding in the presence of hybridization errors
           Acknowledgments

• UCONN Research Foundation
• Claudia Prajescu
• Dragos Trinca

								