CMSC 838T Presentation by 7akgJz52


									  Accurate Method for Fast Design of
Diagnostic Oligonucleotide Probe Sets for
           DNA Microarrays

             Nazif Cihan Tas
            CMSC 838 Presentation
   DNA microarrays techniques are used intensely for
    identification of biological agents
       Gene Expression Studies
       Diagnostic Purposes
          Identification of Microorganisms in samples
       Item Extraction
   Complex Problem
       Find the necessary probes and the temperature
       Probe sets should be reliably detect and differentiate target
       Large Databases
       NEW!! Homologous Genes (how to find specific probes)

                           CMSC 838T – Presentation
                          Talk Overview
   Overview of talk
       Motivation
       Problem Statement
       Algorithm
       Mathematical Aspects
       Experimentation
       Discussion

                            CMSC 838T – Presentation
                       Problem Statement
   Positive Probes
       Database set S0
       Target S1
       For each sequence in S1, find at least one probe
       For S0 - S1 try to avoid it (but do not care if happens)

                            S1                            S0

       High Specificity: # of non-target matches are minimized
       High Sensitivity: # of covered target seq. is maximized

                            CMSC 838T – Presentation
                       Problem Statement
   Negative Probes
       Determine as few as possible probes which together hybridizes
        with all sequences in S0 - S1 but with NONE in S1.

                           S1                            S0

       High Specificity: No seq. in S1 may hybridize
       High Sensitivity: Max # of seq. in S0 - S1 be covered

                            CMSC 838T – Presentation
                 Probe Design Constraints
   Sequence Related
       Length of probes
       Deviation of melting temperature of probe-target hybrids must
        be low (for physical reasons)
       No self complementary regions longer than four nucleotides
        (not descriptive enough)
       Melting temperatures of target and non-target seq. must be
        larger than a predefined (too close, too hard to identify)
           Ensuring a minimum number of mismatches is enough

            (homologous sequences)
   System Related
       Execution Time
       Usability

                           CMSC 838T – Presentation
   Overview
       Probe Generation
       Hybridization Prediction
       Probe Selection

                           CMSC 838T – Presentation
                          Probe Generation

   Subproblem:
       Generate probe candidates for the sequences
       Keep the set as small as possible without losing any optimal
        candidate (exclude infeasible ones)
   Suffix Tree
       Why?
          Allows fast recognition of repetitive subsequences
          Identifies non-unique probes (i.e. with more than one target)

          Efficient for memory and for T computation (reduce time)

       How?
            Tree is constructed from the sequences
            Traversed (Watson-Crick complement)

                            CMSC 838T – Presentation
                              Suffix Tree
   Input: TACTACA
       TACTACA
       ACTACA
       CTACA
       TACA
       ACA
       CA
       A
   $ denotes end of string

   Constructed in linear time

                          CMSC 838T – Presentation
                          Probe Generation
   Further Improvements
       Filters applied for cut off
            Probe length (predefined)
            G-C content (for temperature)
            Self-complementarity
                 Probes should not contain complements as subsequences
       Finally, remove highly conserved (non-specific) regions
       Insert into hashtables according to their lengths

                              CMSC 838T – Presentation

CMSC 838T – Presentation
                       Hybridization Prediction
   Subproblem:
        Search for the right probe
        Search is expensive, Intelligent Hashing used
   Design
        A frame is moved over target and nontarget seqs. with several
             Previous algorithm (Kaderali 2002): Use the suffix tree

        At each step, hash values are calculated. If hit, predict melting
         temperature, store in hybridization matrix.
        If there are too many hits for a probe, then it is not unique, remove it
        Why intelligent?
             Hash time is linear

             Allows inexact matching because of hashing (No analysis)

   Parallelization
        Several threads are searching for probe targets.
           Tree and hashtables are fixed.

        One thread writes to the final matrix
                               CMSC 838T – Presentation
                    Hybridization Prediction

   Empirical Simulation:
       One million random probe-target pairings generated
       Four mismatches or one insertion or deletion plus one strong
        central mismatch chosen
       T<20 C for 93%

       Complexity is O ( |S0| |S1| )
          Possible probe candidates is |S1| (linear)
          Each position of database S0 must be checked

                          CMSC 838T – Presentation

CMSC 838T – Presentation
                            Probe Selection
   Subproblem:
       Use the hybridization matrix to finalize the probe selection
          We have positive probes and negative probes to proceed

   Algorithm Analysis:
       For each probe candidate
          g: #of matches in S1

          b: #of matches in S0 - S1

          t: highest melting point in S1

       Probes for which g or b values is too large, are removed
       Sort according to g,b and t.
       Apply Depth First Search
   Advantages
       Performs well (No comparison though)
       Guarantees to choose all specific probes if any were found.
   Disadvantages
       can NOT guarantee optimal selection in terms of coverage
                             CMSC 838T – Presentation
                  Negative Probe Selection
   Let S2 =S0 - S1 and B subset of S2 . The probes that
    detect S1 also detects some of B elements.
   Algorithm for Negative Probes
       Apply probe generating and preselection for B.
       Conduct hybridization on B U S1 .
       Remove the probes which hybridizes with S1 .
       Sort the remaining probes according to their hit number.
       Successively select the probes which covers most target seq.
   Guarantees optimal solution for coverage and probe
    number usage

                           CMSC 838T – Presentation
Probe Selection

CMSC 838T – Presentation
Mathematical Aspects

    CMSC 838T – Presentation
   Parallelized on SMP platform
       Classic worker-producer
       Intel Dual Pentium III 933 MHz, 1 GB memory
   Test data
       ssu rRNA of ARB project
       20.282 ssu rRNA sequences
       1.401 < lengths < 4.179
       %97 of them are shorter than 2.000 bases

                           CMSC 838T – Presentation
   High Performance
       Execution is linear with size of database, decreases if longer
        probes are used
   Low Memory Consumption
       Depends on the size of the sequence selection, NOT database
   Automatic Design of Group Probes and negative
   High Quality Probe Design

                           CMSC 838T – Presentation
   Comparison with previous work
       vs. ARB
           Not suited for large scale probe design
       vs. LCF
           Does not consider highly conserved data
          Memory consumption is high

          Works well with short probes only

       vs. others
            Mostly can not deal with insertion and deletions
            Execution is slow

                             CMSC 838T – Presentation

To top