Docstoc

DIPS ppt Counting position weight matrices in a sequence an

Document Sample
DIPS ppt Counting position weight matrices in a sequence an Powered By Docstoc
					    Counting position weight matrices in a
sequence & an application to discriminative
                               motif finding



 Saurabh Sinha
 Computer Science
 University of Illinois, Urbana-Champaign
Transcriptional Regulation

        TRANSCRIPTION
           FACTOR        GENE
        ACAGTGA




                        PROTEIN
Transcriptional Regulation

        TRANSCRIPTION
           FACTOR        GENE
        ACAGTGA




                        PROTEIN
Binding sites and motifs
 Transcription factor binding sites in a
  gene’s neighborhood are the fundamental
  units of the regulatory network
 Transcription factor binding is specific,
  hence binding sites are similar to each
  other, but variability is often seen
 A motif is the common sequence pattern
  among binding sites of transcription factor
Motif models
 Consensus string, e.g., ACGWGT
 Position Weight Matrix (PWM)
      Position Weight Matrix

          ACCCGTT
Binding   ACCGGTT
sites     ACAGGAT
          ACCGGTT
          ACATGAT
          5 0 2 0 0 2 0 A
          0 5 3 1 0 0 0 C
          0 0 0 3 5 0 0 G
          0 0 0 1 0 3 5 T


          PWM
Databases of PWMs
 Transfac has ~100s of PWMs for human
 Jaspar: a smaller, perhaps better curated
  database of PWMs
 Organism specific databases coming up
  frequenctly
 PWMs in databases often derived from
  experimentally validated binding sites
Bioinformatics of PWMs
   Popular motif model
   i.e., several motif finding algorithms that attempt
    to find PWMs from sequences
   Gibbs sampling: one of the earliest; tries to
    sample PWMs with high “relative entropy”
   MEME: another early algorithm; uses
    expectation maximization to find PWMs that
    best “model the sequences”
   Many more algorithms to find PWMs from a set
    of sequences
Problem: counting motifs
   Given DNA sequence, and a consensus motif
    (say “ACGWGT”), count the motif in the
    sequence
   Trivial solution
   What if the motif is a Position Weight Matrix
    (PWM) ?
   Why hasn’t this problem been looked at?
   Because previous algorithms used different
    scores of PWMs: how “sharp” they are, how
    well they explain data, etc.
Counting matches to a PWM: A possibility
   For each site s in sequence, compute



 If Pr(s | W) > some threshold, call s a site
 Count number of sites in sequence
 No distinction between strong and weak
  sites, as long as they are above threshold
       binary scheme, not realistic
A wish-list (for the score)
   Score should consider both strong and weak
    occurrences of motif
   Score should assign appropriate weights to
    strong and weak occurrences
   Score should be aware that there may also be
    sites of other known motifs in the sequence
   The list goes on: score should be efficiently
    computable, score should be differentiable,
    score should …
The “w-score”
 Defined by a probabilistic model of
  sequence generation
 Given one or more motifs, and a
  background distribution, defines a
  probability space on sequences
 A simple (zeroth order) Hidden Markov
  model (HMM)
    Probabilistic Model: toy example
                       Given two motifs W1,W2, a
                        “background” motif Wb, and a
               W1       sequence length L
                       Pr(Wi  Wj) = pj
       Wb
                           transition probability
               W2      When in state Wi, emit a
                        substring s chosen with
                        probability Pr(s | Wi)
                           emission probability
                       Stop when length of emitted
                        sequence is L
A stochastic process generating sequences of length L
A “path” through the HMM
                                 One possible path T1
Wb        Wb        Wb

               W1        W2
     W1



                              Another    possible path T2
Wb              Wb

      W2                 W2
Likelihood of sequence & paths
                                 A path of the HMM defines the
                                  locations of motif matches
Wb        Wb        Wb
                                 For a sequence S & a path T,
               W1        W2       the joint probability Pr(S,T) is
     W1
                                  easy to compute
                                 Conditional probability of a
                                  path T, given the data S, is:
Wb              Wb

      W2                 W2
                                 Strong matches make the
                                  probability higher
                                 Paths with weak matches have
                                  lower conditional probabilities
The “w-score”
 Let the number of occurrences of a motif
  (say W1) in path T be
 Compute:




   In words: An average of the motif count
           , with weights equal to the
    probability of T given S
The “w-score” (Cont’d)
 Score depends both on number and
  quality of matches to motif.
 Every substring is a potential binding site,
  and paths placing the motif there will
  contribute to the count
 Pr(T | S) depends on the match strength
  of all motifs, not just the one being
  counted
  The wish-list (again)
  Score should give consider both strong
    and weak occurrences of motif
  Score should assign appropriate weights
    to strong and weak occurrences
  Score should be aware that there may
    also be sites of other known motifs in the
    sequence
    An exciting new feature of this motif score
Computational pros and cons
 The w-score computation takes
  time, where L is sequence length, and lm
  is the motif length. This is relatively
  expensive
 The w-score can be differentiated with
  respect to all of the PWM parameters in
  time
       Important feature for search algorithms
Using the “w-score” in
discriminative motif finding
Discriminative motif finding
   Suppose we have a set of co-regulated genes,
    i.e., we believe they have binding sites of the
    same transcription factor (in their regulatory
    control regions)
   Traditionally, motif finding tries to find these
    binding sites, based on over-representation,
    conservation etc.
   Often we also know a set of genes that should
    NOT have binding sites of that transcription
    factor
   Examples: ChIP-on-chip, In situ hybridization
    pictures of Drosophila embryo, etc.
Problem formulation
 Given two sets of sequences S+ and S-
 Find a motif that has many occurrences in
  S+ and few occurrences in S-
 Maximize the difference in the average
  counts of the motif in the two sets
 Let W(S) = count of a motif W in
  sequence S
 Maximize:
Optimization problem
   Find motif W that maximizes
Derivatives of objective function
 Let Wk be the PWM entry for base  in
  column k
 We can efficiently compute


   We can efficiently differentiate our
    objective function
Algorithm
 Search space: Set of n = 20 substrings of
  sequences in S+ (called “site set”)
 Objective function: Construct PWM W
  from site-set, compute score
 Length of sites is user-defined
Algorithm
            Current site-set C




                 S+
Algorithm
   Replace one site with any site from sequence




                     S+
 Pick a replacement that improves objective function
Algorithm
 Current solution (site-set): C
 Candidate new solution: C
 Many possibilities for C (every substring
  of every sequence in S+ is a possible
  replacement)
 Evaluate objective function on each
  candidate C
       Too slow !
   Use derivative information !
Algorithm
 Estimate the objective function value for
  each candidate C using partial
  derivatives and first order approximation
 Examine each candidate in decreasing
  order of estimated score
 If a candidate C found with greater score
  than C, choose it.
    Algorithm illustration
 Current score = 12          Estimated scores
                             10
                                  Accurate score


Accurate score

             13


  Accurate score      11
Algorithm Properties
 Objective function has many desirable
  properties, but is an expensive operation
 Derivative computation has the same time
  complexity, and is used to guide search
 Avoids local optima by searching in a
  discretized PWM space
 Performs significantly better and/or faster
  than Gibbs sampling and Conjugate
  Gradients, for this particular score
Discriminative PWM Search (DIPS)
 Software available
 Can easily handle data sets of ~100
  sequences
 Can find multiple motifs iteratively, but
  without masking:
       Find a PWM, then include it in the model as a
        known PWM, find another PWM, and so on
Performance tests
 Tested on synthetic data
 Compared to traditional motif finder as
  well as two discriminative motif finders
 Superior performance in the presence of
  “distractor” motifs
       it really helps to be able to count a motif in
        the presence of other known motifs
Tests on Drosophila Enhancers

  200


  180


  160
                        BICOID (ACTIVATOR)
  140


  120


  100


  80


  60


  40


  20


   0
        100   80   60        40    20




 HEAD                                   TAIL
Tests on Drosophila Enhancers

 100

             CAUDAL (ACTIVATOR)



  50




  0
       100       80    60    40   20




 HEAD                                  TAIL
DIPS runs
 S+ = promoters of genes expressed in
  anterior half of embryo
 S- = promoters of genes expressed in
  posterior half of embryo
 Top motif: Bicoid !
      200


      180


      160


      140                   BICOID (ACTIVATOR)
      120


      100


      80


      60


      40


      20


       0
            100   80   60       40    20




     HEAD                                  TAIL
DIPS runs
 S+ = promoters of genes expressed in
  posterior half of embryo
 S- = promoters of genes expressed in
  anterior half of embryo
 Top motif: Caudal !
     100




          CAUDAL (ACTIVATOR)


     50




      0
           100   80   60   40   20




      HEAD                           TAIL
 Summary of results
Phase                S+               S-       Found motif   Best match Pvalue of match
              anterior 50%    posterior 50%    anterior.1    bicoid           0.00
              posterior 50%   anterior 50%     posterior.1   caudal           0.03
1 (activator)
              terminal 40%    middle 80%       terminal.1    torRE            0.00
              middle 40%      0-30%, 70-100%   centrall.1    hunchback        0.00
              80-100% EL      60-80% EL        5.1.1         torRE            0.10
              60-80% EL       80-100%,40-60%   5.2.1         caudal           0.06
2 (repressor) 40-60% EL       60-80%, 20-40%   5.3.1         kruppel          0.25
              20-40% EL       0-20%, 40-60%    5.4.1         knirps           0.03
              0-20% EL        20-40% EL        5.5.1         Dichaete         0.07
              anterior 50%    posterior 50%    anterior.2    huckebein        0.02
              posterior 50%   anterior 50%     posterior.2   pdm1_2           0.00
3 (activator)
              terminal 40%    middle 80%       terminal.2    caudal           0.06
              middle 40%      0-30%, 70-100%   centrall.2    huckebein        0.12
              80-100% EL      60-80% EL        5.1.2         knirps           0.01
              60-80% EL       80-100%,40-60%   5.2.2         torRE            0.01
4 (repressor) 40-60% EL       60-80%, 20-40%   5.3.2         giant            0.09
              20-40% EL       0-20%, 40-60%    5.4.2         giant            0.05
              0-20% EL        20-40% EL        5.5.2         bicoid           0.18
Social regulation in honey bee
 Transition from nursing in the hive to
  foraging for food is age related, but also
  regulated by the needs of the colony
 32 genes demonstrated to be significantly
  differentially expressed in brains of nurses
  and foragers (21 active in foragers only,
  11 active in nurses only)
 DIPS run on 2Kbp promoters of these
  social behavior-related genes
Results on honey bee genes
Conclusion
 Discriminative motif finding increasingly
  becoming a necessary analysis
 Motif finding in the presence of other
  known motifs also becoming relevant
 A search algorithm that maximizes any
  objective function of the motif counts in
  the sequences
     (as long as its differentiable)
     Several extensions and variations possible
Acknowledgements
 Eric Siggia, Eran Segal
 Yoseph Barash (“LearnPSSM”)
 Andrew Smith (“DME”)
Reference
   ISMB 2006 (Brazil); Bioinformatics
    journal.

				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:8
posted:12/23/2010
language:English
pages:43
Description: DIPS ppt Counting position weight matrices in a sequence an