Genome evolution a sequence-centric approach by dfhdhdhdhjr

VIEWS: 4 PAGES: 49

									                                  Genome Evolution. Amos Tanay 2009




Genome evolution:

   Lecture 11: Transcription factor
            binding sites
                                                                            Genome Evolution. Amos Tanay 2009




Sequence specific transcription factors

•   Sequence specific transcription factors (TFs) are a critical part of any gene activation
    or gene repression machinary
•   TFs include a DNA binding domain that recognize specifically “regulatory elements” in
    the genome.
•   The TF-DNA duplex is then used to target larger transcriptional structure to the
    genomic locus.
                                                                               Genome Evolution. Amos Tanay 2009


Sequence specificity is represented using consensus
sequences or weight matrices
•   The specificity of the TF binding is central to the understanding of the regulatory
    relations it can form.
•   We are therefore interested in defining the DNA motifs that can be recognize by
    each TF.
•   A simple representation of the binding motif is the consensus site, usually derived by
    studying a set of confirmed TF targets and identifying a (partial) consensus.
    Degeneracy can be introduced into the consensus by using N letters (matching any
    nucleotide) or IUPAC characters (erpresenting pairs of nucleotides, for exampe
    W=[A|T], S=[C|G]
•   A more flexible representation is using weight matrices (PWM/PSSM):

                                        1     2      3      4      5     6

               ACGCGT             A     60%   20%    0      0      20%   40%
               ACGCGA
                                  C     0     80%    0      100%   0     0
               ACGCAT
               TCGCGA             G     0     0      100%   0      80%   0
               TAGCGT
                                  T     40%   0      0      0      0     60%



•   PWMs are frequently plotted using motif logos, in which the height of the character
    correspond to its probability, scaled by the position entropy
                                                                 Genome Evolution. Amos Tanay 2009




TF binding energy is approximated by weight matrices




We can interpret weight matrices as
   energy functions:



      E ( s)   wi [ si ]
                  i

      wi [ si ]  log( pi [ si ])

This linear approximation is reasonable
    for most TFs.



                                          Leu3 data (Liu and Clarke, JMB 2002)
                                                                                          Genome Evolution. Amos Tanay 2009



TF binding affinity is kinetically important, with possible functional
implications




                                    Ume6
                      11.5
 Average PWM energy




                                                  Stronger prediction


                                     •   s



                       5.5
                                 ChIP ranges

                               Stronger binding

                             Tanay. Genome Res 2006                     Kalir et al. Science 2001
                                                           Genome Evolution. Amos Tanay 2009
TFs are present at only a fraction of their optimal sequence tragets.
Binding is regulated by co-factors, nucleosomes and histone
modifications




                                                                         An
                                                                         epigenetic
                                                                         signature of
                                                                         promoters
                                                                         (Heinzman
                                                                         et al., 2007)
                                                           Genome Evolution. Amos Tanay 2009
TFs are present at only a fraction of their optimal sequence tragets.
Binding is regulated by co-factors, nucleosomes and histone
modifications




                                                                         An epigenetic
                                                                         signature of
                                                                         enhancers
                                                                         (Heinzman et
                                                                         al., 2007)
                                                                 Genome Evolution. Amos Tanay 2009

TFs are present at only a fraction of their optimal sequence tragets.
Binding is combinatorially regulated by co-factors, nucleosomes and
histone modifications




                                                                  Active


                                                                  Inactive




                                                    Barski et al. Cell 2007
                                                   Genome Evolution. Amos Tanay 2009


                     Specific proteins are identifying enhancers
Here are studies of p300 binding in the developing mouse brain
                                               (visel et al. 2009)
                                                                               Genome Evolution. Amos Tanay 2009



TFBSs are clustered in promoters or in “sequence modules”




•   The distribution of binding sites in the genome is non uniform
•   In small genomes, most sites are in promoters, and there is a bias toward
    nucleosome free region near the TSS
•   In larger genomes (fly) we observe CRM (cis-regulatory-modules) which are
    frequently away from the TSS. These represent enhancers.
•   A single binding site, without the context of other co-sites, is unlikely to represent a
    functional loci
                                                            Genome Evolution. Amos Tanay 2009




Constructing a weight matrix from aligned TFBSs is trivial


• This is done by counting (or “voting”)
• Several databases (e.g., TRANSFAC, JASPAR) contain matrices
  that were constructed from a set of curated and validated binding
  site
• Validated site: usually using “promoter bashing” – testing reported
  constructs with and without the putative site

Transfac 7.0/11.3 have 400/830 different PWMs, based on more than
   11,000 papers

However, there are no real different 830 matrices outthere – the real
  binding repertoire in nature is still somewhat unclear
                                                                                                 Genome Evolution. Amos Tanay 2009


Probabilistic interpretation of weight matrices and a
generative model

•   One can think of a weight matrix as a probabilistic model for binding sites:
                                                   k
                                      P(m)   Pi (m[i ])
                                                  i 1

•   This is the site independent model, defining a probability space over k-mers
•   Given a set of aligned k-mers, we know that the ML motif model is derived by voting
    (a set of independent multinomial variables – like the dice case)

•   Now assume we are given a set of sequences that are supposed to include binding
    sites (one for each), but that we don’t know where the binding sites are.
•   In other words the position of the binding site is a hidden variable h.
•   We introduce a background model Pb that describes the sequence outside of the
    binding site (usually a d-order Markov model)
•   Given complete data we can write down the likelihood of a sequence s as:
                          |S |
            Pback ( s)   Pback ( s[i ] | s[i  d ..i  1]))
                          i 1
                                       k
            P( s, l |  )  Pback ( s) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
                                      i 1
                                                                                                  Genome Evolution. Amos Tanay 2009




Using EM to discover PWMs de-novo


•       Inference of the binding site location posterior:
                                P (l | s,  1 )  P ( s, l |  1 ) /  P ( s, i |  1 )
                                                                        i

•       Note that only k-factors should be computed for each location (Pb(s) is constant))

•       Inference of the binding site location posterior:
•       Note that only k factors should be computed for each location (Pb(s) is constant))

•       Starting with an initial motif model, we can apply a standard EM:

                       E:              P (l | s,  1 )  P ( s, l |  1 ) /  P ( s, i |  1 )
                                                                                i

                       M:             Pi (c)          P(l | s , ) (s j [l  i], c)
                                                                    j       1

                                                  j l 0..|S |




    •    As always with the EM, initializing to reasonable PWM would be critical


                                                                                      Following Baily and Elkan, MEME 1995
                                                                                                    Genome Evolution. Amos Tanay 2009




Allowing false positive sequences


•   If we assume some of the sequences may lack a binding site, this should be
    incorporated into the model:
                                                k
         P( s, l |  )  P(hit ) * Pback ( s ) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
                                               i 1


•   This is sometime called the ZOOPS model (Zero or one positions)

                                                      l
                                     hit                          


                                                      s

•   In Bayesian terms:
     –   Probability of sequence hit P(hit | S)
     –   Probability of hit at position l = Pr(l|S)
•   We can consider the PWM parameters as variables in the model
•   Learning the parameters is then equivalent to inference
                                                                                         Genome Evolution. Amos Tanay 2009




Using Gibbs sampling to discover PWMs de-novo

•   We can use Bibbs sampling to sample the hidden
    sites and estimate the PWM                                                     l
                                                                         hit                   

                       P(l j | l 1 ,.., l i 1 , l i 1 ,.., l n , S )
                                                                                   s
•   This is done by estimating the PWM from all
    locations except for the one we sample, and                                    l
    computing the hit probabilities as shown before                      hit


•   Note that we are working with the MAP (Maximum                                 s
    a-posteriori)  to do the sampling:

     P (l j |  MAP )                                                              l
                                                                         hit
     MAP  arg max L(l 1 ,.., l i 1 , l i 1 ,.., l n | S ,  )
                           

                                                                                   s
•   But this can be shown to approximate:

      P(l j | l 1 ,.., l i 1 , l i 1 ,.., l n , )
                                                                               Gibbs: Lawrence et al. Science 1993
                                                                                         Genome Evolution. Amos Tanay 2009


Generalizing PWMs to allow site dependencies: mixture of
PWMs and Trees



                                               Tree motif




                                                                                         We only change the motif
                                              Mixture of PWMs                                component of the
                                                                                             likelihood model




                                                                        Pback ( s )  P( s[l..l  l ] |  )
                                            P ( s, l |  ) 
                                                                k                                               
                                                                 Pback ( s[l  i ] | s[l  i  d ..l  i  1]) 
                                                                                                                
                                                                i 1                                            
Learning the model can become more difficult
This is because computing the ML model parameter from complete
    data may be challenging                                                              Barash et al., RECOMB 2003
                                                                                    Genome Evolution. Amos Tanay 2009




Discriminative scores for motifs

•   So far we used a generative probabilistic model to learn PWMs
•   The model was designed to generate the data from parameters
•   We assumed that TFBSs are distributed differently than some fixed background
    model

•   If our background model is wrong, we will get the wrong motifs..

•   A different scoring approach try to maximize the discriminative power of the motif
    model.
•   We will not go here into the details of discriminative vs. generative models, but we
    shall exemplify the discriminative approach for PWMs.




        Lousy discriminator         High specificity discriminator   High sensitivity discriminator
                                                                                                      Genome Evolution. Amos Tanay 2009




Hypergeometric scores and thresholding PWMs


                                                                         | A |  n | A | 
                                                                        
                                                                         k  | B | k 
                                                                                          
                                                     P(| A  B | k )                      Hyper geometric probability
                                                                                n             (sum for j>=k is the hg p-value)
                                                                               
                                                                               | B |
                                                                                     
                                                                                    
                   Number of sequences




                                                                                          Positive
                                                                                          True positive




                                         PWM score threshold




 For a discriminative score, we need to decide on both the PWM model and the
     threshold.
                                                              Genome Evolution. Amos Tanay 2009




Exhaustive k-mer search


• A very common strategy for motif finding is to do exhustive k-mer
  search.
• Given a set of hits and a set of non hits, we will compute the number
  of occurrences of each k-mer in the two sets and report all cases
  that have a discriminative score higher than some threshold
• Since k-mers either match or do not match, there is no issue with
  the threshold
• For DNA, we will typically scan k=5-8.
• This can be done efficiently using a map/hash:
    – Iterate on short sequence windows (of the desired k length)
    – For each window, mark the appearance of the k-mer in a table
    – Avoid double counting using a second map
• It is easy to generalize such exhaustive approaches to include gaps
  or other types of degeneracy.
                                                                     Genome Evolution. Amos Tanay 2009




Refining k-mers to PWMs using heuristic “EM”

•   K-mer scan is an excellent intial step for finding refined weight matrices. For
    example, we can use them to initialize an EM.
•   If we want to find a weight matrix, but want to stick to the discriminative
    setting, we can heuristically use and “EM-like” algorithm:

     –   Start with a k-mer seed
     –   Add uniform prior to generate a PWM
     –   Compute the optimal PWM threshold (maximal hyper-geometric score)
     –   Restimate the PWM by voting from all PWM true positives
          • Consider additional PWM positions
          • Bound the position entropies to avoid over-fitting
     – Repeat two last steps until fail to improve score


•   There are of course no guarantees for improving the scores, but empirically
    this approach works very well.
                                                                             Genome Evolution. Amos Tanay 2009


High density arrays quantify TF binding preferences and
identify binding sites in high throughput




•   Using microarrays (high resolution tiling arrays) we can now map binding sites in a
    genome-wide fashion for any genome
•   The problem is shifting from identifying binding sites to understanding their function
    and determining how sequences define them
                                                                          Harbison et al., Nature 2004
Genome Evolution. Amos Tanay 2009
                                                                         Genome Evolution. Amos Tanay 2009




Direct measurments of the in-vitro binding afffinity of 8-mers and DNA binding domains
(here just a library of homeodomains, from Berger et al. 2008)
                                                                  Genome Evolution. Amos Tanay 2009



    Profiling binding affinity to the entire k-mer spectrum provide direct
    quantification of in-vitro affinity (Badis et al., 2009)
                                         104 TFs




8-mers                    Heatmap of 2D hierarchical
                          agglomerative clustering
                          analysis of 4740 ungapped 8-
                          mers over 104 nonredundant
                          TFs, with both 8- mers and
                          proteins clustered using
                          averaged E-score from the
                          two different array designs.
                                                                             Genome Evolution. Amos Tanay 2009

If only biology was that simple…


Discrete and
deterministic “binding
sites” in yeast as
identified by Young,
Fraenkel and colleuges




In fact, binding is rarely deterministic and discrete, and simple wiring is something you
should treat with extreme caution.
                                                                                                                                                                       Genome Evolution. Amos Tanay 2009


                        Correlation between PWM predicted binding and ChIP experiments spans high,
                        medium and low affinity sites
                            r = 0.11                                                        r = 0.28                                                             r = 0.21
                            r = 0.74    ABF1                                                r = 0.8                                                              r = 0.72




                                                                                                                                PWM sequence energy
                                                                                                       GCN4




                                                               PWM sequence energy
 PWM sequence energy




                                                                                                                                                                              MBP1

                       14                                                            - 11
                                                                                                                                                      - 12 . 5




                        8                                                            - 13
                                                                                                                                                      - 14 . 5
                                                r = 0.42                                                                                                                                 r = 0.42
                                                                                                              r = 0.42
                                                r = 0.26                                                                                                                                 r = 0.28
                        2                                                            - 15
                                                                                                              r = 0.20                                - 16 . 5

                            -2             2               6                                -2            2                 6                                    -2               2                 6

                                 ChIP log(binding ratio)
                                                                                                  ChIP log(binding ratio)                                             ChIP log(binding ratio)



                       PWM regression exploits variable levels of binding affinity to robustly recover
                       binding preferences.


Motif regression optimizes the PWM given the overall correlation of
the predicted binding energies and the measured ChIP values vs


                                   arg max spearman ( F ( s |  ), vs )


                                                                                                                                                                                Tanay, GR 2006
                                                             Genome Evolution. Amos Tanay 2009




TFBS evolution: purifying selection and conservation


              TF1
                                                     TF1

       Similar function
                                            Disrupted function
CACGCGTT                 CACGCGTA
                                     CACGCGTT                    CACGAGTT
       Neutral evolution
                                                Low rate
                                           purifying selection
             TF1          TF2
                                                   TF1
      Altered function
CACGCGTT                  CACACGTT          Altered affinity
                                     CACGCGTT                    CACACGTT
            Low rate
       purifying selection                        Rate?
                                                Selection?
                             Genome Evolution. Amos Tanay 2009




Binding sites conservation




                                      Kellis et al., 2003
                                                    Genome Evolution. Amos Tanay 2009


Binding sites conservation: heuristic motif identification




                                                             Kellis et al., 2003
                                                              Genome Evolution. Amos Tanay 2009




Analyzing k-mer evolutionary dynamics

• Instead of trying to identify conserved motifs try to infer the
  evolutionary rate of substitution between pairs of k-mers

• Start from a multiple alignment and reconstruct ancestral sequences
  (assuming site independence, or even max parsimony)

• Now estimate the number of substitution between pairs of 8-mers,
  compare this number to the number expected by the background
  model

• Do it for a lot of sequence, so that statistics on the difference
  between observed and expected substitutions can be derived
                                                                  Genome Evolution. Amos Tanay 2009


   Saccharomyces TFBS Selection Network


                                    Inter-island organization in
                                  the Reb1 cluster: selection hints
                                   toward multi modality of Reb1



 Nodes: octamers
node      conservation
         conserved @ 2SD
        conserved @ 3SD
           otherwise


Arcs: 1nt substitution
  arc Rate           Selection
        Normal         neutral
          Low          negative
          not enough stat

Tanay et al., 2004
                                                                Genome Evolution. Amos Tanay 2009



Leu3 selection network


                                                             Substitution changing
                                                              high affinity to high
                                                                 affinity motifs

                                               0.3
                                    TF1
                                               0.2

                            Altered affinity
                                               0.1
                       Motif 1                Motif 2
                                   Rate?        0
                                 Selection?          -5 -4 -3 -2 -1 0 1 2 3
                                                          log delta affinity



 High Affinity
                                              Substitution changing
 (Kd < 60)
                                              high affinity to low
Meidum Affinity   High rate subs.
                                              affinity motifs
(400 > Kd > 60)
                                                     Genome Evolution. Amos Tanay 2009



A simple transcriptional code and its evolutionary
implications


                                TF5       TF4
                    AAATTT
                                       ACGCGT
                    AATTTT
                                       TCGCGT
                    AAAATT
                                       ACGCGT
          TF3
                GATGAG      TF1
                GATGCG                CACGTG
                GATGAT                CACTTG
                          TF2
                         TGACTG
                         TGAGTG
                         TGACTT
                                                                                    Genome Evolution. Amos Tanay 2009




The Halpren-Bruno model for selection on affinity


  The basic notion here is of the relations between sequence, binding and function/fitness

              Sequence
              Binding energy                        E (S )
              Function                              F (E)




  We argued that E(S) can be approximated by a PWM

  F(E) is a completely different story, for example:
            Is there any function at all to low affinity binding sites?
            Is there a difference between very high affinity and plain strong binding sites?
            Are all appearances of the site subject to the same fitness landscape?
                                                                                                         Genome Evolution. Amos Tanay 2009




      The Halpren-Bruno model for selection on affinity
          We work on deriving the substitution rate at each position of the binding site, given its observed
          stationary frequency. We are assuming that the fitness of the site is defined by multiplying the
          fitnesses of each locus. This means fitness is generally linear in the binding energy!

         According to Kimura’s theory, an allele with             1  e 2 s                   fitness  1  s, s  1
         fitness s and a homogeneous population would
         fixate with probability:                                 1  e  2 Ns
                                                                           1  e 2 s         2s
                                                                  f ab          2 Ns
                                                                                        
        Assuming slow mutation rate (which allow us to                     1 e           1  e  2 Ns
        assume a homogenous population) and motifs
                                                                           1  e2s          2s
        a and b with relative fitness s the fixation              f ab                
        probabilities (chance of fixation given that                       1 e 2 Ns
                                                                                         1  e 2 Ns
        mutation occurred!) are:                                                   2s          1  e 2 Ns e 2 Ns  1
                                                                  f ab / f ba                                         e 2 Ns
                                                                                1  e  2 Ns      2s      1  e  2 Ns
         If p represent the mutation probability, and p the         p b pba f ba      p p     f
         stationary distribution, and if we assume the                            1  b ba  ab  e 2 Ns
         process as a whole is reversible then:
                                                                    p a pab f ab      p a pab f ba


                                              p p                                    p p 
                                           ln  b ba                               ln  b ba 
                                                                                       p p 
                                              p p 
                                     f ab   a ab                  rab  c  pab   a ab 
                                               p p                                      p p
                                            1  a ab                                 1  a ab
                                               p b pba                                  p b pba
(Halpern and Bruno, MBE 1998)
                                                                             Genome Evolution. Amos Tanay 2009




The Halpren-Bruno model for selection on affinity

 The HB model is extremely limited for the study of general sequences.
 When restricting the analysis to relatively specific sites, HB is not completely off




                                                                                     Moses et al., 2003
                                                                                                   Genome Evolution. Amos Tanay 2009



    Testing the general binding energy – fitness correspondence

                                                            Expected and observed energy
•    While E(S) is approximated by a PWM, F(E) is           distribution in E.Coli CRP sites
     unlikely to be linear                                  (left) and background (right)



•    Assume that the background probability of a
     motif a is P0(a). In detailed balance, and
     assuming the fitness of a at functional sites is
     F(a), the stationary distribution at sites can be
     shown to be:
                         Q(a )  Po (a )e 2 NF ( a )

•    If we collapse all sites with binding energy E
     (and hence the same F(a)=F(E(a))

                       Q( E )  Po ( E )e 2 NF ( E )

•    The entire genome should behave like a
     mixture of background sequance and functional
     loci:
               W ( E )  (1   ) Po ( E )  Q( E )
                                                                                               Comparison of CRP energies in
                                                         Inferred F(E), is shown in Orange     E.coli and S. typhimurium
•    So we can try and recover Q(E) and therefore
     F(E) from the maximum likelihood parameters
     fitting an empirical W(E)                                                          (Hwa and Gerland, 2000-)
                                                                                        Mustonen and Lassig, PNAS 2005
                                                                           Genome Evolution. Amos Tanay 2009




More tests for possible conservation of low binding energy
sites

            Simulation                      S. cerevisiae          S. mikitae
       (Neutral, context aware)

                                             High affinity
  ΔE                                                                                    ΔE
  ΔE                                                                                    ΔE
  ..                                                                                    ..
  ..                                                                                    ..




                                   1

                                  0.8
                                            KS statistics
                                  0.6

                                  0.4        Low affinity
                                  0.2

                                   0
                                        0         0.25       0.5
                                                                                                                                                                                    Genome Evolution. Amos Tanay 2009


 More tests for possible conservation of low binding energy
 sites
                                                                       Binding site
                                                                       conservation



                                                  Conservation
                                                  of total                                                                                                                                Reb1
                                                  energy                                       S                                                                       60




                                                                                                                                                  Conservation score
                                                                                                                                                                       50
                                                                                                                                                                       40
                                                                                                                                                                       30
                                                                                                                                                                       20

                                                                             S                                   S                                                     10
                                                                                                                                                                        0
                                                                                                                                                                            0               50              100
                                                                                                                                                                                  binding energy percentile


                                     Ume6                                     Cbf1                                        Mbp1                                                         Gcn4
                     20                                       20                                       20                                                      20
Conservation score




                     15                                       15                                       15                                                      15

                     10                                       10                                       10                                                      10

                      5                                        5                                        5                                                              5

                      0                                        0                                        0                                                              0
                          0             50              100        0             50              100        0              50               100                             0              50               100
                              binding energy percentile                binding energy percentile                binding energy percentile                                       binding energy percentile




                                                                                                                                                                                              Tanay, GR 2006
                                                                     Genome Evolution. Amos Tanay 2009




Algorithms for discovering conserved sites:


• Assuming PWM models
   – Search for loci that behave as predicted by the model
     (P(s|TFBS)/P(s|back) > Threshold)
   – Search for genomic regions with surprisingly many conserved binding
     sites
• Search for an ML PWM
   – Search for motifs and conserved sites in aligned sequences
       • Assume the phylogeny, alignment, look for a PWM that will optimize the
         likelihood of the data
   – Search for motifs and conserved sites in unaligned sequences
       • Assume the phylogeny, look for an ML PWM
• Search for a general ML evolutionary model (many PWMs)
   – Search for a set of PWMs/Motif model that will maximize the likelihood
     of the data
                                                                Genome Evolution. Amos Tanay 2009



An evolutionary model




             P(si>si’|si-1)


Phylo Tree   Neutral model           Motif (PWM?)           TFBS evo model




                              Substitution probability = background *(fixation)




                              Substitution probability = HB model
                                                                                                  Genome Evolution. Amos Tanay 2009



The PhyME/PhyloGibbs model (Sinha, Blanchette, Tompa 2004,
Sidharthan,Siggia,Nimwegen 2005, based on evo model by Siggia and collegues)

our presentation is a bit generalized and adapted..

•   Earlier/Other similar approaches using EM but practicxallt
•   Evolutionary model:
     –   A phylogeny
     –   Neutral independence model (simple tree, probabilities on branches)
     –   A PWM-induced fixation probabilities that are proportional to the matrix weights:

                                                     paxi  xi wk [ xi ]   xi  paxi
                                                   
                     Pr(xi | paxi , l (i )  k )  1  
                                                   
                                                        paxi  xi wk [ xi ] xi  pax
                                                      xi
•   Recall the single species generative model

                                                 k
                     P( s, l |  )  Pback ( s ) ( Pi ( s[l  i ]) / Pback ( s[l  i ] | s[l  i  d ..l  i  1]))
                                                i 1


•   The generalization to alignments is
    similar, but should consider a
    phylogeny at each locus
                                                                 Back                 Motif               Back
                                                                                                              Genome Evolution. Amos Tanay 2009




    Phylogenetic generative model
                                                            TFBS

                                                 1        2          3         4




                                                                    paxi  xi wk [ xi ]   xi  paxi
                                                                  
                                    Pr(xi | paxi , l (i )  k )  1  
                                                                  
                                                                       paxi  xi wk [ xi ] xi  pax
                                                                     xi


                                    Pr( xi | paxi , l (i )  no)   paxi  xi
                                                                                   Prob of emitting the alignment given the mode
Joint probability –
product over                        log Pr( s, l | w,  )   Pr( l (i )) Pr( s i | l (i ), w,  )
independent trees                                                 i
                                                              Prior of having a motif/background
Inference – for loci modes (l(i))
and for ancestral sequences          Pr( l (i ) | s, w), Pr( xi | l (i ), s, w)
(hierachicaly as in Ex 3)

Learning – find ML PWM matrix
                                    arg max L( w | S )  Pr(S | w)
                                         w
                                                                                                                    Genome Evolution. Amos Tanay 2009




     EM for the PWM parameters
        Key point: the positions in the tree are independent once we decide on their “mode”
        (background or a certain positions in a binding site)

        Complete data: the mode of each position (l(i)). The ancestral sequences (h)

        Joint probability:
                               Pr( s, l | w,  )   Pr( l (i )) Pr( s i | l (i ), w,  )
                                                       i
         The EM target:
                                                Posterior of the missing data                     Joint probability
                                                                                                             
                        Q( w | w' )       Pr(l , h | S , w' ) log   Pr(l (i)) Pr(x i | paxi , w, l (i))  
                                                                               j          j
                                                                                                             
                                                                                                              
                   Complete data!     l ,h
                                                                     i 
                                                                                 j                           
                                                                                                               
         We can use the usual trick and decompose this into a sum of independent terms, one
         for each PWM position. For position k we have:
                                                                                                                         
 Pr( l (i )  k | s, w' ) log(Pr( l (i )  k ) Pr( h | l (i )  k , s, w' ) log  Pr( x ij | paxij , l (i )  k , wk ) 
                                                                                                                             
i                                               h                                   j                                    

                                                                                       log(  paxi  xi wk [ xi ]) 
                                                                                                                          
log(Pr(l (i)  k )  Pr(l (i)  k | S , w' )  Pr(x ij , x ipaj | s, w' , l (i)  k )log(1   
                                                                                                         paxi  y wk [ y ]) 
                  i                         j x ij , x ipaj                                                               
                                                                                             y! x paj
                                                                                                                            
                                                                                                    i
                                                                                                   Genome Evolution. Amos Tanay 2009




    EM for the PWM parameters
                         Average number of times of we observed the variables
                                                                                     log(  paxi  xi wk [ xi ]) 
                                                                                                                         
log(Pr(l (i)  k )  Pr(l (i)  k | S , w' )  Pr(x j , x paj | s, w' , l (i)  k )log(1   
                                                           i i
                                                                                                        paxi  y wk [ y ]) 
                  i                            i     i
                                                                                                                          
                                                                                     
                                             j x j , x paj
                                                                                            y! x ipaj
                                                                                                                           
                                                                                        Log of the optimized parameter

      In simple words:
           •we are optimizing a weighted sum of log(x*w) or log(1-Sxw)

             •The coefficient for the mutations paxj -> xj at PWM positions k is the average
             number of times we observe it given the previous parameter set w’


      The optimization problem is solved using Lagrange multipliers (to satisfy the constraint
      on the wk). But because each parameter appears in terms of two forms (log(x*w) and
      log(1-x*w), the solution is not as trivial as the dice case.
                                                                                         Genome Evolution. Amos Tanay 2009




Example for intra-site epistasis

                                                                                         AAACGT
 Assume the following TFBS alignment (where ACGCGT is the motif)
                                                                                         ACGCGT
 According to the simple TFBS evo model, we                                              ACGCGT
 should “pay” twice for the loss of the ACGCGT                            “Double
 site, since each of the two mutations would be                            Loss”
 multiplied by a very low fixation probability                                 AAACGT               ACGCGT

 The more realistic scenario involve one mutation
 that was under pressure, but then neutrality                                            ACGCGT
                                                                              “Loss”
 Multiple possible trajectories can have different
 loss/gain/neutrality dynamics                                                   AAGCGT             ACGCGT
                                                                  “Neutrality”
                                                                           AAACGT


 Assuming Affinity=PWM=Fitness gives the Halpren Burno   When fitness(PWM) is non linear, we have epistasis which
 model, or the frequency selection approximation         means problems for the simple loci-independent model


                        Affinity                                              Affinity

                        PWM                                                      PWM

                       Fitness                                                         Fitness
                                                               Genome Evolution. Amos Tanay 2009



 An evolutionary model (2)




                 P(si>si’|si-1)


 Phylo Tree      Neutral model        TF Target Sets       Selection factors
                                                               (per TF)
To fully capture the effect of binding sites, we need to study a Markov
model over the entire sequence.

Instantaneous rates: background * selection on TFBS changes

                                   Mutation rate = background*selection factor
                                                             *selection factor
                                                                    Genome Evolution. Amos Tanay 2009

Controlling context effects: approximately independent blocks




                                         •The sequence is decomposed into
                                         intervals, each containing one or
                                         more overlapping binding sites, or
                                         sequences that are one mutation
                                         away from a binding site. Such
                                         intervals are called epistatic
                                         intervals

                                         •The joint probability is written as a
                                         product over epistatic intervals

                                         •For each epistatic interval we should
                                         compute exp(Qt)(from, to), where the
                                         rate matrix is large (4d) when d is the
                                         interval size

                                         •This is approximated by (exp(Qt/N))N
                                           Genome Evolution. Amos Tanay 2009

Learning model parameters




                            •Finding the optimal selection factor is
                            solved by non-linear optimization of
                            the likelihood

                            •Looking for target set is done using a
                            greedy algorithm.

                            •The resulted target sets and their
                            selection factors are analogous to
                            motifs/PWMs with some additional
                            evolutionary parameter indicating the
                            strength of selection (or the
                            sufficiency of the motif to determine a
                            functional site)

                            •As shown to the left, similar motifs
                            are sometime separated by significant
                            selection factors, suggesting
                            functional partitioning.

								
To top