VIEWS: 4 PAGES: 49 POSTED ON: 9/1/2012 Public Domain
Genome Evolution. Amos Tanay 2009 Genome evolution: Lecture 11: Transcription factor binding sites Genome Evolution. Amos Tanay 2009 Sequence specific transcription factors • Sequence specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinary • TFs include a DNA binding domain that recognize specifically “regulatory elements” in the genome. • The TF-DNA duplex is then used to target larger transcriptional structure to the genomic locus. Genome Evolution. Amos Tanay 2009 Sequence specificity is represented using consensus sequences or weight matrices • The specificity of the TF binding is central to the understanding of the regulatory relations it can form. • We are therefore interested in defining the DNA motifs that can be recognize by each TF. • A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (erpresenting pairs of nucleotides, for exampe W=[A|T], S=[C|G] • A more flexible representation is using weight matrices (PWM/PSSM): 1 2 3 4 5 6 ACGCGT A 60% 20% 0 0 20% 40% ACGCGA C 0 80% 0 100% 0 0 ACGCAT TCGCGA G 0 0 100% 0 80% 0 TAGCGT T 40% 0 0 0 0 60% • PWMs are frequently plotted using motif logos, in which the height of the character correspond to its probability, scaled by the position entropy Genome Evolution. Amos Tanay 2009 TF binding energy is approximated by weight matrices We can interpret weight matrices as energy functions: E ( s) wi [ si ] i wi [ si ] log( pi [ si ]) This linear approximation is reasonable for most TFs. Leu3 data (Liu and Clarke, JMB 2002) Genome Evolution. Amos Tanay 2009 TF binding affinity is kinetically important, with possible functional implications Ume6 11.5 Average PWM energy Stronger prediction • s 5.5 ChIP ranges Stronger binding Tanay. Genome Res 2006 Kalir et al. Science 2001 Genome Evolution. Amos Tanay 2009 TFs are present at only a fraction of their optimal sequence tragets. Binding is regulated by co-factors, nucleosomes and histone modifications An epigenetic signature of promoters (Heinzman et al., 2007) Genome Evolution. Amos Tanay 2009 TFs are present at only a fraction of their optimal sequence tragets. Binding is regulated by co-factors, nucleosomes and histone modifications An epigenetic signature of enhancers (Heinzman et al., 2007) Genome Evolution. Amos Tanay 2009 TFs are present at only a fraction of their optimal sequence tragets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications Active Inactive Barski et al. Cell 2007 Genome Evolution. Amos Tanay 2009 Specific proteins are identifying enhancers Here are studies of p300 binding in the developing mouse brain (visel et al. 2009) Genome Evolution. Amos Tanay 2009 TFBSs are clustered in promoters or in “sequence modules” • The distribution of binding sites in the genome is non uniform • In small genomes, most sites are in promoters, and there is a bias toward nucleosome free region near the TSS • In larger genomes (fly) we observe CRM (cis-regulatory-modules) which are frequently away from the TSS. These represent enhancers. • A single binding site, without the context of other co-sites, is unlikely to represent a functional loci Genome Evolution. Amos Tanay 2009 Constructing a weight matrix from aligned TFBSs is trivial • This is done by counting (or “voting”) • Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from a set of curated and validated binding site • Validated site: usually using “promoter bashing” – testing reported constructs with and without the putative site Transfac 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers However, there are no real different 830 matrices outthere – the real binding repertoire in nature is still somewhat unclear Genome Evolution. Amos Tanay 2009 Probabilistic interpretation of weight matrices and a generative model • One can think of a weight matrix as a probabilistic model for binding sites: k P(m) Pi (m[i ]) i 1 • This is the site independent model, defining a probability space over k-mers • Given a set of aligned k-mers, we know that the ML motif model is derived by voting (a set of independent multinomial variables – like the dice case) • Now assume we are given a set of sequences that are supposed to include binding sites (one for each), but that we don’t know where the binding sites are. • In other words the position of the binding site is a hidden variable h. • We introduce a background model Pb that describes the sequence outside of the binding site (usually a d-order Markov model) • Given complete data we can write down the likelihood of a sequence s as: |S | Pback ( s) Pback ( s[i ] | s[i d ..i 1])) i 1 k P( s, l | ) Pback ( s) ( Pi ( s[l i ]) / Pback ( s[l i ] | s[l i d ..l i 1])) i 1 Genome Evolution. Amos Tanay 2009 Using EM to discover PWMs de-novo • Inference of the binding site location posterior: P (l | s, 1 ) P ( s, l | 1 ) / P ( s, i | 1 ) i • Note that only k-factors should be computed for each location (Pb(s) is constant)) • Inference of the binding site location posterior: • Note that only k factors should be computed for each location (Pb(s) is constant)) • Starting with an initial motif model, we can apply a standard EM: E: P (l | s, 1 ) P ( s, l | 1 ) / P ( s, i | 1 ) i M: Pi (c) P(l | s , ) (s j [l i], c) j 1 j l 0..|S | • As always with the EM, initializing to reasonable PWM would be critical Following Baily and Elkan, MEME 1995 Genome Evolution. Amos Tanay 2009 Allowing false positive sequences • If we assume some of the sequences may lack a binding site, this should be incorporated into the model: k P( s, l | ) P(hit ) * Pback ( s ) ( Pi ( s[l i ]) / Pback ( s[l i ] | s[l i d ..l i 1])) i 1 • This is sometime called the ZOOPS model (Zero or one positions) l hit s • In Bayesian terms: – Probability of sequence hit P(hit | S) – Probability of hit at position l = Pr(l|S) • We can consider the PWM parameters as variables in the model • Learning the parameters is then equivalent to inference Genome Evolution. Amos Tanay 2009 Using Gibbs sampling to discover PWMs de-novo • We can use Bibbs sampling to sample the hidden sites and estimate the PWM l hit P(l j | l 1 ,.., l i 1 , l i 1 ,.., l n , S ) s • This is done by estimating the PWM from all locations except for the one we sample, and l computing the hit probabilities as shown before hit • Note that we are working with the MAP (Maximum s a-posteriori) to do the sampling: P (l j | MAP ) l hit MAP arg max L(l 1 ,.., l i 1 , l i 1 ,.., l n | S , ) s • But this can be shown to approximate: P(l j | l 1 ,.., l i 1 , l i 1 ,.., l n , ) Gibbs: Lawrence et al. Science 1993 Genome Evolution. Amos Tanay 2009 Generalizing PWMs to allow site dependencies: mixture of PWMs and Trees Tree motif We only change the motif Mixture of PWMs component of the likelihood model Pback ( s ) P( s[l..l l ] | ) P ( s, l | ) k Pback ( s[l i ] | s[l i d ..l i 1]) i 1 Learning the model can become more difficult This is because computing the ML model parameter from complete data may be challenging Barash et al., RECOMB 2003 Genome Evolution. Amos Tanay 2009 Discriminative scores for motifs • So far we used a generative probabilistic model to learn PWMs • The model was designed to generate the data from parameters • We assumed that TFBSs are distributed differently than some fixed background model • If our background model is wrong, we will get the wrong motifs.. • A different scoring approach try to maximize the discriminative power of the motif model. • We will not go here into the details of discriminative vs. generative models, but we shall exemplify the discriminative approach for PWMs. Lousy discriminator High specificity discriminator High sensitivity discriminator Genome Evolution. Amos Tanay 2009 Hypergeometric scores and thresholding PWMs | A | n | A | k | B | k P(| A B | k ) Hyper geometric probability n (sum for j>=k is the hg p-value) | B | Number of sequences Positive True positive PWM score threshold For a discriminative score, we need to decide on both the PWM model and the threshold. Genome Evolution. Amos Tanay 2009 Exhaustive k-mer search • A very common strategy for motif finding is to do exhustive k-mer search. • Given a set of hits and a set of non hits, we will compute the number of occurrences of each k-mer in the two sets and report all cases that have a discriminative score higher than some threshold • Since k-mers either match or do not match, there is no issue with the threshold • For DNA, we will typically scan k=5-8. • This can be done efficiently using a map/hash: – Iterate on short sequence windows (of the desired k length) – For each window, mark the appearance of the k-mer in a table – Avoid double counting using a second map • It is easy to generalize such exhaustive approaches to include gaps or other types of degeneracy. Genome Evolution. Amos Tanay 2009 Refining k-mers to PWMs using heuristic “EM” • K-mer scan is an excellent intial step for finding refined weight matrices. For example, we can use them to initialize an EM. • If we want to find a weight matrix, but want to stick to the discriminative setting, we can heuristically use and “EM-like” algorithm: – Start with a k-mer seed – Add uniform prior to generate a PWM – Compute the optimal PWM threshold (maximal hyper-geometric score) – Restimate the PWM by voting from all PWM true positives • Consider additional PWM positions • Bound the position entropies to avoid over-fitting – Repeat two last steps until fail to improve score • There are of course no guarantees for improving the scores, but empirically this approach works very well. Genome Evolution. Amos Tanay 2009 High density arrays quantify TF binding preferences and identify binding sites in high throughput • Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome • The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them Harbison et al., Nature 2004 Genome Evolution. Amos Tanay 2009 Genome Evolution. Amos Tanay 2009 Direct measurments of the in-vitro binding afffinity of 8-mers and DNA binding domains (here just a library of homeodomains, from Berger et al. 2008) Genome Evolution. Amos Tanay 2009 Profiling binding affinity to the entire k-mer spectrum provide direct quantification of in-vitro affinity (Badis et al., 2009) 104 TFs 8-mers Heatmap of 2D hierarchical agglomerative clustering analysis of 4740 ungapped 8- mers over 104 nonredundant TFs, with both 8- mers and proteins clustered using averaged E-score from the two different array designs. Genome Evolution. Amos Tanay 2009 If only biology was that simple… Discrete and deterministic “binding sites” in yeast as identified by Young, Fraenkel and colleuges In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution. Genome Evolution. Amos Tanay 2009 Correlation between PWM predicted binding and ChIP experiments spans high, medium and low affinity sites r = 0.11 r = 0.28 r = 0.21 r = 0.74 ABF1 r = 0.8 r = 0.72 PWM sequence energy GCN4 PWM sequence energy PWM sequence energy MBP1 14 - 11 - 12 . 5 8 - 13 - 14 . 5 r = 0.42 r = 0.42 r = 0.42 r = 0.26 r = 0.28 2 - 15 r = 0.20 - 16 . 5 -2 2 6 -2 2 6 -2 2 6 ChIP log(binding ratio) ChIP log(binding ratio) ChIP log(binding ratio) PWM regression exploits variable levels of binding affinity to robustly recover binding preferences. Motif regression optimizes the PWM given the overall correlation of the predicted binding energies and the measured ChIP values vs arg max spearman ( F ( s | ), vs ) Tanay, GR 2006 Genome Evolution. Amos Tanay 2009 TFBS evolution: purifying selection and conservation TF1 TF1 Similar function Disrupted function CACGCGTT CACGCGTA CACGCGTT CACGAGTT Neutral evolution Low rate purifying selection TF1 TF2 TF1 Altered function CACGCGTT CACACGTT Altered affinity CACGCGTT CACACGTT Low rate purifying selection Rate? Selection? Genome Evolution. Amos Tanay 2009 Binding sites conservation Kellis et al., 2003 Genome Evolution. Amos Tanay 2009 Binding sites conservation: heuristic motif identification Kellis et al., 2003 Genome Evolution. Amos Tanay 2009 Analyzing k-mer evolutionary dynamics • Instead of trying to identify conserved motifs try to infer the evolutionary rate of substitution between pairs of k-mers • Start from a multiple alignment and reconstruct ancestral sequences (assuming site independence, or even max parsimony) • Now estimate the number of substitution between pairs of 8-mers, compare this number to the number expected by the background model • Do it for a lot of sequence, so that statistics on the difference between observed and expected substitutions can be derived Genome Evolution. Amos Tanay 2009 Saccharomyces TFBS Selection Network Inter-island organization in the Reb1 cluster: selection hints toward multi modality of Reb1 Nodes: octamers node conservation conserved @ 2SD conserved @ 3SD otherwise Arcs: 1nt substitution arc Rate Selection Normal neutral Low negative not enough stat Tanay et al., 2004 Genome Evolution. Amos Tanay 2009 Leu3 selection network Substitution changing high affinity to high affinity motifs 0.3 TF1 0.2 Altered affinity 0.1 Motif 1 Motif 2 Rate? 0 Selection? -5 -4 -3 -2 -1 0 1 2 3 log delta affinity High Affinity Substitution changing (Kd < 60) high affinity to low Meidum Affinity High rate subs. affinity motifs (400 > Kd > 60) Genome Evolution. Amos Tanay 2009 A simple transcriptional code and its evolutionary implications TF5 TF4 AAATTT ACGCGT AATTTT TCGCGT AAAATT ACGCGT TF3 GATGAG TF1 GATGCG CACGTG GATGAT CACTTG TF2 TGACTG TGAGTG TGACTT Genome Evolution. Amos Tanay 2009 The Halpren-Bruno model for selection on affinity The basic notion here is of the relations between sequence, binding and function/fitness Sequence Binding energy E (S ) Function F (E) We argued that E(S) can be approximated by a PWM F(E) is a completely different story, for example: Is there any function at all to low affinity binding sites? Is there a difference between very high affinity and plain strong binding sites? Are all appearances of the site subject to the same fitness landscape? Genome Evolution. Amos Tanay 2009 The Halpren-Bruno model for selection on affinity We work on deriving the substitution rate at each position of the binding site, given its observed stationary frequency. We are assuming that the fitness of the site is defined by multiplying the fitnesses of each locus. This means fitness is generally linear in the binding energy! According to Kimura’s theory, an allele with 1 e 2 s fitness 1 s, s 1 fitness s and a homogeneous population would fixate with probability: 1 e 2 Ns 1 e 2 s 2s f ab 2 Ns Assuming slow mutation rate (which allow us to 1 e 1 e 2 Ns assume a homogenous population) and motifs 1 e2s 2s a and b with relative fitness s the fixation f ab probabilities (chance of fixation given that 1 e 2 Ns 1 e 2 Ns mutation occurred!) are: 2s 1 e 2 Ns e 2 Ns 1 f ab / f ba e 2 Ns 1 e 2 Ns 2s 1 e 2 Ns If p represent the mutation probability, and p the p b pba f ba p p f stationary distribution, and if we assume the 1 b ba ab e 2 Ns process as a whole is reversible then: p a pab f ab p a pab f ba p p p p ln b ba ln b ba p p p p f ab a ab rab c pab a ab p p p p 1 a ab 1 a ab p b pba p b pba (Halpern and Bruno, MBE 1998) Genome Evolution. Amos Tanay 2009 The Halpren-Bruno model for selection on affinity The HB model is extremely limited for the study of general sequences. When restricting the analysis to relatively specific sites, HB is not completely off Moses et al., 2003 Genome Evolution. Amos Tanay 2009 Testing the general binding energy – fitness correspondence Expected and observed energy • While E(S) is approximated by a PWM, F(E) is distribution in E.Coli CRP sites unlikely to be linear (left) and background (right) • Assume that the background probability of a motif a is P0(a). In detailed balance, and assuming the fitness of a at functional sites is F(a), the stationary distribution at sites can be shown to be: Q(a ) Po (a )e 2 NF ( a ) • If we collapse all sites with binding energy E (and hence the same F(a)=F(E(a)) Q( E ) Po ( E )e 2 NF ( E ) • The entire genome should behave like a mixture of background sequance and functional loci: W ( E ) (1 ) Po ( E ) Q( E ) Comparison of CRP energies in Inferred F(E), is shown in Orange E.coli and S. typhimurium • So we can try and recover Q(E) and therefore F(E) from the maximum likelihood parameters fitting an empirical W(E) (Hwa and Gerland, 2000-) Mustonen and Lassig, PNAS 2005 Genome Evolution. Amos Tanay 2009 More tests for possible conservation of low binding energy sites Simulation S. cerevisiae S. mikitae (Neutral, context aware) High affinity ΔE ΔE ΔE ΔE .. .. .. .. 1 0.8 KS statistics 0.6 0.4 Low affinity 0.2 0 0 0.25 0.5 Genome Evolution. Amos Tanay 2009 More tests for possible conservation of low binding energy sites Binding site conservation Conservation of total Reb1 energy S 60 Conservation score 50 40 30 20 S S 10 0 0 50 100 binding energy percentile Ume6 Cbf1 Mbp1 Gcn4 20 20 20 20 Conservation score 15 15 15 15 10 10 10 10 5 5 5 5 0 0 0 0 0 50 100 0 50 100 0 50 100 0 50 100 binding energy percentile binding energy percentile binding energy percentile binding energy percentile Tanay, GR 2006 Genome Evolution. Amos Tanay 2009 Algorithms for discovering conserved sites: • Assuming PWM models – Search for loci that behave as predicted by the model (P(s|TFBS)/P(s|back) > Threshold) – Search for genomic regions with surprisingly many conserved binding sites • Search for an ML PWM – Search for motifs and conserved sites in aligned sequences • Assume the phylogeny, alignment, look for a PWM that will optimize the likelihood of the data – Search for motifs and conserved sites in unaligned sequences • Assume the phylogeny, look for an ML PWM • Search for a general ML evolutionary model (many PWMs) – Search for a set of PWMs/Motif model that will maximize the likelihood of the data Genome Evolution. Amos Tanay 2009 An evolutionary model P(si>si’|si-1) Phylo Tree Neutral model Motif (PWM?) TFBS evo model Substitution probability = background *(fixation) Substitution probability = HB model Genome Evolution. Amos Tanay 2009 The PhyME/PhyloGibbs model (Sinha, Blanchette, Tompa 2004, Sidharthan,Siggia,Nimwegen 2005, based on evo model by Siggia and collegues) our presentation is a bit generalized and adapted.. • Earlier/Other similar approaches using EM but practicxallt • Evolutionary model: – A phylogeny – Neutral independence model (simple tree, probabilities on branches) – A PWM-induced fixation probabilities that are proportional to the matrix weights: paxi xi wk [ xi ] xi paxi Pr(xi | paxi , l (i ) k ) 1 paxi xi wk [ xi ] xi pax xi • Recall the single species generative model k P( s, l | ) Pback ( s ) ( Pi ( s[l i ]) / Pback ( s[l i ] | s[l i d ..l i 1])) i 1 • The generalization to alignments is similar, but should consider a phylogeny at each locus Back Motif Back Genome Evolution. Amos Tanay 2009 Phylogenetic generative model TFBS 1 2 3 4 paxi xi wk [ xi ] xi paxi Pr(xi | paxi , l (i ) k ) 1 paxi xi wk [ xi ] xi pax xi Pr( xi | paxi , l (i ) no) paxi xi Prob of emitting the alignment given the mode Joint probability – product over log Pr( s, l | w, ) Pr( l (i )) Pr( s i | l (i ), w, ) independent trees i Prior of having a motif/background Inference – for loci modes (l(i)) and for ancestral sequences Pr( l (i ) | s, w), Pr( xi | l (i ), s, w) (hierachicaly as in Ex 3) Learning – find ML PWM matrix arg max L( w | S ) Pr(S | w) w Genome Evolution. Amos Tanay 2009 EM for the PWM parameters Key point: the positions in the tree are independent once we decide on their “mode” (background or a certain positions in a binding site) Complete data: the mode of each position (l(i)). The ancestral sequences (h) Joint probability: Pr( s, l | w, ) Pr( l (i )) Pr( s i | l (i ), w, ) i The EM target: Posterior of the missing data Joint probability Q( w | w' ) Pr(l , h | S , w' ) log Pr(l (i)) Pr(x i | paxi , w, l (i)) j j Complete data! l ,h i j We can use the usual trick and decompose this into a sum of independent terms, one for each PWM position. For position k we have: Pr( l (i ) k | s, w' ) log(Pr( l (i ) k ) Pr( h | l (i ) k , s, w' ) log Pr( x ij | paxij , l (i ) k , wk ) i h j log( paxi xi wk [ xi ]) log(Pr(l (i) k ) Pr(l (i) k | S , w' ) Pr(x ij , x ipaj | s, w' , l (i) k )log(1 paxi y wk [ y ]) i j x ij , x ipaj y! x paj i Genome Evolution. Amos Tanay 2009 EM for the PWM parameters Average number of times of we observed the variables log( paxi xi wk [ xi ]) log(Pr(l (i) k ) Pr(l (i) k | S , w' ) Pr(x j , x paj | s, w' , l (i) k )log(1 i i paxi y wk [ y ]) i i i j x j , x paj y! x ipaj Log of the optimized parameter In simple words: •we are optimizing a weighted sum of log(x*w) or log(1-Sxw) •The coefficient for the mutations paxj -> xj at PWM positions k is the average number of times we observe it given the previous parameter set w’ The optimization problem is solved using Lagrange multipliers (to satisfy the constraint on the wk). But because each parameter appears in terms of two forms (log(x*w) and log(1-x*w), the solution is not as trivial as the dice case. Genome Evolution. Amos Tanay 2009 Example for intra-site epistasis AAACGT Assume the following TFBS alignment (where ACGCGT is the motif) ACGCGT According to the simple TFBS evo model, we ACGCGT should “pay” twice for the loss of the ACGCGT “Double site, since each of the two mutations would be Loss” multiplied by a very low fixation probability AAACGT ACGCGT The more realistic scenario involve one mutation that was under pressure, but then neutrality ACGCGT “Loss” Multiple possible trajectories can have different loss/gain/neutrality dynamics AAGCGT ACGCGT “Neutrality” AAACGT Assuming Affinity=PWM=Fitness gives the Halpren Burno When fitness(PWM) is non linear, we have epistasis which model, or the frequency selection approximation means problems for the simple loci-independent model Affinity Affinity PWM PWM Fitness Fitness Genome Evolution. Amos Tanay 2009 An evolutionary model (2) P(si>si’|si-1) Phylo Tree Neutral model TF Target Sets Selection factors (per TF) To fully capture the effect of binding sites, we need to study a Markov model over the entire sequence. Instantaneous rates: background * selection on TFBS changes Mutation rate = background*selection factor *selection factor Genome Evolution. Amos Tanay 2009 Controlling context effects: approximately independent blocks •The sequence is decomposed into intervals, each containing one or more overlapping binding sites, or sequences that are one mutation away from a binding site. Such intervals are called epistatic intervals •The joint probability is written as a product over epistatic intervals •For each epistatic interval we should compute exp(Qt)(from, to), where the rate matrix is large (4d) when d is the interval size •This is approximated by (exp(Qt/N))N Genome Evolution. Amos Tanay 2009 Learning model parameters •Finding the optimal selection factor is solved by non-linear optimization of the likelihood •Looking for target set is done using a greedy algorithm. •The resulted target sets and their selection factors are analogous to motifs/PWMs with some additional evolutionary parameter indicating the strength of selection (or the sufficiency of the motif to determine a functional site) •As shown to the left, similar motifs are sometime separated by significant selection factors, suggesting functional partitioning.