What directs a transcription factor to its target site

Document Sample
What directs a transcription factor to its target site Powered By Docstoc
					What directs a transcription factor to its target site:
    New insights learned from ChIP-Seq data

                    Philipp Bucher

               Wednesday June 10, 2009
          Les Journées Ouvertes en Biologie,
               Nantes, pays de la Loire
                          Program for Today

•   Introduction to transcription factor binding sites
•   Introduction to ChIP-Seq
•   ChIP-Seq analysis for transcription factors
•   in vivo versus in vitro binding specificity of transcription factors
•   How many predicted TF binding sites are occupied in vivo
•   Synergistic interactions between adjacent TF binding sites
•   Sequence conservation and chromatin context of in vivo binding sites
  Facts about transcription factor binding sites

• Length: 6-20 bp
• Degenerate sequence motifs
• Chance occurrence: one site in 250 to 50’000 binding sites
• Up to millions of binding sites in the genome
• Only a few thousand transcription factor molecules per nucleus
• Most predicted sites in the genome presumably not functional
• However, some weak binding sites were shown to be functional
• Synergistic interaction between adjacent binding sites
   Representation of the binding specificity by a scoring
        matrix (also referred to as weight matrix)

                1      2     3     4     5     6     7     8    9

       A   -10       -10   -14   -12   -10     5    -2   -10   -6
       C        5    -10   -13   -13    -7   -15   -13     3   -4
       G       -3    -14   -13   -11     5   -12   -13     2   -7
       T       -5      5     5     5   -10    -9     5   -11    5

Strong          C      T     T     T     G     A     T     C     T
Binding site    5    + 5   + 5   + 5   + 5   + 5   + 5   + 3   + 5 =   43

Random           A     C     G     T     A     C     G     T     A
Sequence       -10   -10   -13   + 5   -10   -15   -13   -11   - 6 = -83
        Physical interpretation of an weight matrix

Weight matrix elements represent relative binding energies between DNA
base-pairs and protein surface areas (base-pair acceptor sites).
A weight matrix column describes the base preferences of a base-pair acceptor
        Berg-von Hippel model of protein-DNA interactions

The weight matrix score expresses the binding     G( x)  S (x)  const.
free energy of a protein-DNA complex in
                                                 S (x)   wi ( xi )
arbitrary units:                                          i 1

It is convenient to express the binding free     E (x)    i ( xi )
                                                            i 1
energy in dimension-less −RT units:                i (b)   wi (b) RT

On a relative scale, the binding constant for      K rel ( x)  e E ( x )
sequence x is then given by:
The energy terms of a weight matrix can be                    1 p (b)
computed from the base frequencies pi(b) estimated  i (b)   ln i
                                                               q(b)
from in vitro or in vivo selected binding sites:
q(b) is the background frequency of base b.
 λ is an unknown parameters related to the
stringency of the binding conditions.
              ChIP-Seq Technique and Data Structure

Our representation: SGA (simple genome annotation) format:

NC_000001.9     stim    559139   -      1    Fields of SGA format:
NC_000001.9     stim    559333   +      1
NC_000001.9     stim    559356   -      1        1. sequence ID
NC_000001.9     stim    559765   -      1
NC_000001.9     stim    559766   +      3        2. feature
NC_000001.9     stim    559767   +      1        3. position
NC_000001.9     stim    559768   +      1        4. strand
NC_000001.9     stim    559777   +      3
NC_000001.9     stim    559778   +      2
                                                 5. Counts
...                                              … optional additional fields
                       Example: STAT1 ChIP-Seq

Input data:
     Stimulated:        15.1 million tags mapped to genome
     Unstimultated:     12.9 million tags mapped to genome
Mapping software Eland (Illumina)
First control experiment: Distribution of 5’ end 3’ tags around 37
experimentally defined STAT1 sites.
Results: see next slides:
     The average length of immunoprecipitated fragments is approx 140 bp
     (distance between 5’ and 3’ peaks)

Data source: Robertson et al. (2007). Genome-wide profiles of STAT1 DNA
association using chromatin immunoprecipitation and massively parallel sequencing.
Nat Methods 4, 651-657.

Bla, bla, bla

Bla, bla, bla
                          Method 1: ChIP-Cor
    • genomic tag count distributions for two features (reference, target)
    • features may be + and − strand tags from same experiments
    • applicable to other types of features, e.g. TSS positions
    • a count correlation histogram
    • computes # of times tag pairs are found at a particular distance (range)
       from each other (large number of tag pairs !).
    • different normalization options:
         • count density of target feature (tags per bp)
         • global → relative target tag frequency (normalized to 1)
    • identification of average fragment size
    • reveals length distribution of enriched domains
    • provides clues for choosing parameters for peak and partitioning
    • Exploratory analysis
    • Rapid quality control
                           Correlation plot: Example 1

Input data:

Ref:      STAT1 5’ tags
Target:   STAT1 3’ tags

Analysis parameters:

Range: −400,+600
Window width: 10
Count cut-off: 3
Y-axis: count-density


Peak center: ~140
Peak count density: 0.03
Background: < 0.007
                           Correlation plot: Example 2

Input data:

Ref:      CTCF 5’ tags
Target:   CTCF 3’ tags

Analysis parameters:

Range: −400,+400
Window width: 5
Count cut-off: 3
Y-axis: count-density


Peak center: ~75
Peak count density: 0.06
Background: < 0.002
                        Method 2: ChIP-Center

    • To map tag counts to expected center position of a protein-DNA
    • Oriented tag counts for a ChIP-Seq features
    • centered, un-oriented tag counts

   • 5’ and 3’ tag positions show relative displacement to each other
   • best estimates for protein-binding site center position:
        5’ end position + ½ fragment length
        or 3’ end position − ½ fragment length
   • centered tag count distribution more useful for
        1) peak recognition and partitioning
        2) data viewing in genome browser
                  Auto-correlation of centered ChIP-Seq tags

Input data:

Ref:    STAT1 centered-70
Target: STAT1 centered-70

Analysis parameters:

Range: −400,+600
Window width: 10
Count cut-off: 3
Y-axis: count-density


Peak center: ~0
Peak count density: 0.06
Peak volume: ~5 counts
Background: < 0.014
                           Method 3: ChIP-peak

     • identification of peaks corresponding to in vivo protein-DNA complexes
     • centered tag counts
     • list of peak center positions
     • consider only positions which have at least one tag count.
     • for each position, determine cumulative tag count in window of width w.
     • select as peaks those positions, which
         • have cumulative tag count ≥ threshold t.
         • are local maximum with range ± r.
     • Optional, refinement of peak center (center of gravity within window)
Interface to sequence analysis programs:
     • download of sequences around peak center positions
   Example ChIP-Peak: Locating in vivo STAT1-binding sites
 Input data:
      Robertson et al. (2007) Nature Methods 4, 651-657.
      Cell material: Interferon γ-stimulated HeLa S3 cells.
      About 15 million tags in total
 Analysis parameters:
      Centering: 70bp, window w=200bp, exclusion range r=200 bp, threshold t=100 counts
 Result: 4446 peaks,
 sequence extraction range for downstream analysis: −1000, 1000

Donwstream sequence
Distribution of
STAT1 peak
Sliding window size 50
Figure produced with
OPROF (SSA server)
       Peak extraction: how to choose the parameters
How to choose the peak width:
   Based on auto-correlation plots

How to choose the threshold ?
   Based on autocorrelation plot
   Based on a statistical model (requires some assumptions)
   By comparison of the results with a control set (see below)
   By measuring the enrichment of consensus binding sites

       Peak threshold    STAT1 stim         STAT1 unstim
                     5         219300             123300
                   10           51809              10185
                   20           13112                 560
                   50            2602                  48
                  100                 853              18
          ChIP-Seq applications to histone modifications

Hypothetical association with
functional chromatin domains

H3K4me3      promoters
H2AZ         promoters
H3K4me1      enhancers
H3K36me3     transcribed regions
H3K27me3     gene silencing
  ChIP-Seq data histone modifications has nucleosome resolution

Based on data: from Barski et al. 2007, Cell 129, 823-837.
ChIP-Seq tags from both strands centered by 70 bp .
WIG file resolution: H3K4me1 50 bp, H3K4me3 10 bp, H2A.Z 25 bp.
 Chromatin domains are differentially occupied in different
                        cell lines

Mouse Nanog region, H3K4me3 profiles.
ES, ESHyb: embryonic stem cell lines, MEF: embryonic fibroblasts, NP neural
Data from Mikkelsen et al. (2007). Genome-wide maps of chromatin state in
pluripotent and lineage-committed cells. Nature 448, 553-560.
       From ChIP-Seq peaks to TF binding site matrices

Some considerations
   Ab initio motif discovery or consensus sequence-initiated refinement ?
   Speed (time complexity) of downstream programs
   Assumptions about binding site: fixed length or variable lengths
   Assumptions about number of different motifs to be found

Some software tools:

  Seed search, EM refinement in one program
  Time complexity O(N 2)
  Can find multiple motifs in on run
HMM training (e.g. MAMOT)
  Needs initial model
  Flexible, can handle palindromes with variable spacing between half-sites
        STAT1 Sequence Motif Defined by ChIP-Seq data

       4446 ChIP peak regions,
       200 bp
   ab initio motif discovery:
       MEME (zoops)

   Matrix from experimental in vivo sites       Matrix from SELEX
       -7 -6 -5 -4 -3 -2 -1                 0       -7 -6 -5 -4 -3 -2 -1         0
   A   23 38 15       0    2 29      8 19       A   6 62 26     2   2    5   2   2
   C   33 17 13       0    0 67 56 31           C   57 13 27    2   3 89 95 48
   G   35 17 12       4    2    4 15 31         G   23 14 10    2   2    2   2 49
   T   10 27 60 96 96           0 21 19         T   14 11 36 95 93       4   2   2

Overall similar matrices, but old data sets too small for definite conclusions
         In vivo and in vitro specificity of CTF/NF1
   (Data from Milos Pjanic and Nicolas Mermod, University of Lausanne)

In vivo specificity: ChIP-Seq                   In vitro specificity:

Cellular source:                                HTP-SELEX
    Mouse embryonic fibroblasts
    Mapping software: Eland (Illumina)          Weight matrix from over
ChIP protocol:                                  >5000 sites
    fixation (formaldehyde) within cells
    DNA extraction
Input data for sequence analysis:
    Total:            9.7 million tags
    within repeats: 3.4 million tags
    unique regions: 6.3 million tags
Identical in vivo and in vitro binding specificity for CTF/NF1

   Training set:        Initial              Training set:
Synthetic DNA,               model:       Mouse genome.
SAGE/SELEX,                               ChIP-seq peaks,
5579 sequences,                           1265 sequences,
Length 25 bp                              Length 200 bp

   EM training:                               EM training:

In vitro specificity:                      In vivo specificity:
Identical in vivo and in vitro binding specificity for CTF/NF1

   Training set:        Initial              Training set:
Synthetic DNA,               model:       Mouse genome.
SAGE/SELEX,                               ChIP-seq peaks,
5579 sequences,                           1265 sequences,
Length 25 bp                              Length 200 bp

   EM training:                               EM training:

In vitro specificity:                      In vivo specificity:
           Selective occupancy of genomic TF target sites

General data processing pipeline:

       ChIP-Seq data                       Annotated list of
                                           target sites:
 Peak finding
 algorithm                                  - genome position
                                            - matrix score
       Peak positions                       - count coverage

                            genome scan
       Binding site                         Potential target
       weight matrix                        sites in genome
Only a small fraction of potential STAT1 sites occupied in vivo!


STAT1 matrix       A: 0 -12 -12 -1 -6      0   0 -8    5  5        2
                   C: 0 -9 -10    5   4    0 -13 -12 -4 -13       -2
                   G: -2 -13 -4 -12 -13    0   4   5 -10 -9        0
                   T: 2    5  5 -8    0    0 -6 -1 -12 -12         0
                               Open question:
    Does STAT1 recognize different types of motifs in vivo!

  Another group has identified two different motifs in the same data sets.

Jothi et al. (2008) Genome-wide identification of in vivo protein-DNA binding sites
from ChIP-Seq data. Nucleic Acids Res. 36:5221.
                  The “helper motif” hypothesis

Helper motifs:
   • occur in the vicinity in vivo occupied motif matches
   • bind to the same or a different transcription factor
   • synergistic effects through protein interactions
   • spacing between motifs may be critical
   • symmetric relationship: motifs help each other
   • interacting motifs are part of cis-regulatory modules (CRMs)
 How to demonstrate synergistic interactions between motifs

    1. define matrix for motif A from ChIP-Seq data
    2. search genome for motif A matches
    3. compile genomic motif A sites with high ChIP tag coverage
    4. as control set, compile high-scoring matches with low coverage
    5. Compute distance correlation plot:
          • motif B (predicted) against motif A in vivo occupied
          • Motif B (predicted) against motif A non-occupied
           Synergistic motif interactions: STAT1 site pairs
    Biological background:
        • STAT proteins can form tetramers
        • known examples of STAT site pairs in functional enhancers
        • STAT1 motif pairs preferentially attract STAT1 protein
        • the optimal distance can be inferred from ChIP-Seq data
    Data sets:
        • genomic sites with score ≥ 30, coverage ≥ 30
        • control: genomic sites with score ≥ 40, coverage = 0.

Example: Il2rα enhancer with 2 STAT5 sites (Lécine et al. 1997, Mol Cell Biol)
Occurrence of STAT1 motifs downstream of occupied STAT1 motifs

   • 6% of in vivo STAT1 sites have second STAT1 motif 21 bp downstream
   • very strong preference for exact spacing of 21 (center-to-center)
   • no second STAT1 sites in control sequences
             Too good to be true! The MER41B story

A number of strange findings led to following discovery:

   •   A rare LTR repetitive element (MER41B) contains STAT1 motifs pairs
   •   Most STAT1 motif pairs with 21 bp distance come from MER41B
   •   Most of the STAT1 site pairs in MER41B are occupied in vivo
   •   MER41B explains motif M2 found by Jothi et al. (2008).
   •   A sizable fraction (>5%) of in STAT1 sites reside in MER41B elements

 Lesson to be learned: mask repeats before analyzing motifs

Total set of in vivo STAT motif pairs: 4307 sites, after masking 2901.
A small peak around 20 remains in the repeat masked set!
STAT1 motif pair analysis in motif repeat-masked sequences

   • a clear but lower peak remains after repeat masking
   • optimal distance between STAT1 pairs 18-21 bp.
   • weaker over-representation of STAT1 sites up to 70 bp downstream
    Correlation of in vivo occupied TFBS conservation with
               genomic and epigenetic properties

Data :
   ChIP peak positions for 15 TFs in mouse ES cells (Chen et al. 2008, Cell)
   Histone modificationsfor mouse ES cells (Mikkelsen et al. 2007, Nature)
   Cross-species conservation scores (UCSC genome browser database)
   Transcription start annotations (ENSEMBL release 50)
   Feature correlation analysis with TFBS peaks as reference positions
   Hierarchical clustering of TFBS
   To identify and assess explanatory variable for in vivo occupancy
   To classify transcription factors based on in vivo binding preferences
               Cross-Species Conservation Patterns

   • peak size: Suz12 >> c-Myc, Zfx > Klf4 > Sox2, Nanog, Smad1 > Ctcf
   • CTCF has very narrow peak → stand-alone elements ?
                        Promoter association

   • C-Myc, Zfx, Khlf4 strongly promoter associated (same peak shape)
   • Suz12 also promoter associated but with broader peak
   • other factors show no or only weak promoter association
                        H3K4me3 association

   • Same trends as for TSS analysis but broader peak
       H3K27me3 association – repressive chromatin mark

   • Strong and broad peak for Suz12
   • No peaks of negative correlations for all other factors
Clustering of TF binding sites based by genomic features

                               TF classes observed:

                               1. Promoter-associated: c-Myc, n-Myc,
                                  E2f1, Zfx, Klf1
                               2. Weakly conserved: Stat3, Tcfcp2l1,
                               3. Nanog/Sox2-like: Nanog, Sox2,
                                  Smad1, p300, Ctcf


                                 Suz12 (repressive function)
                                 Oct4 (clustering artifact ?)
                          Some conclusions

ChIP-Seq makes in vivo TF binding sites visible
ChIP-Seq combined with motif search reaches single bp resolution
Transcription factors appear to have same specificity in vitro and in vivo
Only a small fraction of high-scoring binding sites are occupied in vivo
Analysis of ChIP-Seq data can identify helper motifs
Always use repeat-masked sequences for motif analysis
In vivo sites of transcription factors have distinct:
    • cross-species conservation patterns
    • Preferences for different types of control regions (promoters, others)
    • chromatin context (H3K4me3, H3K27me3, etc)
                                 Many thanks to:

ChIP-Seq server specifically :

    Giovanna Ambrosini,
    Christoph Schmid

Other web servers, data management infrastructure:

    Peter Sperisen
    Viviane Praz

Important ongoing collaborations:

    Milos Pjanic, Nicolas Mermod (UNIL):
    ChIP-Seq for CTF/NF1 in mouse cells

All researchers who made their ChIP-Seq data publicly available!