VERTEBRATE GENOME EVOLUTION AND FUNCTION

					  Evolutionary and genomic approaches to
      find gene regulatory sequences


Penn State University, Center for Comparative Genomics and
    Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton
    Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
Children’s Hospital of Philadelphia: Mitch Weiss
NimbleGen: Roland Green



                                 University of Nebraska, Lincoln, February 14, 2007
     Major goals of comparative genomics

• Identify all DNA sequences in a genome that are
  functional
   – Selection to preserve function
   – Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of
  sequence
• Provide bioinformatic tools so that anyone can easily
  incorporate insights from comparative genomics into their
  research
     Known types of gene regulatory regions




G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.
Regulatory regions tend to be clusters of
   transcription factor binding sites




                     Sequence-specific




          SV40 promoters and enhancer
    Properties of known regulatory regions


• Binding sites for transcription factors, many with
  sequence specificity
• Clusters of binding sites
• Conventional promoters encompass major start
  sites for transcription
• Conserved over evolutionary time???
       Structures involved in transcription are probably
                        more complex




      Middle image:
        Green: active transcription (Br-UTP label)
        Red: all nucleic acids
        HeLa cell
      Sides: EM spreads of transcripts



Peter R. Cook, Oxford University,
http://users.path.ox.ac.uk/~pcook/images/Images.html
 Domain opening is associated with movement to
         non-heterochromatic regions




Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev. 14: 940-950
     Other possible activities for sequences
          involved in gene regulation

• Opening or closing a chromosomal domain
• Move a gene to or away from a transcription factory
• Control how long a gene is in a transcription factory
    – Long association
       • High level expression
       • Really long gene
    – Short association
       • Lower level expression
       • Rapid regulation
•   Are these conserved over evolutionary time?
                         3 modes of evolution




Sequence matches at longer phylogenetic distances could reflect purifying selection
Sequence differences at closer phylogenetic distances could reflect adaptive evolution.
              Conservation vs. Constraint


• Conserved sequences are those that align between two
  species thought to be descended from a common
  ancestor
• Constrained sequences show evidence in their
  alignments of negative (purifying) selection
   – E.g. change at a rate significantly slower than “neutral”
      DNA
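
One way to make "a rate significantly slower than neutral" concrete is sketched below. This is an illustration only, not the method behind phastCons or any other measure named in this talk. It assumes a pairwise alignment given as two gapped strings and a neutral mismatch rate estimated elsewhere (for example from ancestral repeats). The function names and the binomial model are assumptions made for the example.

```python
import math

def mismatch_count(seq_a, seq_b):
    """Count ungapped aligned columns and the mismatches among them."""
    aligned = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a != b:
            mismatches += 1
    return aligned, mismatches

def constraint_p_value(aligned, mismatches, neutral_rate):
    """One-sided binomial probability of seeing this few mismatches
    (or fewer) if every column changed at the neutral rate."""
    return sum(
        math.comb(aligned, k)
        * neutral_rate ** k
        * (1 - neutral_rate) ** (aligned - k)
        for k in range(mismatches + 1)
    )

# Hypothetical window: 100 aligned columns, 5 mismatches, neutral rate 0.3.
p = constraint_p_value(100, 5, 0.3)
```

A segment can be conserved (it aligns) without being constrained: it counts as constrained here only when the probability is small.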
                        Ideal cases for interpretation
[Figure: three panels plotted against position along chromosome, each with a neutral DNA baseline]
• Human vs mouse (similarity): negative (purifying) selection.
  DNA segments with a function common to divergent species.
• Human vs rhesus (similarity): positive (adaptive) selection.
  DNA segments in which change is beneficial to at least one of the two species.
• P (not neutral)
    Messages about evolutionary approaches to
          predicting regulatory regions
• Regulatory regions are conserved, but not all to the same
  phylogenetic distance.
• Incorporation of pattern and composition information
  along with conservation can lead to effective
  discrimination of functional classes (regulatory potential).
• Regulatory potential in combination with conservation of a
  GATA-1 binding motif is an effective predictor of enhancer
  activity.
• In vivo occupancy by GATA-1 suggests other activities in
  addition to enhancers.
• Comparison of polymorphism and divergence from closely
  related species can reveal regulatory regions that are
  under recent selection.
        Finding all gene regulatory regions is a
         challenge for comparative genomics




•   Known regulatory regions for the HBB complex
•   23 total
•   19 conserved (align) between human and mouse
•   Many others show no significant difference in a measure
    of constraint (phastCons) from the bulk or neutral DNA
   Two extremes of constraint in TRRs
                      ENCODE projects

• ENCODE (ENCyclopedia Of DNA Elements): consortium
  aiming to find function for all human DNA sequences
   – Phase I focused on 1% of human DNA
   – 30 Mb, 44 regions
        • About 10 regions had known genes of interest (CFTR, HOX)
        • Others were chosen to get a sampling of regions varying in gene
          density and alignability with mouse
• Major areas
   –   Genes and transcripts
   –   Transcriptional regulation
   –   Chromatin structure
   –   Multiple sequence alignment
   –   Variation in human populations
Biochemical assays for protein-binding sites in DNA




• Purified protein & naked DNA
• Chromatin immunoprecipitation: DNA sites occupied by a protein
  inside cells.
ChIP-on-chip to examine many sites
 Putative transcriptional regulatory regions = pTRRs

• Antibodies vs 10 sequence-specific factors:
   – Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA
     Receptor A
   – High resolution ChIP-chip platforms: Affymetrix and NimbleGen
   – Data from several different labs in ENCODE consortium
• High likelihood hits for ChIP-chip
   – 5% false discovery rate
• Supported by chromatin modification data
   – Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2,
     H3K4me3, etc.
   – DNase hypersensitive sites (DHSs) or nucleosome depleted sites
• Result: set of 1369 pTRRs
       A small fraction of cis-regulatory modules are
           conserved from human to chicken

• About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters
  active in multiple cell lines
• Tend to regulate genes whose products control transcription
  and development
[Figure: phylogenetic tree with divergence times of 91, 173, 310 and 450 million years]

                                               David King
Most pTRRs are conserved in eutherian mammals
Percentage of class that align no further than:
                   pTRRs    DNase HSs    Promoters
  Primates:          3%        11%         1-13%
  Eutherians:       71%        70%          63%
  Marsupials:       21%        14%        16-28%
  Tetrapods:         4%         4%          4-7%
  Vertebrates:       1%         1%          2-4%
[Figure: phylogenetic tree with divergence times of 91, 173, 310 and 450 million years]



Within aligned noncoding DNA of eutherians, need to distinguish constrained
DNA (purifying selection) from neutral DNA.
Measures of conservation and constraint
   capture only a subset of pTRRs




    Measure                                             Interpretation
    Fraction overlapping an MCS                         Stringent constraint
    phastCons (background rate corrected)               Allows a range of constraint
    Composite alignability (background rate corrected)  Aligns, but no inference about
                                                        purifying selection
              Different measures perform better on specific
                           functional regions
[Figure: ROC curves; x-axis: 1-Specificity, y-axis: Sensitivity]
Examples of clade-specific pTRRs
Regulatory potential (RP) to distinguish
          functional classes
Good performance of ESPERR for gene
       regulatory regions (RP)




            James Taylor, Francesca Chiaromonte
Conservation of predicted binding sites for
          transcription factors

Binding site for GATA-1
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1.
Can rescue by expressing an estrogen-responsive form of GATA-1
Rylski et al., Mol Cell Biol. 2003
Predicted cis-Regulatory Modules (preCRMs)
          Around Erythroid Genes




                    B: Yong Cheng, Ross, Yuepin Zhou, David King
                    F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
preCRMs with conserved consensus GATA-1 BS
   tend to be active on transfected plasmids
preCRMs with conserved consensus GATA-1
 BS tend to be active after integration into a
               chromosome
Examples of validated preCRMs
Correlation of Enhancer Activity with RP Score
Validation status for 99 tested fragments
 preCRMs with High RP and Conserved
Consensus GATA-1 Tend To Be Validated
         CACC box helps distinguish validated from
                nonvalidated preCRMs
   All validated preCRMs and all nonvalidated preCRMs:
   same parameters, compare the outputs

    Consensus for EKLF binding site:

       CCNCMCCCW

               Ying Zhang
preCRMs with conserved consensus GATA-1 binding
sites are usually occupied by that protein: ChIP assay
     Design of ChIP-chip for occupancy by GATA-1

1. Non-overlapping tiling array with 50bp probes and 100bp
   resolution (NimbleGen)
   [Diagram: segments labeled 50 and 50 spanning 100]
2. Cover range
   Mouse chr7:57225996-123812258 (~70Mbp)
3. Antibody against the ER portion of GATA-1-ER protein
   in rescued G1E-ER4 cells
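
The tiling design can be sketched as follows, to give a feel for how many probes a region of this size requires. This is a rough illustration, not the actual NimbleGen design: it assumes 1-based inclusive coordinates and one 50bp probe starting every 100bp, and `tile_probes` is a hypothetical helper.

```python
def tile_probes(start, end, probe_len=50, step=100):
    """Yield (probe_start, probe_end), 1-based inclusive, for probes that
    fit entirely inside [start, end] with one probe every `step` bases."""
    for s in range(start, end - probe_len + 2, step):
        yield s, s + probe_len - 1

# Region quoted on the slide: mouse chr7:57225996-123812258
REGION_START, REGION_END = 57225996, 123812258
region_len = REGION_END - REGION_START + 1

probes = list(tile_probes(REGION_START, REGION_END))
print(f"region length: {region_len:,} bp; probes: {len(probes):,}")
```

Under these assumptions the region is about 66.6 Mb, consistent with the "~70Mbp" quoted on the slide, and it needs roughly two thirds of a million probes.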


                    Yong Cheng, with Mitch Weiss & Lou Dore
                    (CHoP), Roland Green (NimbleGen)
Signals in known occupied sites in Hbb LCR
      HS1        HS2                    HS3




     1) Cluster of high signals
     2) “hill” shape of the signals
             Peak Finding Programs
• TAMALPAIS
  Mark Bieda from Peggy Farnham’s lab
  Focuses more on the cluster of the signals
  4 thresholds based on number of consecutive probes
    with signals in the 98th or 95th percentiles


• MPEAK
  Bing Ren’s lab
  Focuses more on the “hill” shape of the signal
  4 thresholds, for a series of probes with at least one that
    is 3, 2.5, 2 or 1 standard deviations above the mean
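
The "cluster of high signals" idea can be illustrated with a small sketch that reports runs of consecutive probes at or above a percentile cutoff. This is a simplified stand-in written for this note, not the TAMALPAIS or MPEAK code. The minimum run length (`min_probes`) and the nearest-rank percentile are assumptions, because the slide does not give the exact settings.

```python
def percentile_cutoff(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def find_runs(signals, pct=98, min_probes=6):
    """Return (first_index, last_index) for each run of at least
    `min_probes` consecutive probes at or above the percentile cutoff."""
    cutoff = percentile_cutoff(signals, pct)
    runs, start = [], None
    for i, value in enumerate(signals):
        if value >= cutoff:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_probes:
                runs.append((start, i - 1))
            start = None
    # close a run that extends to the last probe
    if start is not None and len(signals) - start >= min_probes:
        runs.append((start, len(signals) - 1))
    return runs

# Hypothetical signal track: flat background with one six-probe cluster.
track = [0.1] * 100
for i in range(40, 46):
    track[i] = 5.0
hits = find_runs(track, pct=95, min_probes=6)
```

A "hill"-shape finder in the spirit of MPEAK would instead look for a local maximum some number of standard deviations above the mean, flanked by declining neighbors.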
  ChIP-chip hits for GATA-1 occupancy

                   Technical replicates of ChIP-chip
                   with antibody against GATA1-ER

      Mpeak: 275 hits in both
      TAMALPAIS: 276 hits in both
      [Venn diagram: 59 Mpeak only, 216 shared, 60 TAMALPAIS only]

                      321 total ChIP-chip hits
     ChIP-chip hits validate at a high rate
 Validation determined by quantitative PCR.
 19 of the 321 hits were tested.
 13 (~70%) were validated.



[Figure: quantitative PCR on ChIP DNA]




 Validation rate is similar at different thresholds

  9 regions were “hits” in only one of the two technical replicates.
  None were validated.
      Association of WGATAR and
     conservation with ChIP-chip Hits

1. 249 out of the 321 (78%) have WGATAR
   motifs, binding site for GATA-1
2. Of the GATA-1 binding motifs in those 249
   hits, 112 (45%) are conserved between
   mouse and at least one non-rodent species.
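
Searching for WGATAR on both strands can be sketched as below. This is a minimal illustration, not the pipeline behind the counts on this slide. It expands IUPAC codes (W = A or T, R = A or G, and so on) into a regular expression and also scans for the reverse complement. The names `IUPAC`, `reverse_complement` and `find_motif` are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the motifs in this talk
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYWSMKN", "TGCAYRWSKMN")

def reverse_complement(motif):
    """Reverse complement of an IUPAC motif, e.g. WGATAR -> YTATCW."""
    return motif.translate(COMPLEMENT)[::-1]

def find_motif(sequence, motif="WGATAR"):
    """Return sorted (0-based start, strand) pairs for every match,
    including overlapping ones, on either strand."""
    seq = sequence.upper()
    hits = []
    for strand, m in (("+", motif), ("-", reverse_complement(motif))):
        # the lookahead lets overlapping matches be reported
        pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in m) + "))")
        hits.extend((match.start(), strand) for match in pattern.finditer(seq))
    return sorted(hits)

forward_hit = find_motif("ccAGATAAgg")
reverse_hit = find_motif("TTATCT")
eklf_hit = find_motif("CCACACCCT", motif="CCNCMCCCW")
```

The same function handles the EKLF consensus CCNCMCCCW quoted earlier. Deciding whether a hit is conserved would additionally require checking the aligned positions in other species, which this sketch does not attempt.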
Expected and unexpected ChIP-chip hits
Distribution of ChIP-chip hits on 70Mb of
                mouse chr7




        Yong Cheng, Yuepin Zhou and Christine Dorman
Almost half the GATA-1 ChIP-chip hits increase
expression of a transgene, K562 cells
[Figure: bar chart of fold change over parent (axis 0 to 4) for tested fragments, grouped as GATA-1 occupied sites by ChIP-chip and No GATA-1]
24 validated out of 56 fragments with ChIP-chip hits tested (43%)
   Conserved and nonconserved ChIP-chip
      hits can be active as enhancers


[Figure panels: Conserved, active; Conserved, not active; Not conserved, active]
Polymorphism as a transient phase of evolution




Slide from Dr. Hiroshi Akashi
Test of neutrality using polymorphism and
             divergence data
Test for recent selection in human noncoding DNA

•   McDonald-Kreitman test
•   Use ancestral repeats as neutral model (MKAR test)
•   Count polymorphisms in human using dbSNP126
•   Count divergence of human from
    – Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
    – Rhesus macaque (Old World monkey, diverged from human lineage 23
      Myr ago)
• Tiled windows, most analysis on 10kb windows
• Compute p-value for neutrality by chi-square test
• The ratio of the two polymorphism-to-divergence ratios (window vs.
  ancestral repeats) indicates the direction of inferred selection
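
The test on one window can be sketched as a 2x2 chi-square comparison of polymorphism and divergence counts in the window against the same counts in ancestral repeats. This is a minimal sketch under stated assumptions, not the code used for the analysis. It uses no continuity correction, the counts are made up, and `mkar_test` is a hypothetical name. With one degree of freedom the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`, so no statistics library is needed.

```python
import math

def mkar_test(poly_win, div_win, poly_ar, div_ar):
    """Chi-square test (1 d.f.) on the 2x2 table
           window            : polymorphisms, divergent sites
           ancestral repeats : polymorphisms, divergent sites
    Returns (chi2, p_value, ratio_of_ratios)."""
    table = [[poly_win, div_win], [poly_ar, div_ar]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 d.f.
    ratio = (poly_win / div_win) / (poly_ar / div_ar)
    return chi2, p_value, ratio

# Hypothetical counts for one 10kb window and its ancestral-repeat reference
chi2, p_value, ratio = mkar_test(10, 40, 100, 100)
```

A ratio below 1 means the window has relatively more divergence than polymorphism compared with the neutral reference, and a ratio above 1 means relatively more polymorphism. The slides do not spell out how each direction is mapped to a type of selection.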


                          Heather Lawson, Anthropology, PSU
pTRR apparently under positive selection
A promoter distal to the beta-like globin genes
  has a signal for recent purifying selection
Selection on a primate-specific promoter
The distal promoter is close to the locus control
         region for beta-globin genes
                                Many thanks …



PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang




RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang,
Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler


             Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
            Computing Regulatory Potential (RP)
           Alignment          seq1  G T A C C T A C T A C G C A
                              seq2  G T G T C G - - A G C C C A
                              seq3  A T G T C A - - A A T G T A
           Collapsed alphabet       1 2 1 3 4 5 7 7 6 8 3 6 3 9

• A 3-way alignment has 124 types of columns (each of the 3 rows holds A, C, G, T or a gap,
  and the all-gap column is excluded: 5^3 - 1 = 124). Collapse these to a smaller alphabet
  with characters s (for example, 1-9).

• Train two order-t Markov models for the probability that t alignment columns are followed
  by a particular column, one on each training set:
     – positive (alignments in known regulatory regions)
     – negative (alignments in ancestral repeats, a model for neutral DNA)
     – E.g., frequency that 3 4 is followed by 5:
           0.001 in regulatory regions
           0.0001 in ancestral repeats

• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of
  alignment characters in known regulatory regions vs. ancestral repeats:
$$ RP = \sum_{a \in \text{segment}} \log \frac{p_{REG}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{AR}(s_a \mid s_{a-1} \ldots s_{a-t})} $$
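
The RP definition above can be sketched in a few lines: estimate two order-t Markov models over the collapsed alphabet, then sum the log likelihood ratios along a segment. This is a toy illustration, not the published RP implementation. The training strings, the add-one pseudocount, the natural logarithm, and the names `train_markov` and `regulatory_potential` are all assumptions.

```python
import math
from collections import defaultdict

def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a in range(order, len(seq)):
            context = tuple(seq[a - order:a])
            counts[context][seq[a]] += 1.0

    def prob(symbol, context):
        ctx_counts = counts.get(tuple(context), {})
        total = sum(ctx_counts.values()) + pseudocount * len(alphabet)
        return (ctx_counts.get(symbol, 0.0) + pseudocount) / total

    return prob

def regulatory_potential(segment, p_reg, p_ar, order):
    """Sum of log likelihood ratios over the positions that have a full context."""
    score = 0.0
    for a in range(order, len(segment)):
        context = segment[a - order:a]
        score += math.log(p_reg(segment[a], context) / p_ar(segment[a], context))
    return score

alphabet = list(range(1, 10))        # collapsed alphabet 1-9, as on the slide
order = 2                            # t = 2 (assumed for this toy example)
# Hypothetical training data: "3 4" is followed by 5 in the positive set
# and by 6 in the negative set.
positive = [[3, 4, 5, 3, 4, 5, 3, 4, 5]]
negative = [[3, 4, 6, 3, 4, 6, 3, 4, 6]]
p_reg = train_markov(positive, order, alphabet)
p_ar = train_markov(negative, order, alphabet)

rp_pos = regulatory_potential([3, 4, 5], p_reg, p_ar, order)
rp_neg = regulatory_potential([3, 4, 6], p_reg, p_ar, order)
print(rp_pos, rp_neg)
```

With the slide's example frequencies, the step "3 4 followed by 5" would contribute log(0.001 / 0.0001) = log 10 to the sum, a positive term. The slide does not specify the base of the logarithm.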
             Stage 1: Reduced representations




ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations
(figure labels: gap, G, T)
Stage 2: Improve encoding
                Train models for classification
                                            6 6 2 may occur frequently in positive
                                            training set and rarely in the negative
                                            training set, and thus contribute to
                                            discrimination.
                                            If the positive training set is known
                                            regulatory regions, this would
                                            contribute to a positive RP.




Note that many different columns are reduced to a single “encoding” (a number in the
figure). E.g., four different columns are each called “3”.
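
To make the idea of a reduced representation concrete, the sketch below enumerates the 124 column types of a 3-way alignment and groups them with a deliberately simple rule: each column is reduced to its pattern of matches, mismatches and gaps. This rule is a hypothetical stand-in chosen for illustration only. It is not the encoding that ESPERR learns, and the names `column_types` and `identity_pattern` are assumptions.

```python
from itertools import product

SYMBOLS = "ACGT-"

def column_types(n_species=3):
    """All alignment columns over A, C, G, T and gap, minus the all-gap column."""
    return [c for c in product(SYMBOLS, repeat=n_species) if set(c) != {"-"}]

def identity_pattern(column):
    """Toy reduction: relabel bases by order of first appearance, keep gaps.
    For example, (G, G, A) becomes 'aab' and (T, -, T) becomes 'a-a'."""
    labels = {}
    out = []
    for base in column:
        if base == "-":
            out.append("-")
        else:
            if base not in labels:
                labels[base] = "abcd"[len(labels)]
            out.append(labels[base])
    return "".join(out)

columns = column_types()
encoding = {col: identity_pattern(col) for col in columns}
groups = sorted(set(encoding.values()))
fully_conserved = [col for col, code in encoding.items() if code == "aaa"]
print(len(columns), "column types reduced to", len(groups), "encodings")
```

In this toy scheme many columns share one encoding: for instance, the four fully conserved columns (AAA, CCC, GGG, TTT) all map to "aaa". A learned encoding would instead group columns so as to improve discrimination between the training sets.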
Categories of Tested DNA Segments
             Example that suggests turnover


GATA-1 BSs
     Additional methods find CACC box as distinctive for
                          validation
              All validated preCRMs                       All nonvalidated preCRMs




CLOVER (Zlab)
  Background: Mouse chr 19 (42.8% C+G), NCBI Build 30
  EKLF PWM (Dr. Perkins)

  Output for validated preCRMs
     Motif         P(mm_chr19.m)
     EKLF          0.0008
  Output for nonvalidated preCRMs
     Motif         P(mm_chr19.m)
     none          none

Hexamer Counting
     counts      validated   nonvalidated
     NCACCC         60           32
     CACCCW         56           27
     expected    validated   nonvalidated
     NCACCC         16.31        5.81
     CACCCW         11.74        4.36

ELPH (UMaryland)
             validated   non-validated
     6-mer   TTATYT      GGCAGR
     7-mer   CCWCAGM     RGRCAGR
     8-mer   CASCCWGC    CAGGGAWR
     9-mer   CCWGGCWGM   CWGRGAWRA
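
The hexamer counts above use IUPAC degenerate motifs (N = any base, W = A or T). A minimal sketch of such counting follows. The slide does not say how the expected counts were derived, so the background model here (independent bases at 42.8% C+G, split evenly between C and G and between A and T, one strand only) is an assumption, as are the function names and the test sequence.

```python
import re

# IUPAC nucleotide codes used on the slide, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "S": "CG",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
}

def count_motif(seq, motif):
    """Count occurrences of an IUPAC motif, including overlapping ones."""
    pattern = "".join("[" + IUPAC[ch] + "]" for ch in motif)
    # A zero-width lookahead matches at every start position of the motif.
    return len(re.findall("(?=" + pattern + ")", seq.upper()))

def expected_count(seq_len, motif, gc=0.428):
    """Expected count under independent bases with the given C+G fraction."""
    base_p = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "C": gc / 2, "G": gc / 2}
    p = 1.0
    for ch in motif:
        p *= sum(base_p[b] for b in IUPAC[ch])
    return max(seq_len - len(motif) + 1, 0) * p

seq = "ACACCCACACCCT"   # made-up sequence containing two CACC boxes
observed = {m: count_motif(seq, m) for m in ("NCACCC", "CACCCW")}
expected = {m: expected_count(len(seq), m) for m in ("NCACCC", "CACCCW")}
print(observed, expected)
```

Comparing observed with expected counts separately for validated and nonvalidated preCRMs then gives tables of the kind shown on the slide.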
Using Galaxy to find predicted CRMs

				