SNP Discovery and Genome Variation Analysis by zhouwenjuan


The Structure of DNA Sequence Variation in the Human Genome



Gabor Marth
marth@ncbi.nlm.nih.gov

National Center for Biotechnology Information
Computational Biology Branch
Bethesda, Maryland, USA
1. Sequence Variations
                What Are Sequence Variations?

genome reference sequence




                                         DNA sequence variations




• Single-nucleotide polymorphisms (SNPs)
• Deletion/Insertion Polymorphisms (DIPs)
                Why Study Sequence Variations?
• cause phenotypic differences
• are often associated with disease
• reflect the origins and history of populations: population size history, demography, migration
• carry signals of important molecular processes such as the mutation process and recombination
    Major Sources of Genomic Polymorphism Data
• Expressed sequence tags (ESTs)




• Random genomic reads of “The SNP Consortium” (TSC)




• Overlapping regions of genomic clone sequences
                     What do we know about SNPs?

• nucleotide diversity of the human genome

    • ~ 1 SNP / 1.3 kbp between any pair of sequences

    • frequency varies by variation type (transitions/transversions)

    • functionally constrained DNA is less diverse




• allele frequency

    • rare SNPs are more numerous

    • newer alleles are of lower frequency

    • functionally important alleles are of lower frequency
2. How To Find SNPs?
             How to detect SNPs?
[Figure: samples drawn from a population]




• detection of rare SNPs requires deeper sampling
                  Allele Frequency and Sample Size


[Figure: Prob(k alleles of N = 20) versus k alleles (0 to 20), for p = 0.02, p = 0.1, and p = 0.5]



• expected allele distribution in DNA samples matches allele
frequency in sampled population
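
The curves on this slide can be reproduced with the binomial distribution. The following is a minimal sketch, assuming the N sampled chromosomes are independent draws from a population in which the allele has frequency p; the function name `allele_count_pmf` is illustrative and does not come from the slides.

```python
from math import comb

def allele_count_pmf(k, n, p):
    """Probability of seeing exactly k copies of an allele among n sampled
    chromosomes, assuming independent draws at population frequency p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    n = 20  # sample size used on the slide
    for p in (0.02, 0.1, 0.5):
        row = [allele_count_pmf(k, n, p) for k in range(n + 1)]
        most_likely_k = max(range(n + 1), key=lambda k: row[k])
        print(f"p = {p}: most likely count k = {most_likely_k}, "
              f"Prob(k = 0) = {row[0]:.3f}")
```

For a rare allele most of the probability mass sits at k = 0, which is why such alleles are often missed in small samples.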
Allele Frequency and Sample Size (continued)

[Figure: Prob(both alleles present) versus sample size N (0 to 20), for p = 0.02, p = 0.1, and p = 0.5]


 • Frequent alleles can be found at lower sample size
 • Discovery of rare alleles requires deep sampling
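
The two statements above can be made quantitative. This sketch assumes a bi-allelic site and independent sampling of N chromosomes, so a SNP is detectable only when the sample is not monomorphic for either allele. The helper names and the 95% target are illustrative assumptions, not values from the slides.

```python
def prob_both_alleles_present(n, p):
    """Probability that a sample of n chromosomes contains both alleles of a
    bi-allelic site whose minor allele has population frequency p."""
    return 1.0 - p**n - (1.0 - p)**n

def min_sample_size(p, target=0.95, n_max=100_000):
    """Smallest n for which both alleles are present with probability >= target."""
    for n in range(1, n_max + 1):
        if prob_both_alleles_present(n, p) >= target:
            return n
    raise ValueError("target not reached within n_max samples")

if __name__ == "__main__":
    for p in (0.02, 0.1, 0.5):
        print(f"p = {p}: Prob(both alleles | N = 20) = "
              f"{prob_both_alleles_present(20, p):.3f}, "
              f"N needed for 95% = {min_sample_size(p)}")
```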
Discovery by Computational Sequence Comparison

• Explores variations in regions of
redundant sequence coverage



• Existing sequence resources can be mined
     • genomic sequence
     • ESTs
     • STSs
     • BAC-end sequences


• Typical sample size ~ 2-20 individuals


• Sequencing errors also make detection of rare alleles difficult


• Cost-effective way of genome-wide SNP discovery
Steps of SNP Mining


                 Sequence clustering


                 Paralog identification



                 Multiple alignment



                 SNP detection
                The PolyBayes Approach
1. Repeat masking (genomic reference sequence)
2. Database search (ESTs)
3. Anchored alignment
4. Paralog identification
5. SNP detection
6. Experimental validation
                   Sequence Clustering

• Clustering = database search against dbEST


• Clusters are groups of overlapping ESTs matching the genomic
reference


[Figure: ESTs aligned to the clone sequence, grouped into cluster 1, cluster 2, and cluster 3]
  Multiple Alignment with an Anchored Technique
• The genomic reference sequence serves as an anchor
    • ESTs pair-wise aligned to genomic sequence
    • insertions are propagated -- “sequence padding”




• Advantages
    • efficient -- only involves pair-wise comparisons
    • accurate -- correctly aligns alternatively spliced ESTs
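
The "sequence padding" step can be sketched as follows. This is an illustration of the idea only, not the PolyBayes implementation: it assumes the pair-wise alignments against the reference have already been computed, that each one spans the whole reference, that `-` marks a gap, and that `*` is used as the pad character. All function names are hypothetical.

```python
def split_by_reference(ref_aln, read_aln):
    """Split a pair-wise alignment into the read characters aligned to each
    reference base and the read insertions that precede each reference base."""
    bases, inserts, pending = [], [], ""
    for r, q in zip(ref_aln, read_aln):
        if r == "-":          # read has an insertion relative to the reference
            pending += q
        else:
            inserts.append(pending)
            pending = ""
            bases.append(q)
    inserts.append(pending)   # insertion after the last reference base
    return bases, inserts

def anchored_alignment(pairs, pad="*"):
    """Merge pair-wise (reference, read) alignments into one multiple alignment
    by propagating every insertion to all rows as padding."""
    parsed = [split_by_reference(r, q) for r, q in pairs]
    reference = pairs[0][0].replace("-", "")
    n = len(reference)
    # Widest insertion seen before each reference position, over all reads.
    width = [max(len(ins[i]) for _, ins in parsed) for i in range(n + 1)]

    def pad_row(bases, inserts):
        out = []
        for i in range(n + 1):
            out.append(inserts[i].ljust(width[i], pad))
            if i < n:
                out.append(bases[i])
        return "".join(out)

    padded_ref = pad_row(list(reference), [""] * (n + 1))
    return padded_ref, [pad_row(b, ins) for b, ins in parsed]

if __name__ == "__main__":
    pairs = [("AC-GT", "ACTGT"),   # read with an inserted T
             ("ACGT", "ACGA")]     # read with a mismatch only
    ref, reads = anchored_alignment(pairs)
    print(ref)
    for row in reads:
        print(row)
```

Only pair-wise comparisons against the anchor are needed; the reads are never aligned to each other.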
                      Paralog(ue) Identification



• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping



[Figure: alignment highlighting discrepancies caused by sequencing errors and by paralogous differences]
                      Bayesian Paralog Identification Algorithm
• Pair-wise comparison between EST and genomic sequence


• Calculate the number of discrepancies expected from sequencing errors, using the base quality values
• If significantly more discrepancies are observed than expected, the sequence is likely paralogous
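
A minimal sketch of this test is shown below. It is not the published algorithm: it assumes Phred-scaled base qualities, models the number of discrepancies as Poisson under both the native and the paralogous model, and uses illustrative values for the polymorphism rate, the paralog divergence, and the prior. None of these numbers are taken from the slides.

```python
import math

def expected_errors(quals):
    """Expected number of miscalled bases, from Phred-scaled base qualities."""
    return sum(10 ** (-q / 10) for q in quals)

def poisson_pmf(d, lam):
    """Poisson probability of observing d events when lam are expected."""
    return math.exp(-lam) * lam ** d / math.factorial(d)

def prob_native(d, quals, poly_rate=0.001, paralog_divergence=0.02,
                prior_native=0.5):
    """Posterior probability that a sequence with d discrepancies against the
    genomic reference is native rather than paralogous.
    poly_rate, paralog_divergence and prior_native are assumed values."""
    n = len(quals)
    lam_err = expected_errors(quals)
    lam_nat = lam_err + poly_rate * n            # errors + true polymorphisms
    lam_par = lam_err + paralog_divergence * n   # errors + paralogous differences
    like_nat = prior_native * poisson_pmf(d, lam_nat)
    like_par = (1 - prior_native) * poisson_pmf(d, lam_par)
    return like_nat / (like_nat + like_par)

if __name__ == "__main__":
    quals = [30] * 500   # a 500-base read with uniform quality 30
    for d in range(0, 16, 3):
        print(f"d = {d:2d}  P(native | d) = {prob_native(d, quals):.4f}")
```

The posterior falls as the observed discrepancies exceed what the base qualities can explain.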

[Figure: Paralog discrimination. P(d|Model_NAT), P(d|Model_PAR), and P(Model_NAT|d) versus the number of discrepancies d (0 to 15)]
                                    Paralog Removal
               Distribution of P(NAT) probability values

[Figure: histogram of the number of sequences (0 to 1200) by P(NAT), binned from 0.05 to 0.95, with the paralog and native classes marked along the P(NAT) axis]
                        SNP Detection

Goal: to discern true variation from sequencing error




[Figure: examples of a sequencing error and a polymorphism]
                             Prior Information

• Sequence context

• Expected polymorphism rate -- e.g. 1 / 1000 bp

• Relative rate of specific variation

[Figure: relative occurrence (0 to 70) by variation type: AC, AG, AT, CG]

• Sample size (alignment depth)

[Figure: Prob(k alleles of N = 20) versus k alleles (0 to 20), for p = 0.02, p = 0.1, and p = 0.5]
         Observed Data




True SNP in low quality data

                          Sequencing errors
                                     Bayesian-Statistical SNP Detection

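http://genome.wustl.edu/gsc/polybayes

$$
P(\mathrm{SNP}) =
\frac{\displaystyle\sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \,
      P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \,
      P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

Labels on the terms of the formula:
• P(SNP): likelihood of polymorphism
• P(S_i | R_i): base call and base quality
• P_Prior(S_1, ..., S_N): polymorphism rate
• P_Prior(S_i): base composition
• N: depth of coverage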
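
The formula can be evaluated by brute force for a single alignment column. The sketch below is a simplified illustration, not the PolyBayes code. It assumes a uniform base composition prior of 0.25, Phred-scaled qualities with errors spread evenly over the three other bases, and a joint prior that gives mass `poly_rate` to bi-allelic columns (shared equally among them) and zero to columns with three or more alleles. Those modelling choices and all function names are assumptions.

```python
from itertools import product

BASES = "ACGT"
BASE_PRIOR = 0.25  # assumed uniform base composition

def base_posterior(call, qual):
    """P(S | R): probability of each true base given a base call and its
    Phred quality, spreading the error evenly over the other three bases."""
    err = 10 ** (-qual / 10)
    return {b: (1 - err) if b == call else err / 3 for b in BASES}

def joint_prior(config, poly_rate):
    """Assumed joint prior over a column of true bases (see lead-in)."""
    n_alleles = len(set(config))
    if n_alleles == 1:
        return (1 - poly_rate) / 4
    if n_alleles == 2:
        # 6 allele pairs, each with 2**n - 2 assignments that use both alleles
        return poly_rate / (6 * (2 ** len(config) - 2))
    return 0.0

def snp_probability(calls, quals, poly_rate=0.001):
    """Posterior probability that the column is polymorphic, obtained by
    enumerating all 4**N base configurations (practical for small N only)."""
    posts = [base_posterior(c, q) for c, q in zip(calls, quals)]
    variable, total = 0.0, 0.0
    for config in product(BASES, repeat=len(calls)):
        weight = joint_prior(config, poly_rate)
        if weight == 0.0:
            continue
        for post, base in zip(posts, config):
            weight *= post[base] / BASE_PRIOR
        total += weight
        if len(set(config)) > 1:
            variable += weight
    return variable / total

if __name__ == "__main__":
    print(snp_probability("AAAGGG", [40] * 6))        # two well-supported alleles
    print(snp_probability("AAAAAG", [40] * 5 + [5]))  # one low-quality discrepancy
```

A discrepancy supported only by a low-quality base call is explained more cheaply as a sequencing error, so its SNP probability stays low.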
        Computational SNP Candidate




SNP score
      SNP Score = Predicted Confirmation Rate


[Figure: SNP confirmation rate (0 to 80) and SNPs confirmed, by P(SNP) bin: 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00]


• Higher score corresponds to higher confirmation rate

• the SNP score allows one to balance confirmation rate and the
recovery of rare SNPs (or SNPs in low quality data)
                             SNP Scores are Realistic

EST SNPs - candidates validated by sequencing in population-specific pools
Samples: African, Asian, Caucasian, Hispanic, CHM 1

[Figure: SNP confirmation rate (0 to 80) and SNPs confirmed, by P(SNP) bin: 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00]

TSC SNPs - candidates validated by re-sequencing

[Figure: confirmation rate [%] (0 to 100) by SNP score [%] bin: 51-60, 61-70, 71-80, 81-90, 91-100]
            Artifacts Leading to False Positives
Internal priming




Reverse transcriptase error (or rare allele, or individual variation)




Trace compression
           Deletion/Insertion Polymorphisms (DIPs)

• There is no “base quality” value for “deleted” nucleotide(s)
• Sequencing chemistry context-dependent
• No reliable prior expectation for INDEL rates of various classes
                Sequencing Based Genotyping
• Sequences collected from a single genomic location
• No clustering or sequence paralog problem




• Sequence traces may represent heterozygous (non-unique) DNA
• Base quality values do not convey heterozygous trace information
• Trace data needs to be analyzed




                                 Heterozygous trace segment
The PolyPhred Software



                      Heterozygous trace peak




                      Homozygous trace peak



      http://droog.mbt.washington.edu/PolyPhred.html
3. Initial Lessons
                    The Current SNP Resource

http://www.ncbi.nlm.nih.gov/SNP

• the current public resource (dbSNP) contains nearly 2 million SNPs
• dense genome map of polymorphic markers

• what can we learn about the forces that shape human variability?
• what is its power for statistical association studies?

• we examined SNPs in overlaps of large-insert genomic clone sequences
                  SNP Mining Results

• ~ 30,000 clones
• 25,901 clones (7,122 finished, 18,779 draft with base quality values)
• 21,020 clone overlaps (124,356 fragment overlaps)
• 507,152 candidate SNPs
• 83% verification rate in pooled samples

[Figure: clone sequences CloneX and CloneY in FASTA format]

Example of a candidate SNP in a clone overlap:

   ACCTAGGAGACTGAACTTACTG
   ACCTAGGAGACCGAACTTACTG
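
The core comparison behind overlap mining can be sketched in a few lines. This sketch assumes the two overlapping sequences are already aligned without gaps and ignores base quality, which the actual mining pipeline took into account; the function name is illustrative. The input is the 22-base example pair shown on this slide.

```python
def find_mismatches(seq_a, seq_b):
    """Return (0-based position, base in seq_a, base in seq_b) for every
    position where two equal-length, gap-free aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
            if a != b and a in "ACGT" and b in "ACGT"]

if __name__ == "__main__":
    overlap_x = "ACCTAGGAGACTGAACTTACTG"
    overlap_y = "ACCTAGGAGACCGAACTTACTG"
    for pos, a, b in find_mismatches(overlap_x, overlap_y):
        print(f"candidate SNP at position {pos + 1}: {a}/{b}")
```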
                          Ascertainment in Shallow Sequence Coverage

[Figure: sequence length [Mb] (0 to 600) and SNP rate [per 10,000 bp] (0 to 16) by depth of coverage (2 to 8)]

• SNP rate increases with the depth of coverage

[Figure: fraction of SNPs (0 to 0.5) by minor allele frequency bin: 0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5]

• Ascertainment in shallow coverage is biased towards common SNPs
                       Nucleotide Diversity

• Genome average nucleotide diversity of 6.51 x 10^-4 per nucleotide

• Significant differences between human chromosomes




• Large amount of heterogeneity at every level of genome organization
                Recombination Rates Vary in the Genome

[Figure: SNP rate [per 10,000 bp] (5 to 10) versus recombination rate [per Mb] (0 to 4)]




• recombination rate varies widely in the genome

• recombination rate positively correlated with variation rate
                                    Mutation Rate Depends on Base Composition

[Figure, left panel: G+C nucleotide content. SNP rate [per 10,000 bp] (5 to 8) versus G+C content [%] (30 to 54)]
[Figure, right panel: CpG di-nucleotide content. SNP rate [per 10,000 bp] (5 to 8) versus CpG content [%] (0.3 to 5.7)]

• strong positive correlation with variation rate
• 12-fold increase in CpG di-nucleotide sequence
• 24.2 % of all SNPs were variations in CpGs
• bi-modality in the mutation process
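
The two base-composition measures on the plot axes can be computed per sequence window as sketched below. The slides do not give the exact definition of "CpG content"; this sketch assumes it is the percentage of di-nucleotide positions occupied by a CG di-nucleotide. The function names are illustrative.

```python
def gc_content_percent(seq):
    """Percentage of bases in the window that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def cpg_content_percent(seq):
    """Percentage of di-nucleotide positions that are CpG (assumed definition)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two bases")
    n_cpg = sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1))
    return 100.0 * n_cpg / (len(seq) - 1)

if __name__ == "__main__":
    window = "ACGCGTTAGCCGA"
    print(f"G+C content: {gc_content_percent(window):.1f} %")
    print(f"CpG content: {cpg_content_percent(window):.1f} %")
```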
      Diversity Varies in Human Repeat Families


• overall diversity in known human repeats is only 6.8 % over genome average


• large differences in diversity between repeat families



• highest diversity observed in Alu sequences (29 % over genome average)




• repeat sequences contribute to differential mutation rates

• possible additional roles in recombination
Selection: Diversity in Functional Units of Genes

• diversity follows functional constraint

       3’ UTR                       5.00 x 10^-4
       5’ UTR                       4.95 x 10^-4
       Exon, overall                4.20 x 10^-4
       Exon, coding                 3.77 x 10^-4

• coding sequence changes

       1st codon position           152 / 653
       2nd codon position           127 / 653
       3rd codon position           374 / 653

       Synonymous changes     366 / 653
       Non-synonymous changes 287 / 653
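
The counts above are easier to compare as fractions of the 653 coding changes. A short sketch using only the numbers listed on this slide (the helper name is illustrative):

```python
codon_position_counts = {"1st": 152, "2nd": 127, "3rd": 374}
change_type_counts = {"synonymous": 366, "non-synonymous": 287}

def as_fractions(counts):
    """Convert a dict of counts into a dict of fractions of their total."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

if __name__ == "__main__":
    for label, counts in (("codon position", codon_position_counts),
                          ("change type", change_type_counts)):
        for name, frac in as_fractions(counts).items():
            print(f"{label:15s} {name:15s} {100 * frac:5.1f} %")
```

Most changes fall at the third codon position, consistent with diversity following functional constraint.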
4. Signals of Population History
                     Theory of Sequence Variation

• fundamental forces that structure sequence variation
    • genetic drift (T)
    • population history (N)
    • recombination (r)
    • mutation process (µ)
    • selection (s)

• these forces shape observable SNP distributions
[Figure: marker density. Pr(k) (0 to 0.08) versus k differences (0 to 25)]
[Figure: allele frequency. Fraction of sample (0 to 0.18) versus minor allele frequency (0.1 to 0.5)]




• good models improve our expectation of linkage disequilibrium,
haplotype structure, and statistical power in association studies
[Figure: linkage disequilibrium. D2 (0 to 0.8) versus recombination fraction (1x10^-6 to 1x10^-2)]
[Figure: haplotype structure]
                               Observable Distribution #1: Marker Density

[Figure: marker density. Example sequence pairs with k = 11, 1, 7, 12, 4, and 0 differences, next to the distribution Pr(k) (0 to 0.08) versus k differences (0 to 25)]




Marker density: the probability distribution of the number of SNPs observed when comparing sequences of a given length, e.g. the likelihood of seeing 0, 1, 2, … SNPs in a 12 kb alignment


expectation: exponential growth (Kruglyak 1999)

[Figure: population size model from past (N2 = 10 thousand) to present]
[Figure: Pr(k) (0 to 0.25) versus k differences [12 kb] (0 to 35), comparing the Kruglyak 1999 expectation with the observed data]

• poor fit to observed data
     Measuring Marker Density (SNPs from Clone Overlaps)

• testing and parameter estimation on the genome scale requires:
      • statistical resolution (a large number of data points)

      • variety of length scales

      • well-defined ascertainment conditions

• marker density was calculated from SNP candidates in BAC overlaps in the
genome sequence




     • 23% of genome sequence
     • 500,000 high-scoring candidate SNPs
     • long continuous regions (4-50 kb)
     • uniform, well-characterized properties
    Modeling Random Genetic Drift – the Coalescent
• simulation process
• describes statistical properties of variation in DNA samples
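• produces possible genealogies backwards in time, towards the most recent common ancestor (MRCA), and generates mutations along them

[Figure: genealogy of sampled sequences, from present back to past]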

• neutral mutation (each site is equally mutable)
• infinite sites model (no recurrent mutations)
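
For a pair of sequences in a population of constant size, the coalescent is simple enough to simulate directly. The sketch below makes the usual assumptions: time is measured in units of 2N generations, so the pair coalesces after an exponentially distributed time with mean 1, and neutral infinite-sites mutations then fall on the genealogy as a Poisson process with rate θL per unit time (θ = 4Nµ per site). It compares the simulation with the resulting geometric distribution of pairwise differences. The function names, the seed, and the use of the slide's genome-average diversity as θ are illustrative choices.

```python
import random

def pairwise_differences_pmf(k, theta, length):
    """Probability of k differences between two sequences of the given length
    in a constant-size population (geometric distribution)."""
    m = theta * length
    return (1.0 / (1.0 + m)) * (m / (1.0 + m)) ** k

def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_pairwise_differences(theta, length, n_reps, seed=0):
    """Simulate the number of differences between two sequences n_reps times."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_reps):
        t_coal = rng.expovariate(1.0)  # time to the MRCA, in units of 2N generations
        results.append(sample_poisson(theta * length * t_coal, rng))
    return results

if __name__ == "__main__":
    theta, length = 6.51e-4, 12_000   # genome-average diversity, 12 kb alignment
    sims = simulate_pairwise_differences(theta, length, n_reps=20_000)
    for k in range(5):
        observed = sims.count(k) / len(sims)
        expected = pairwise_differences_pmf(k, theta, length)
        print(f"k = {k}: simulated {observed:.4f}, analytic {expected:.4f}")
```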
                          Adding Population History

• true history is substituted by simple models to describe the effects of a very complex past
• succession of different (effective) population sizes in the past that change in a step-wise fashion

[Figure: step-wise population history from present to past: size N1 for duration T1, then size N2 for duration T2, then size N3]

• distribution of SNP density can be calculated directly, for arbitrarily complex, step-wise population history

• example for a 3-epoch            P k  
                                               1  1 L 
                                                               
                                                                          k
                                                                              
                                                                              
                                                                                                      k               i 
                                                                               1  e 11L 1 1   11L!1   
                                            1  1 L  1  1 L 
population history
                                                                                                               i
                                                                                                i 1                 
                                                        k               k  1 2 L  1  1  
                                                                                              
                                                                                                  i
                                                                                                                                                          
[Equation (continued): closed-form Pr(k) for a multi-epoch history, a combination of exponential terms and finite sums over i with i! in the denominator; depends on k and L]



• direct calculation permits fast generation of model densities
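
A minimal sketch of such a direct calculation, assuming the simplest non-stationary case: two sequences, no recombination, and a 2-epoch history (size N1 for the most recent T1 generations, N2 before that). The function name and the mutation rate `mu` are illustrative assumptions, not values from the slides. Pr(k) is obtained by integrating the Poisson probability of k mutations over the piecewise-exponential coalescence time of the pair.

```python
import math

def two_epoch_density(k_max, L, mu, N1, N2, T1):
    """Pr(k) for k = 0..k_max differences between two sequences of length L.

    Assumptions (hypothetical, for illustration): no recombination, per-site
    mutation rate mu per generation, diploid size N1 for the most recent
    T1 generations and N2 before that.
    """
    m = 2.0 * mu * L            # mutations per generation on the two lineages
    lam1 = 1.0 / (2.0 * N1)     # pairwise coalescence rate in the recent epoch
    lam2 = 1.0 / (2.0 * N2)     # pairwise coalescence rate in the older epoch
    a1 = (lam1 + m) * T1
    a2 = (lam2 + m) * T1
    probs = []
    term1 = term2 = 1.0         # a^i / i! for i = 0
    sum1 = sum2 = 0.0           # running sums of a^i / i! up to i = k
    for k in range(k_max + 1):
        sum1 += term1
        sum2 += term2
        geo1 = lam1 / (lam1 + m) * (m / (lam1 + m)) ** k
        geo2 = lam2 / (lam2 + m) * (m / (lam2 + m)) ** k
        # pair coalesces within the recent epoch
        recent = geo1 * (1.0 - math.exp(-a1) * sum1)
        # pair survives T1 generations, then coalesces in the older epoch
        older = math.exp(-a1) * geo2 * sum2
        probs.append(recent + older)
        term1 *= a1 / (k + 1)
        term2 *= a2 / (k + 1)
    return probs

# Example with a collapse-type history; mu = 2e-8 is an assumed value.
density = two_epoch_density(k_max=30, L=12000, mu=2e-8, N1=2500, N2=9200, T1=500)
```

With N1 = N2 the expression reduces to the geometric distribution of the stationary model, which is a convenient sanity check.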
                            Adding Recombination
• recombination shuffles ancestral DNA – different segments of the samples can
have very different genealogies

                          • no recombination: completely correlated mutation history
                          • free recombination: uncorrelated mutation history
                          • partial recombination: partly correlated mutation history
[Figure: Pr(k) versus k differences for 4 kb and 12 kb regions; curves for No Recombination, Free Recombination, and Partial Recombination]

                          • there is significant correlation between history of neighboring loci
                          • effect of recombination increases with DNA length


• expected marker density distribution is generated by simulation
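
A rough simulation sketch of the two limiting cases, assuming a stationary population and a pair of sequences. With no recombination the whole region shares one genealogy, so its depth (exponentially distributed) scales the mutation count of the entire region. With free recombination every site has its own genealogy, so the count is approximately Poisson around the mean. The function names, the parameter `theta_L` (expected differences per region) and the Poisson approximation are assumptions of this sketch; partial recombination needs a full coalescent-with-recombination simulator and is not covered here.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with mean lam (Knuth's method, fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_pairwise_differences(theta_L, n_regions, free_recombination, seed=0):
    """Simulate k, the number of differences between two sequences, per region."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_regions):
        if free_recombination:
            # independent genealogies per site: tree depths average out
            lam = theta_L
        else:
            # one shared genealogy: depth ~ Exp(1) in units of 2N generations
            lam = theta_L * rng.expovariate(1.0)
        counts.append(poisson(rng, lam))
    return counts

no_rec = simulate_pairwise_differences(3.0, 20000, free_recombination=False)
free_rec = simulate_pairwise_differences(3.0, 20000, free_recombination=True)
```

Both cases have the same mean, but the no-recombination counts are far more dispersed, which is why the density curves differ and why the difference grows with region length.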
               Fitting Marker Density with a 2-Epoch History Model

[Figure: Observed versus model Pr(k), k differences in 12 kb regions, four panels]

panel                                history      recombination   parameters                    fit
Stationary, No recombination         stationary   r = 0           N1=7,800                      -1,381
Stationary, Partial recombination    stationary   r = 10^-8       N1=7,800                      -1,293
Bottleneck, No recombination         two-stage    r = 0           N2=9,200, N1=2,500, T1=500    -1,194
Bottleneck, Partial recombination    two-stage    r = 10^-8       N2=8,500, N1=2,500, T1=900    -978
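
A sketch of how a fit score of this kind could be computed, assuming the quoted "fit" values are log-likelihoods (the slides do not say so). It scores a histogram of k values against the stationary, no-recombination model, for which Pr(k) is geometric, and scans a small grid of the scaled mutation parameter. All names and the synthetic counts are hypothetical.

```python
import math

def stationary_density(theta_L, k_max):
    """Geometric Pr(k), k = 0..k_max, for two sequences, constant size, no recombination."""
    q = theta_L / (1.0 + theta_L)
    return [(1.0 - q) * q ** k for k in range(k_max + 1)]

def log_likelihood(counts, probs):
    """Multinomial log-likelihood of a count histogram, up to an additive constant."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def scan_theta(counts, grid):
    """Return (best log-likelihood, best theta_L) over a grid of candidate values."""
    k_max = len(counts) - 1
    scored = [(log_likelihood(counts, stationary_density(t, k_max)), t) for t in grid]
    return max(scored)

# Synthetic counts generated from the model itself, not data from the slides.
observed = [round(10000 * p) for p in stationary_density(3.0, 30)]
best_fit, best_theta = scan_theta(observed, [1.0, 2.0, 3.0, 4.0, 5.0])
```

The same scoring function works for any model density, so richer histories only require swapping out `stationary_density`.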
                                      Observable Distribution #2: Allele Frequency Spectrum

[Figure: allele frequency spectrum, P(i) versus i alleles; schematic of SNPs carried by i = 1, 2, 3 and 4 chromosomes in a sample of n = 11]

probability distribution of the relative allele counts of SNPs with a given frequency in the population,
e.g. the likelihood of seeing the allele ‘A’ in 3 individuals out of a sample of 11 chromosomes
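
The example in the caption can be sketched as a binomial sampling probability, assuming chromosomes are drawn independently from a large population in which the allele has frequency p. The function name and the value of p are illustrative assumptions.

```python
from math import comb

def sample_count_probability(i, n, p):
    """Probability of seeing an allele of population frequency p on i of n chromosomes."""
    return comb(n, i) * p ** i * (1.0 - p) ** (n - i)

# Allele 'A' at an assumed population frequency of 0.5, seen on 3 of 11 chromosomes
prob = sample_count_probability(3, 11, 0.5)

# Full distribution of sample counts for the same allele
distribution = [sample_count_probability(i, 11, 0.5) for i in range(12)]
```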



[Figure: Fraction of Verified SNPs by Minor Allele Frequency bin (0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5)]

• usually described as ‘essentially flat’
                       Measuring Allele Frequency
1. direct allele frequency estimates from pooled sequencing (Pui Kwok)
                  • majority of overlap SNPs were characterized by pooled sequencing
                  • shape seen in other data sets – noise or signal?

[Figure: Fraction of Verified SNPs by Minor Allele Frequency bin (0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.5)]




2. indirect allele frequency estimates from genotype data (Orchid, TSC)
                            • SNPs discovered in 2 chromosomes
                            • high resolution genotypes (42 Coriell Caucasian samples)
[Figure: P(i) versus i alleles, estimated from genotype data]
                        Modeling Allele Frequency



• allele frequency spectrum for stationary history is known (Fu 1995)



• allele frequency spectrum can be calculated directly for step-wise histories
consisting of any number of ‘epochs’: (N1, T1), (N2, T2), (N3, T3), ...


• example for 3-epoch history

[Diagram: 3-epoch history from past to present: size N3, then N2 for T2 generations, then N1 for the most recent T1 generations]

$$P_n(i) \propto \frac{1}{i\binom{n-1}{i}} \sum_{k=2}^{n} k(k-1)\binom{n-k}{i-1}\,E(t_{k,k-1})$$

normalized so that the P_n(i) sum to 1 over i = 1, ..., n-1.

[Equation: E(t_{k,k-1}) for the 3-epoch history, a closed-form combination of exponential terms in a_i T1, a_k T1 and a_k T2, weighted by ratios of the epoch-specific rate constants a_m, a_i, a_k]
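
A sketch of how the spectrum follows from the expected coalescence times, assuming the standard coalescent identity in which the weight of count i is the sum over k of k(k-1) C(n-k, i-1) E(t_k), divided by i C(n-1, i). Only the stationary times are implemented here; the multi-epoch expressions for E(t_k) are not reproduced, and all function names are hypothetical. For the stationary case the result is the 1/i spectrum cited above (Fu 1995).

```python
from math import comb

def spectrum_from_times(expected_times):
    """Normalized P_n(i), i = 1..n-1, from a dict k -> E[t_k] for k = 2..n."""
    n = max(expected_times)
    weights = []
    for i in range(1, n):
        total = sum(
            k * (k - 1) * comb(n - k, i - 1) * expected_times[k]
            for k in range(2, n + 1)
        )
        weights.append(total / (i * comb(n - 1, i)))
    norm = sum(weights)
    return [w / norm for w in weights]

def stationary_times(n):
    """E[t_k] for constant population size, in units of 2N generations."""
    return {k: 2.0 / (k * (k - 1)) for k in range(2, n + 1)}

# Stationary spectrum for a sample of n = 11 chromosomes
spectrum = spectrum_from_times(stationary_times(11))
```

Replacing `stationary_times` with epoch-dependent expected times is all that is needed to obtain expansion, collapse or bottleneck spectra from the same function.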
                                                                                                                                                  
                                              Expected Distributions
[Figure: expected marker density, Pr(k) versus k differences, and expected allele frequency spectrum, P(i) versus i alleles, for three histories: stationary (constant N), expansion (N2 in the past, larger N1 at present), and collapse (N2 in the past, smaller N1 at present)]
                           2-Epoch Model Showed Population Collapse

[Figure: marker density fit for 4 kb, 8 kb, 12 kb and 16 kb overlaps; inset: history with N2 in the past and N1 at present]

• very good fit in short and long overlaps
• collapse history


• this contradicted an expected signal of population growth (Kruglyak 1999)
• confirmed by experimental data (Reich et al. 2001)


• is there a sign of population recovery in the data (3-epoch model)?
• what signal do we see in the allele frequency spectrum?
                                               Observed Allele Frequency Spectrum

[Figure: observed allele frequency spectrum, P(i) versus i alleles, shown with model spectra labeled (collapse?), (expansion?) and “bottleneck” history]
                                                Fitting Allele Frequency Data

• started exploring parameter space for a three-epoch population model
[Figure: model spectrum for parameters N1, N2, N3, T1, T2 compared with the data spectrum, P(i) versus i alleles, to give a goodness of fit]

• various parameter sets produce similar degrees of fit
• best-fitting sets represent bottleneck histories

[Figure: example fit of P(i) for i = 1 to 20, with history from past to present: N3=8,000; N2=3,000 for T2=2,000 gen.; N1=15,000 for T1=800 gen.]
        Are Marker Density and Allele Frequency Consistent?

• can a bottleneck history simultaneously explain density and frequency?

• perfect match may only be realistic if underlying samples represent the
same history (which is not our case), and our cartoon models are adequate
• there are examples of parameter sets that provide visually acceptable dual fit
[Figure: dual fit from one parameter set: allele frequency spectrum (i = 1 to 20) and marker density for 4 kb, 8 kb, 12 kb and 16 kb overlaps]




• we are currently exploring the (very large) parameter space
• a collapse or bottleneck history seems to explain the observed data
                     Predictions: Extent of Linkage Disequilibrium
[Figure: D2 (0 to 0.8) vs. recombination fraction (1.0E-06 to 1.0E-02); curves: Kruglyak 1999 and Best-Fitting Model]

• far greater LD than previously expected
• D2 = 0.1 at ~ 100 kb
• D2 = 0.4 at over 10 kb

[Figure: extent of LD [kb] (0 to 300) vs. recombination fraction, cM per Mb (0.5 to 5); curves for D2 = 0.1, 0.2, 0.3 and 0.4]

• extent of LD depends on local recombination rates
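The slides plot an LD statistic labelled D2 without defining it. As a minimal sketch of how pairwise LD between two biallelic markers is commonly quantified, the function below computes the coefficient D and the squared correlation r2 from haplotype and allele frequencies. Treating r2 as a stand-in for the plotted statistic is an assumption, and the function and parameter names are illustrative.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1
          and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2

    r2 is used here as an assumed stand-in for the "D2" axis of the
    slides; the original statistic may be defined differently.
    """
    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


# Complete association: allele A always travels with allele B.
d_full, r2_full = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)

# Independence: the haplotype frequency is the product of allele frequencies.
d_none, r2_none = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)
```

With complete association r2 reaches 1, and with independent loci both D and r2 are 0.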
                                    Predictions: Haplotype Structure



• block: major haplotypes (> 10% frequency) constitute > 80% of all haplotypes

[Diagram: physical marker locations, block, physical block size]

[Figure: size distribution of the longest block in a simulated 50 kb DNA segment under the bottleneck model (mean: 14.5 kb); fraction of total (0 to 0.3) vs. block length [kb] (0 to 48)]
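The block criterion on this slide (major haplotypes, each above 10% frequency, together making up more than 80% of all haplotypes) can be sketched as a simple test over the haplotypes observed across a candidate set of markers. This is an illustration under stated assumptions, not the method used to produce the figure: the function name, the string encoding of haplotypes and the strict inequalities are all assumed.

```python
from collections import Counter


def is_block(haplotypes, major_freq=0.10, coverage=0.80):
    """Return True if the sampled haplotypes satisfy the block criterion.

    haplotypes: one string per sampled chromosome, giving its alleles
                across the candidate markers (hypothetical encoding)
    major_freq: a haplotype is "major" if its frequency exceeds this
    coverage:   major haplotypes must jointly exceed this fraction
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    counts = Counter(haplotypes)
    # Number of chromosomes that carry a major haplotype.
    major = sum(c for c in counts.values() if c / n > major_freq)
    return major / n > coverage


# Two major haplotypes (60% and 30%) cover 90% of the sample.
sample = ["AAT"] * 6 + ["GCT"] * 3 + ["GAT"] * 1
block = is_block(sample)

# Ten distinct haplotypes at 10% each: none is major.
diverse = [str(i) for i in range(10)]
no_block = is_block(diverse)
```

Under this definition, a search for the longest block would extend a window of adjacent markers until the test fails.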
                       Sampling Differences in Allele Frequency

[Figure: Orchid – ‘Caucasian’ spectrum, P(i) (0 to 0.08) vs. i alleles; annotation: severe bottleneck]

[Figure: Orchid – ‘African American’ spectrum, P(i) (0 to 0.08) vs. i alleles; annotation: mild bottleneck]
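The P(i) spectra shown here give, for each allele count i, the fraction of polymorphic sites at which one allele is seen i times in the sample. A minimal sketch of how such a spectrum could be tabulated from sampled haplotypes follows. The 0/1 encoding, the choice to count the allele coded 1, and the function name are assumptions made for illustration.

```python
from collections import Counter


def allele_frequency_spectrum(haplotypes):
    """Return {i: P(i)} over polymorphic sites.

    haplotypes: list of equal-length lists of 0/1 alleles, one list per
                sampled chromosome (hypothetical encoding)
    i:          number of chromosomes carrying the allele coded 1
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    counts = Counter()
    for j in range(n_sites):
        i = sum(h[j] for h in haplotypes)
        if 0 < i < n:  # skip sites that are monomorphic in the sample
            counts[i] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {i: counts[i] / total for i in range(1, n)}


# Four chromosomes, four sites; the last site is monomorphic.
toy = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
spectrum = allele_frequency_spectrum(toy)
```

In the toy sample two sites have count 1 and one site has count 2, so most of the probability mass sits at i = 1.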
                                                       Consequences of Differential History

differences in population history predict (modest) differences in:

• extent of LD
• haplotype block size

[Figure: extent of LD, D2 (0 to 0.8) vs. recombination fraction (1.0E-06 to 1.0E-02); curves: Orchid-Caucasian and Orchid-AfrAm]

[Figure: haplotype block size, fraction of total vs. block length [kb] (0 to 48); two panels: severe bottleneck and mild bottleneck]
         Consequences of Haplotype Sampling Strategy

what if ascertainment is conditioned on population frequency of marker (i.e. using only common alleles in building the haplotype)?

[Diagram: block, new block]

[Figure: largest block size as a function of minimum minor allele frequency; average greatest block length [kb] (0 to 40) vs. minimum minor allele frequency [%] (0 to 20); series: mild bottleneck and severe bottleneck]
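Conditioning ascertainment on marker frequency amounts to discarding markers whose minor allele frequency falls below a threshold before haplotypes are built. The sketch below shows one way this filtering step could look. The 0/1 haplotype encoding, the function name and the inclusive threshold are assumptions, not details taken from the slides.

```python
def filter_by_maf(haplotypes, min_maf):
    """Keep only markers whose minor allele frequency is >= min_maf.

    haplotypes: list of equal-length lists of 0/1 alleles, one list per
                sampled chromosome (hypothetical encoding)
    min_maf:    minimum minor allele frequency as a fraction (0.10 = 10%)

    Returns (kept_marker_indices, filtered_haplotypes).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    keep = []
    for j in range(n_sites):
        ones = sum(h[j] for h in haplotypes)
        maf = min(ones, n - ones) / n  # frequency of the rarer allele
        if maf >= min_maf:
            keep.append(j)
    filtered = [[h[j] for j in keep] for h in haplotypes]
    return keep, filtered


# Ten chromosomes, three markers: marker 0 is rare (MAF 10%),
# markers 1 and 2 are common (MAF 50%).
sample = [
    [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1],
    [0, 0, 0], [0, 1, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0],
]
kept, common_only = filter_by_maf(sample, min_maf=0.20)
```

Dropping rare markers leaves fewer distinct haplotypes across a region, which is consistent with the slide's framing of block size as a function of the minimum minor allele frequency.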
                                                                                                                                                                                                                                                                     Summary

• marker density and allele frequency spectrum respond
                                                                                                                                                                                                                                                                                                                  0.2                                                     0.2                                                                0.2

                                                                                                                                                                                                                                                                                                                0.15                                                     0.15                                                               0.15




                                                                                                                                                                                                                                                                                                                                                                                                                                   Pr(k)
                                                                                                                                                                                                                                                                                                        Pr(k)




                                                                                                                                                                                                                                                                                                                                                                Pr(k)
                                                                                                                                                                                                                                                                                                                  0.1                                                     0.1                                                                0.1




to population history with signals we can detect                                                                                                                                                                                                                                                                0.05

                                                                                                                                                                                                                                                                                                                       0
                                                                                                                                                                                                                                                                                                                           0   5     10      15      20    25
                                                                                                                                                                                                                                                                                                                                                                         0.05

                                                                                                                                                                                                                                                                                                                                                                               0
                                                                                                                                                                                                                                                                                                                                                                                   0       5    10       15
                                                                                                                                                                                                                                                                                                                                                                                               k differences
                                                                                                                                                                                                                                                                                                                                                                                                                20    25
                                                                                                                                                                                                                                                                                                                                                                                                                                            0.05

                                                                                                                                                                                                                                                                                                                                                                                                                                                  0
                                                                                                                                                                                                                                                                                                                                                                                                                                                      0   5        10
                                                                                                                                                                                                                                                                                                                                                                                                                                                               k differences
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            15       20    25

                                                                                                                                                                                                                                                                                                                                    k differences




                                                                                                                                                                                                                                                                                                                 0.1                                                     0.1                                                                0.1


                                                                                                                                                                                                                                                                                                                0.08                                                    0.08                                                               0.08


                                                                                                                                                                                                                                                                                                                0.06                                                    0.06                                                               0.06




                                                                                                                                                                                                                                                                                                                                                                                                                                   P(i)
                                                                                                                                                                                                                                                                                                         P(i)




                                                                                                                                                                                                                                                                                                                                                                P(i)
                                                                                                                                                                                                                                                                                                                0.04                                                                                                                       0.04
                                                                                                                                                                                                                                                                                                                                                                        0.04


                                                                                                                                                                                                                                                                                                                0.02                                                                                                                       0.02
                                                                                                                                                                                                                                                                                                                                                                        0.02


                                                                                                                                                                                                                                                                                                                  0                                                                                                                          0
                                                                                                                                                                                                                                                                                                                                                                          0
                                                                                                                                                                                                                                                                                                                           1   5      9         13    17                                                                                          1       5        9            13    17
                                                                                                                                                                                                                                                                                                                                                                               1           5     9         13    17
                                                                                                                                                                                                                                                                                                                                      i alleles                                                  i alleles                                                          i alleles




• indications are that a common set of parameters will explain
both of these observable distributions

[Figure: the two observable distributions (axis labels lost; one axis spans 4 kb to 16 kb)]




• there is a signal of difference between different
sets of samples

[Figure: two panels of P(i) versus i alleles]




• both the commonalities and the differences have important
consequences for sample design in both individual association
studies and public resources such as the proposed haplotype map

[Figure: D2 versus recombination fraction, for the Orchid-Caucasian and Orchid-AfrAm samples]
[Figure: average greatest block length [kb] versus minimum minor allele frequency [%], for mild and severe bottlenecks]

                       Credits


NCBI                       Washington University
Greg Schuler               Ray Yeh
Richa Agarwala             Ruth Davenport
Eva Czabarka               Ray Miller
John Spouge                Patty Taillon-Miller
Steve Sherry               Pui Kwok

Johns Hopkins              University of Utah
David Cutler               Steve Wooding
Aravinda Chakravarti       Henry Harpending

CSHL
Ravi Sachidanandam
Lincoln Stein

								