SNP Haplotype Estimation From Pooled DNA Samples

          Yaning Yang

    Lab of Statistical Genetics
      Rockefeller University

   (yyang@linkage.Rockefeller.edu)
I. Background
Genotype-phenotype
 Genotype-phenotype association: the central objective of genetic studies.
 Completion of the human genome sequence is the foundation (Collins et al. 2003).
      Geno-phenotype

 Genotype
     Internally coded, inheritable information
     Needs to be polymorphic (to show variation)
     Interaction with environmental factors
 Phenotype
     Outward, physical manifestation of the organism
     Disease status, survival times, quantitative traits (QTL)
 Complex traits/diseases, simple traits/diseases
  Simple (Mendelian) Traits

 Single gene, simple mode of inheritance.
 Huntington disease, cystic fibrosis.
 Method: Linkage analysis
     Co-segregation of marker and disease within
      pedigrees; based on recombination events.
     More than 400 simple diseases have been
      genetically mapped.
 Complex Traits
 Polygenic + environmental factors = Complex
  epistasis (interaction) & hard to dissect.
     Polygenic: multiple genes each with small to
      moderate effect.
     Environmental factors: race, gender, diet etc.
 Cancer, diabetes, Alzheimer’s disease (AD) etc.
 Methods: association analysis
     population-based
     family-based
    Polymorphism
   A difference in DNA sequence of
    nucleotide among individuals or
    populations.
   P(minor allele)>0.1, say.
   A genetic mutation is a polymorphism.
   Marker: locus-specific polymorphism
        Like a   net (chaining) in approximation
        A significant marker may itself be, or lie close to, the
         causal genetic variant
SNP: Single Nucleotide
Polymorphism
                 The simplest and most
                  common genetic
                  polymorphism
                 Simple: a single base
                  mutation in DNA
                 Common: ~90% of all
                  human DNA variations
                 Abundance: ~ 0.1%
                 Biallelic (binary)
                 ~2 million SNPs reported
  Genotyping
              SNP1 (locus 1)     SNP2 (locus 2)
Diploid       ------G----------------A------   haplotype
              ------T----------------C------   haplotype

Genotype:          G/T                A/C
At each locus there are two possible alleles, so the genotype at locus 1,
for example, can be G/G, T/T or G/T.
        Association
   Population-based
       Epidemiological methods: Case-control; cohort design
       Stratification: control over confounding factors
       Powerful, but can easily produce spurious associations due
        to population admixture (heterogeneity)
   Family-based
       TDT test (McNemar’s test for matched pairs)
       No need of stratification, true associations
       Sampling is costly
Association
   Identification of causal genetic variants.
   Understanding their functions and
    disease etiology.
   Help for disease prevention,
    diagnostics, drug development
    (e.g., personalized medicine).
    Allele Association (marginal)
1. Disease-allele association
               G      T
   case       32     38
   control    28     30          $\chi^2 = 0.012$, $p = 0.912$

2. Disease-genotype association
             G/G    G/T    T/T
   case       12      8     15
   control     6     16      7    $\chi^2 = 7.075$, $p = 0.029$
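
The two test statistics above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the author's code: the function names are hypothetical, and the use of Yates' continuity correction for the 2 x 2 allele table is an inference from the numbers (it is what reproduces 0.012), not something the slide states.

```python
from math import erfc, exp, sqrt

def chi_square(table, yates=False):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            diff = abs(obs - expected)
            if yates:  # continuity correction (assumed here), for 2 x 2 tables only
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

def p_value(stat, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and 2 only."""
    if df == 1:
        return erfc(sqrt(stat / 2.0))
    if df == 2:
        return exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or 2")

allele = [[32, 38], [28, 30]]            # rows: case, control; columns: G, T
genotype = [[12, 8, 15], [6, 16, 7]]     # columns: G/G, G/T, T/T

chi_allele = chi_square(allele, yates=True)
chi_geno = chi_square(genotype)
p_allele = p_value(chi_allele, df=1)
p_geno = p_value(chi_geno, df=2)
print(round(chi_allele, 3), round(chi_geno, 3))
```

The allele counts are just the genotype counts collapsed (for cases, 2 x 12 + 8 = 32 copies of G), so the same sample shows no allele-level signal but a genotype-level one.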
    Haplotype Association (joint)
   Most genome screens test one locus at a time.
   The dependence structure (linkage disequilibrium)
    needs to be considered.
   Example: haplotypes for a case and a control

   Case             Control
   ---A-----B---    ---A-----b---
   ---a-----b---    ---a-----B---

   Haplotype        case   control
   ---A-----B---      1       0
   ---a-----b---      1       0
   ---A-----b---      0       1
   ---a-----B---      0       1
     Haplotype

   The total number of haplotypes is $2^m$ for m SNPs.
   For example (m=3, biallelic at each
    position: A/a, B/b, C/c; coding a=b=c=0, A=B=C=1)
    haplotype   H1  H2  H3  H4  H5  H6  H7  H8
    alleles     abc Abc aBc ABc abC AbC aBC ABC
    binary      000 100 010 110 001 101 011 111
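
As an illustration, the H1..H8 ordering in this table (first SNP varying fastest) can be generated programmatically. The helper names below are hypothetical, chosen only for this sketch.

```python
def enumerate_haplotypes(m):
    """Return the 2**m binary haplotypes as strings, first SNP varying fastest."""
    return ["".join(str((k >> j) & 1) for j in range(m)) for k in range(2 ** m)]

def to_letters(hap, letters="ABC"):
    """Map 1 to the upper-case allele and 0 to the lower-case allele."""
    return "".join(L if bit == "1" else L.lower() for L, bit in zip(letters, hap))

haps = enumerate_haplotypes(3)
labels = [to_letters(h) for h in haps]
for name, hap, label in zip(range(1, 9), haps, labels):
    print(f"H{name}", hap, label)
```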
Why Haplotype?
   LD:
       Alleles in linkage disequilibrium (LD)
        are tightly linked and tend to
        co-segregate.
       LD plays a fundamental role in genetic
        mapping of complex diseases.
       Haplotypes preserve LD information.
Why Haplotype?
   Haplotype:
       A haplotype is a binary sequence along
        one chromosome.
       Haplotype has a block-wise structure
        separated by hot spots.
       Within each block, recombination is
        rare due to tight linkage and only very
        few haplotypes really occur.
    A Brief Summary
   Genotype-phenotype association analysis for
    complex disease
   Genetic variant/polymorphism/marker
        Genetic variation – human variation
   SNP: simple, abundant genetic
    variant/marker
   LD: dependence of markers
   Haplotype analysis = joint distribution
II. Haplotype Estimation
From Pooled DNA
Introduction

  Key Words:
  Efficiency, EM Algorithm, Haplotype
  Frequency, LD Coefficients, Pooling,
  Variance estimates.
    Estimating haplotype frequencies from
    individual genotypes

   Individual samples are genotyped.
   No phase information.
   Likelihood analysis (Excoffier &
    Slatkin 1995), but no variance estimate.
   Other methods: Clark’s parsimony
    method (Clark 1990), Bayesian MCMC…
  Genotyping individual DNA
 Diploid           ---A-----B---    haplotype
                   ---a------b---   haplotype


Genotyping          A/a B/b         observed genotypes
                                    (phase information is lost)

Reconstruct        ---A-----B---
                   ---a------b---   haplotype configurations
              or
                   ---A-----b---
                   ---a------B---
    Pooling: Reduce Genotyping Cost

   Unrelated individual samples are mixed, more
    ambiguities in recovering haplotypes.
   No individual information and no phase
    information.
   Efficient in allele frequency estimation, but is it
    efficient in estimating haplotype frequency?
    Wang et al. (2003), Ito et al. (2003).
  Genotyping pooled DNA
             ----A------B----
Pooling      ----a-------b----   diploid for individual 1
             ----A------b----
             ----a-------B----   diploid for individual 2


Genotyping    AAaa BBbb            observed pool-genotypes


Hap config   ---A-----B--- or ---A----B--- or ---A----b---
             ---A-----b---    ---A----B---    ---A----b---
             ---a-----B---    ---a----b---     ---a----B---
             ---a-----b---     ---a----b---    ---a----B---
Pool-genotype of a K-pool
   Pool-genotype: X = # of allele 1 at each SNP. E.g. (K=2):
    SNP        1    2    3    4    5
               1    0    1    1    0     Individual 1
               1    1    0    0    0
                                            +
               0    0    1    0    1     Individual 2
               1    1    0    0    1
               3    2    2    1    2     = Pool-genotypes
   An individual can be viewed as a pool of two
    independent chromosomes.
    We will say a chromosome is a ½-pool
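
The column-wise sum that defines a pool-genotype can be sketched as follows, using the two individuals from the example above. The function name is hypothetical.

```python
def pool_genotype(chromosomes):
    """Sum 0/1 haplotype rows column-wise to get allele-1 counts per SNP."""
    m = len(chromosomes[0])
    if any(len(c) != m for c in chromosomes):
        raise ValueError("all chromosomes must cover the same SNPs")
    return [sum(col) for col in zip(*chromosomes)]

individual_1 = [[1, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
individual_2 = [[0, 0, 1, 0, 1], [1, 1, 0, 0, 1]]

# A K=2 pool holds 2K = 4 chromosomes; which chromosome carried which allele is lost.
x = pool_genotype(individual_1 + individual_2)
print(x)
```

Passing a single chromosome returns that chromosome unchanged, which is the ½-pool case mentioned above.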
Missing values

   Pool-genotype at m SNP loci:
      $X = (X_1, X_2, \ldots, X_m)$, $X_j = 0, 1, \ldots, 2K$.
   Completely missing: no information.

   Partially missing: partial information,
    e.g., only know $X_j \ge 1$.
Statistical Methods

 Key Words:
 Asymptotic variance, EM, MLE,
 missing data, relative efficiency.
Notations
   For m SNPs, each position can take two
    possible alleles. Denote them by 1 & 0.
   In total, $2^m$ possible haplotypes.
   Haplotype frequencies: $h = (h_1, \ldots, h_{2^m})$


   m=3:
             H1   H2   H3   H4   H5   H6   H7   H8
     SNP 1   0    1    0    1    0    1    0    1
         2   0    0    1    1    0    0    1    1
         3   0    0    0    0    1    1    1    1
    Maximum Likelihood Estimate
   Assumptions: HWE, random mating
   Likelihood:
      $L(X, h) = \prod_{i=1}^{n} \Big[ \sum_{J \in \mathcal{J}_i} h(J) \Big]$
    where
      $\mathcal{J}_i = \{ J = (H_{j_1}, H_{j_2}, \ldots, H_{j_{2K}}) : J \text{ consistent with } x_i \}$,
      $h(J) = h_{j_1} h_{j_2} \cdots h_{j_{2K}}$.
   When K=1/2, multinomial!
    An Example
   m=2, K=2, observation X=(2,1). Consistent
    haplotype configurations are

            1  0    h2                   1  1    h4
     J1 =   1  0    h2       or   J2 =   1  0    h2
            0  1    h3                   0  0    h1
            0  0    h1                   0  0    h1

    (J1 and J2 are shown here in the order of the likelihood terms' complement: J2 contributes $h_1^2 h_2 h_4$, J1 contributes $h_1 h_2^2 h_3$.)

    Likelihood = $2 h_1^2 h_2 h_4 + 2 h_1 h_2^2 h_3$
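EM algorithm
   Initial value: $h_k = h_k^{(0)}$, $k = 1, 2, \ldots, 2^m$
   E-step: for each pool i and each configuration J in $\mathcal{J}_i$ (the configurations consistent with the i-th observation),
      $p_J^{(i)} = \Pr(J \mid \text{obs}, h) = \dfrac{h(J)}{\sum_{J' \in \mathcal{J}_i} h(J')}$
   M-step:
      $h_k^{new} = \dfrac{1}{2nK} \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} c_J(k)\, p_J^{(i)}$,
    where $c_J(k)$ = number of copies of haplotype k in configuration J
   Convergence check: all $|h_k^{new} - h_k| < \varepsilon$?
      NO: set $h_k = h_k^{new}$ and return to the E-step
      YES: END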
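
A compact sketch of this EM loop in Python follows. It is an illustration under stated assumptions rather than the author's implementation: function names are hypothetical, all pools have the same size, there are no missing values, and configurations are enumerated exhaustively (feasible only for small m and K). Configurations are stored as unordered count vectors with multinomial weights, which is equivalent to summing h(J) over ordered tuples.

```python
from itertools import combinations_with_replacement
from math import factorial

def haplotypes(m):
    """H_1..H_{2^m} as 0/1 tuples, SNP 1 varying fastest."""
    return [tuple((k >> j) & 1 for j in range(m)) for k in range(2 ** m)]

def consistent_configs(x, n_chrom, haps):
    """Count vectors c_J, with multinomial weights, of configurations matching x."""
    out = []
    for combo in combinations_with_replacement(range(len(haps)), n_chrom):
        geno = tuple(sum(haps[k][j] for k in combo) for j in range(len(x)))
        if geno == tuple(x):
            counts = [combo.count(k) for k in range(len(haps))]
            weight = factorial(n_chrom)
            for c in counts:
                weight //= factorial(c)
            out.append((counts, weight))
    return out

def config_prob(counts, weight, h):
    """weight * prod_k h_k^{c_J(k)}; note 0.0 ** 0 == 1.0 in Python."""
    p = float(weight)
    for k, c in enumerate(counts):
        p *= h[k] ** c
    return p

def em_haplotype_freqs(pools, n_chrom, m, tol=1e-10, max_iter=10000):
    """EM estimate of haplotype frequencies; n_chrom = 2K chromosomes per pool."""
    haps = haplotypes(m)
    configs = [consistent_configs(x, n_chrom, haps) for x in pools]
    h = [1.0 / len(haps)] * len(haps)            # uniform starting value
    for _ in range(max_iter):
        expected = [0.0] * len(haps)
        for cfgs in configs:
            probs = [config_prob(c, w, h) for c, w in cfgs]   # E-step
            total = sum(probs)
            for (counts, _), p in zip(cfgs, probs):
                for k, c in enumerate(counts):
                    expected[k] += c * p / total
        new_h = [e / (n_chrom * len(pools)) for e in expected]  # M-step
        converged = max(abs(a - b) for a, b in zip(new_h, h)) < tol
        h = new_h
        if converged:
            break
    return h

# Individuals (K = 1) whose phase is unambiguous: 11/11, 00/00 and 10/00.
h_hat = em_haplotype_freqs([(2, 2), (0, 0), (1, 0)], n_chrom=2, m=2)
print(h_hat)
```

With unambiguous data the estimate is just the observed haplotype proportions; the interesting case is when several configurations are consistent with a pool, for example the double heterozygote (1, 1).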
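     Variance Estimate
Variance matrix for estimated h:
      $\hat\Sigma_K = W \big( W' I_X(\hat h)\, W \big)^{-1} W'$, for $W = \begin{pmatrix} I_{2^m-1} \\ -\mathbf{1}' \end{pmatrix}$,
and the (k,l) element of the matrix $I_X(h)$ is given by ($1 \le k, l \le 2^m$)
      $I_X(k,l) = \sum_{i=1}^{n} \sum_{J_1 \in \mathcal{J}_i} \sum_{J_2 \in \mathcal{J}_i} p_{J_1}^{(i)} p_{J_2}^{(i)}\, c_{J_1}(k)\, c_{J_2}(l)\, I(k \in J_1, l \in J_2) / (h_k h_l)$
         $+ \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} p_J^{(i)} c_J(k)\, I(k = l \in J) / h_k^2 - \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} p_J^{(i)} c_J(k)\, c_J(l)\, I(k, l \in J) / (h_k h_l)$,
where $\mathcal{J}_i$ is the set of haplotype configurations consistent with the i-th pool-genotype.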
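 Asymptotic Variances

“Fisher information” matrix:
     $I_K(h) = \sum_x I_x(h)\, p(x)$, $p(x) = \sum_{J \in \mathcal{J}_x} h(J)$,
where $\mathcal{J}_x$ is the set of configurations consistent with pool-genotype x.

Asymptotic variance of $\sqrt{n}\,\hat h$:
     $\Sigma_K = W \big( W' I_K(h)\, W \big)^{-1} W'$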
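  Properties
Fisher information can be represented as
     $I_K(h) = 2K\Lambda - E_x(\mathrm{var}(\Lambda\xi \mid x))$
            $= 2K\,\mathbf{1}\mathbf{1}' + \mathrm{var}(E(\Lambda\xi \mid x))$,
where $\Lambda = \mathrm{diag}(1/h)$, ξ = # of the haplotypes,
ξ ~ MultiNomial(2K, h) (a latent r.v.).
     $I_K(h) \le 2K\Lambda \ \Rightarrow\ \Sigma_K \ge \frac{1}{2K}\,\Sigma_{1/2}$,
where $\Sigma_{1/2} = \mathrm{cov}(\xi) = \mathrm{diag}(h) - hh'$ (ξ taken for K = 1/2).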
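
The equality between the two representations of the Fisher information follows from the law of total variance. A short sketch, writing $\Lambda$ for diag(1/h) (the symbol is garbled in the source, so the name is an assumption) and using only the multinomial covariance of ξ ~ MultiNomial(2K, h):

```latex
\begin{align*}
\operatorname{var}(\Lambda\xi)
  &= \Lambda\,\operatorname{cov}(\xi)\,\Lambda
   = 2K\,\Lambda\,\bigl(\operatorname{diag}(h) - hh'\bigr)\,\Lambda
   = 2K\,\bigl(\Lambda - \mathbf{1}\mathbf{1}'\bigr), \\
\operatorname{var}(\Lambda\xi)
  &= E_x\!\bigl[\operatorname{var}(\Lambda\xi \mid x)\bigr]
   + \operatorname{var}\!\bigl(E(\Lambda\xi \mid x)\bigr), \\
2K\Lambda - E_x\!\bigl[\operatorname{var}(\Lambda\xi \mid x)\bigr]
  &= 2K\Lambda - 2K\bigl(\Lambda - \mathbf{1}\mathbf{1}'\bigr)
   + \operatorname{var}\!\bigl(E(\Lambda\xi \mid x)\bigr)
   = 2K\,\mathbf{1}\mathbf{1}' + \operatorname{var}\!\bigl(E(\Lambda\xi \mid x)\bigr).
\end{align*}
```

The first line uses $\Lambda\,\operatorname{diag}(h)\,\Lambda = \Lambda$ and $\Lambda h = \mathbf{1}$.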
    Reformulation of the Problem

   Let ξ ~ MN(2K, h), and define the 0-1 matrix
              $A = (H_1, H_2, \ldots, H_{2^m})$
   The genotype X can be represented as
              X = A ξ (compressed info.)
   From the incomplete observations X, make
    inference on the distribution/parameter h of
    the unobservable ξ.
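
A small sketch of the compression X = Aξ for m = 2, with hypothetical helper names. It shows two different latent count vectors ξ mapping to the same observed pool-genotype, which is why h has to be inferred from incomplete data.

```python
def haplotype_matrix(m):
    """m x 2^m 0/1 matrix A whose k-th column is haplotype H_k."""
    return [[(k >> j) & 1 for k in range(2 ** m)] for j in range(m)]

def compress(A, xi):
    """Pool-genotype X = A xi for a vector xi of haplotype counts."""
    return [sum(a * c for a, c in zip(row, xi)) for row in A]

A = haplotype_matrix(2)
x1 = compress(A, [2, 1, 0, 1])  # two H1, one H2, one H4
x2 = compress(A, [1, 2, 1, 0])  # one H1, two H2, one H3
print(A, x1, x2)
```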
Simulated and estimated variances
[Figure: two panels. Left: variance vs. D’, one curve per pool size K = 1, 2, 3, 4. Right: variance vs. K, one curve per D’ = 0, 0.25, 0.5, 0.75.]
Variances decrease with D’, increase with K.
(n=120, pa=0.4, pb=0.5)
        Relative efficiency of pooling
[Figure: regions where pooling is efficient and where it is inefficient.]
    Relative Efficiency of Pooling

   Asymptotic relative efficiency (ARE)
           $\mathrm{ARE}(K) = \Sigma_1 / \Sigma_K$

   Relative efficiency (RE) (for a fixed
    number of individuals n; V = variance or
    MSE)
             $\mathrm{RE}(K) = K \times V_1 / V_K$
       Asymptotic variances
Table: 2 SNPs, pa=0.4, pb=0.5

                                  K
D’     1/2    1      2      3      4      5      6      7      8
0.00   0.780  0.490  0.365  0.323  0.303  0.290  0.282  0.276  0.271
0.25   0.720  0.462  0.338  0.296  0.275  0.263  0.254  0.248  0.244
0.50   0.700  0.403  0.270  0.229  0.208  0.195  0.187  0.180  0.176
0.75   0.650  0.344  0.196  0.151  0.130  0.118  0.109  0.104  0.099
0.98   0.586  0.294  0.148  0.100  0.076  0.061  0.052  0.046  0.041
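
As a worked illustration, the asymptotic relative efficiency can be computed directly from this table. The sketch assumes that ARE(K) is the ratio of the K=1 entry to the K entry in the same row, which is one reading of the garbled formula in the source. The dictionary layout and function name are hypothetical.

```python
K_LABELS = ["1/2", "1", "2", "3", "4", "5", "6", "7", "8"]
ASYMPTOTIC_VARIANCE = {
    "0.00": [0.780, 0.490, 0.365, 0.323, 0.303, 0.290, 0.282, 0.276, 0.271],
    "0.25": [0.720, 0.462, 0.338, 0.296, 0.275, 0.263, 0.254, 0.248, 0.244],
    "0.50": [0.700, 0.403, 0.270, 0.229, 0.208, 0.195, 0.187, 0.180, 0.176],
    "0.75": [0.650, 0.344, 0.196, 0.151, 0.130, 0.118, 0.109, 0.104, 0.099],
    "0.98": [0.586, 0.294, 0.148, 0.100, 0.076, 0.061, 0.052, 0.046, 0.041],
}

def are(d_prime, k_label):
    """Assumed definition: variance at K=1 divided by variance at pool size K."""
    row = ASYMPTOTIC_VARIANCE[d_prime]
    return row[K_LABELS.index("1")] / row[K_LABELS.index(k_label)]

for d_prime in ASYMPTOTIC_VARIANCE:
    print(d_prime, [round(are(d_prime, k), 2) for k in K_LABELS[1:]])
```

Under this reading, ARE at K=8 is roughly 1.8 for D’=0 and roughly 7.2 for D’=0.98, in line with the statement that higher LD gives higher efficiency.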
    Asymptotic Relative Efficiency (ARE)
[Figure: relative efficiency vs. K (K = 1, ..., 8), one curve per D’ = 0, 0.25, 0.5, 0.75, 0.98; pa=0.4, pb=0.5.]
Higher LD, higher efficiency.
     Simulations (1,000 replicates)

   2-locus (a/A, and b/B):
        Different choices of allele frequencies, LD
        coefficients, sample sizes and pool sizes:
       pa= 0.4, pb= 0.5 and pa = 0.2, pb = 0.3;
       D’=0.25, 0.5, 0.75;
       n=60, 120, 180;
       K=1,2,…,6.
    Simulations (1,000 replicates)
   3-locus: based on real individual genotype
    data
       Infinite population:
        generate haplotypes according to the
        known haplotype frequencies, then pool 2
        haplotypes to form individual genotypes, pool K
        individuals to form pool-genotypes.
       Finite population (pseudo-pooling):
        Randomly pool every K individual genotypes to
        generate pool-genotypes.
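
The two simulation schemes can be sketched as below. This is an illustrative reading of the description, not the code used for the reported results: function names are hypothetical, the haplotype ordering H1..H8 is assumed to follow the m=3 notation table, leftover individuals that do not fill a pool are dropped, and the frequency vector is the one quoted for the 3-locus data.

```python
import random

def simulate_pools(h, m, n_individuals, K, rng):
    """Infinite population: draw 2K haplotypes per pool from h and sum them."""
    haps = [tuple((k >> j) & 1 for j in range(m)) for k in range(2 ** m)]
    pools = []
    for _ in range(n_individuals // K):
        drawn = rng.choices(haps, weights=h, k=2 * K)
        pools.append(tuple(sum(hap[j] for hap in drawn) for j in range(m)))
    return pools

def pseudo_pool(genotypes, K, rng):
    """Finite population: shuffle individual genotypes and sum every K of them."""
    shuffled = list(genotypes)
    rng.shuffle(shuffled)
    groups = [shuffled[i:i + K] for i in range(0, len(shuffled) - K + 1, K)]
    return [tuple(sum(col) for col in zip(*g)) for g in groups]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
h = [0.0, 0.082, 0.0, 0.0, 0.524, 0.283, 0.005, 0.106]
pools = simulate_pools(h, m=3, n_individuals=120, K=2, rng=rng)
print(len(pools), pools[:3])
```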
                  MSE of haplotype estimates
[Figure: two-locus case (a/A and b/B), pa=0.4, pb=0.5, n=180. MSE vs. K (K = 1, ..., 6), one curve per D’ = 0.25, 0.5, 0.75.]
   MSE increases as K increases.
   For SNPs in higher LD, it is easier (less error) to estimate haplotype frequencies.
                                               Relative efficiencies
[Figure: two-locus case (a/A and b/B), pa=0.4, pb=0.5, n=180. Relative efficiency vs. K (K = 1, ..., 6), one curve per D’ = 0.25, 0.5, 0.75.]
   RE(K) increases with K, but seems to level off when K ≥ 4.
   The higher the LD, the higher the efficiency of pooling.
   $V_6 \approx 2V_1$
Individual genotype data: 3-locus
(data provided by Dr. Kumar)

       135 unrelated individuals genotyped at 3
        SNPs in the AGT gene.
       All the individuals are normal Caucasians.
        High LD: $D'_{12} = 0.91$, $D'_{23} = D'_{13} = 1$


       This data set was used for
         Simulation according to the estimated h
         Pseudo-pooling simulation.
                 Relative efficiency of pooling
                 (3-locus)
[Figure: relative efficiency (RE) vs. K (K = 1, ..., 6), one curve per sample size n = 60, 120, 180. Haplotypes are generated according to h=(0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106).]
       Pseudo-pooling (table)

Table: Haplotype frequency estimates of the pseudo-pooling experiment based
on Kumar data (n=120)
      Influence of missing values on MSE and RE
[Figure: two panels, each plotted against K (K = 1, ..., 6), one curve per missing rate = 0, 0.03, 0.05. Left: MSE. Right: relative efficiency.]
Real Data

   Pool-genotypes at 10 SNPs in the AGT gene,
   15 pools, each with K=2 independent individuals.
   Individual genotypes are not available.
   All individuals are unrelated.
   There are 2% completely missing values.
Haplotype frequency estimates
(7 SNPs)
    Case-control study

   Association of haplotypes and diseases.
    Test difference of haplotypes between
    case and control group.
   Likelihood ratio test (LRT):
    $\mathrm{LRT} = 2\log L_{case} + 2\log L_{control} - 2\log L_{case+control}$.
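
The slide does not state a reference distribution for this statistic. A hedged sketch of the usual large-sample argument, assuming standard regularity conditions: the separate model has $2(2^m - 1)$ free haplotype frequencies and the combined model has $2^m - 1$, so the difference gives the degrees of freedom.

```latex
H_0:\ h_{\text{case}} = h_{\text{control}}, \qquad
\mathrm{LRT} = 2\Bigl[\log L_{\text{case}}(\hat h_{\text{case}})
  + \log L_{\text{control}}(\hat h_{\text{control}})
  - \log L_{\text{case+control}}(\hat h_{0})\Bigr]
\ \overset{\text{approx.}}{\sim}\ \chi^2_{\,2^m - 1}
```

Here $\hat h_0$ denotes the estimate from the combined case and control data. When some haplotype frequencies are zero or near zero, the parameters sit on the boundary and this approximation may be poor; a permutation-based reference distribution is a common alternative in that situation.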
    Summary
    Pooling is efficient for m=1,2 but not
     for m > 2 when LD is low.
    For m > 2 and high LD, pooling is
     good.
    The variance estimates are good if n/K
     is large (say, ≥ 30); otherwise, use the
     bootstrap.
Summary (cont’d)
   Pools may have different pool sizes.
   The algorithm allows for different
    types of missing values.
   Can be applied to case-control design
    for disease-haplotype association.
   Algorithms for long haplotypes are needed.
   Genotyping errors need to be considered.
    References
   Clark A (1990) Mol. Biol. Evol. 7, 111-122.
   Collins F et al. (2003) Nature 422, 835-847.
   Excoffier L & Slatkin M (1995) Mol. Biol. Evol. 12, 921-927.
   Ito et al. (2003) Am. J. Hum. Genet. 72, 384-398.
   Wang S, Kidd KK & Zhao H (2003) Genet. Epidemiol. 24, 74-82.
   Yang et al. (2003) PNAS 100, 7225-7230.
Acknowledgement

We thank Dr. A. Kumar at the New York
Medical College in Valhalla for providing
the individual SNP data.


            Thank You!