
SNP Haplotype Estimation From Pooled DNA Samples

Yaning Yang
Lab of Statistical Genetics
Rockefeller University
(yyang@linkage.Rockefeller.edu)

I. Background

Genotype-phenotype
Genotype-phenotype association is the central objective of genetic studies.
Completion of the human genome sequence is the foundation (Collins et al. 2003).

Geno-phenotype
Genotype
  Internally coded, inheritable information
  Needs to be polymorphic (variation)
  Interacts with environmental factors
Phenotype
  Outward, physical manifestation of the organism
  Disease status, survival times, quantitative traits (QTL)
  Complex traits/diseases, simple traits/diseases

Simple (Mendelian) Traits
Single gene, simple mode of inheritance.
Examples: Huntington disease, cystic fibrosis.
Method: linkage analysis
  Co-segregation of marker and disease within pedigrees; based on recombination events.
More than 400 simple diseases have been genetically mapped.

Complex Traits
Polygenic + environmental factors = complex.
Epistasis (interaction); hard to dissect.
Polygenic: multiple genes, each with a small to moderate effect.
Environmental factors: race, gender, diet, etc.
Examples: cancer, diabetes, Alzheimer's disease (AD), etc.
Methods: association analysis
  population-based
  family-based

Polymorphism
A difference in DNA nucleotide sequence among individuals or populations.
P(minor allele) > 0.1, say.
Genetic mutations give rise to polymorphisms.
Marker: locus-specific polymorphism
  Like a net (chaining) in approximation
  A significant marker may itself be, or be close to, the causal genetic variant

SNP: Single Nucleotide Polymorphism
The simplest and most common genetic polymorphism
  Simple: a single base mutation in DNA
  Common: ~90% of all human DNA variations
  Abundance: ~0.1%
  Biallelic (binary)
  ~2 million SNPs reported

Genotyping
                 SNP1 (locus 1)   SNP2 (locus 2)
  haplotype            G                A
  haplotype            T                C          (diploid)
  Genotype:           G/T              A/C
At each locus there are two possible alleles; e.g., at locus 1 the genotype can be G/G, T/T or T/G.
Association
Population-based
  Epidemiological methods: case-control and cohort designs
  Stratification: control over confounding factors
  Powerful, but prone to spurious associations due to population admixture (heterogeneity)
Family-based
  TDT test (McNemar's test for matched pairs)
  No need for stratification; detects true associations
  Sampling is costly

Association
Identification of causal genetic variants.
Understanding their functions and disease etiology.
Helps disease prevention, diagnostics and drug development (e.g., personalized medicine).

Allele Association (marginal)
1. Disease-allele association
              G     T
  case       32    38
  control    28    30
  $\chi^2 = 0.012$, $p = 0.912$
2. Disease-genotype association
             G/G   G/T   T/T
  case       12     8    15
  control     6    16     7
  $\chi^2 = 7.075$, $p = 0.029$

Haplotype Association (joint)
Most genome screens test one locus at a time.
The dependence structure (linkage disequilibrium) needs to be considered.
Example: haplotypes for a case and a control
      case              control
  ---A-----B---      ---A-----b---
  ---a-----b---      ---a-----B---

  Haplotype        Case   Control
  ---A-----B---      1       0
  ---a-----b---      1       0
  ---A-----b---      0       1
  ---a-----B---      0       1
Both individuals have the same single-locus genotypes (A/a, B/b) but carry different haplotypes.

Haplotype
The total number of haplotypes is $2^m$ for m SNPs.
For example (m=3, biallelic at each position: A/a, B/b, C/c; coding a=b=c=0, A=B=C=1):
  haplotype   H1    H2    H3    H4    H5    H6    H7    H8
  alleles     abc   Abc   aBc   ABc   abC   AbC   aBC   ABC
  binary      000   100   010   110   001   101   011   111

Why Haplotype?
LD: alleles in linkage disequilibrium (LD) are tightly linked and tend to be co-segregated.
LD plays a fundamental role in genetic mapping of complex diseases.
Haplotypes preserve LD information.

Why Haplotype?
Haplotype: a haplotype is a binary sequence along one chromosome.
Haplotypes have a block-wise structure, with blocks separated by recombination hot spots.
Within each block, recombination is rare due to tight linkage, and only very few haplotypes actually occur.
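The two marginal tests on the Allele Association slide can be checked with a few lines of arithmetic. The sketch below is an illustration only: the function name `chi_square` is hypothetical, and the use of a Yates continuity correction for the 2x2 allele table is an assumption (the slide does not say which variant was used; the correction happens to reproduce its statistic). The p-values use the closed forms for 1 and 2 degrees of freedom so that no external library is needed.

```python
import math

def chi_square(table, yates=False):
    """Pearson chi-square statistic and p-value for an r x c table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total
            diff = abs(obs - exp)
            if yates:  # continuity correction (assumed for the 2x2 table)
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 df
    elif df == 2:
        p = math.exp(-stat / 2)             # chi-square survival function, 2 df
    else:
        raise ValueError("closed-form p-value only implemented for df = 1 or 2")
    return stat, p

# Disease-allele table (rows: case, control; columns: G, T)
allele_stat, allele_p = chi_square([[32, 38], [28, 30]], yates=True)
# Disease-genotype table (rows: case, control; columns: G/G, G/T, T/T)
geno_stat, geno_p = chi_square([[12, 8, 15], [6, 16, 7]])

print(round(allele_stat, 3), round(allele_p, 3))
print(round(geno_stat, 3), round(geno_p, 3))
```

Under these assumptions the allele table gives a statistic of about 0.012 with p of about 0.91, and the genotype table gives about 7.08 with p of about 0.029, in line with the slide up to rounding.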
A Brief Summary
Genotype-phenotype association analysis for complex disease
Genetic variant/polymorphism/marker
Genetic variation - human variation
SNP: simple, abundant genetic variant/marker
LD: dependence of markers
Haplotype analysis = joint distribution

II. Haplotype Estimation From Pooled DNA

Introduction
Key Words: Efficiency, EM Algorithm, Haplotype Frequency, LD Coefficients, Pooling, Variance Estimates.

Estimating haplotype frequencies from individual genotypes
Individual samples are genotyped. No phase information.
Likelihood analysis (Excoffier & Slatkin 1995), but no variance estimate.
Other methods: Clark's parsimony method (Clark 1990), Bayesian MCMC, ...

Genotyping individual DNA
Diploid:
  ---A-----B---   haplotype
  ---a-----b---   haplotype
Genotyping: A/a, B/b are the observed genotypes (phase information is lost).
Reconstruct: possible haplotype configurations
  ---A-----B---        ---A-----b---
  ---a-----b---   or   ---a-----B---

Pooling: Reduce Genotyping Cost
Unrelated individual samples are mixed, so there are more ambiguities in recovering haplotypes.
No individual information and no phase information.
Pooling is efficient for allele frequency estimation, but is it efficient for estimating haplotype frequencies? Wang et al. (2003), Ito et al. (2003).

Genotyping pooled DNA
Pooling:
  ---A-----B---
  ---a-----b---   diploid for individual 1
  ---A-----b---
  ---a-----B---   diploid for individual 2
Genotyping: AAaa, BBbb are the observed pool-genotypes.
Possible haplotype configurations:
  ---A-----B---        ---A-----B---        ---A-----b---
  ---A-----b---   or   ---A-----B---   or   ---A-----b---
  ---a-----B---        ---a-----b---        ---a-----B---
  ---a-----b---        ---a-----b---        ---a-----B---

Pool-genotype of a K-pool
Pool-genotype: X = number of copies of allele 1 at each SNP. E.g., for K=2:
  SNP              1  2  3  4  5
  Individual 1     1  0  1  1  0
                   1  1  0  0  0
  Individual 2     0  0  1  0  1
                   1  1  0  0  1
  Pool-genotype    3  2  2  1  2
An individual can be viewed as a pool of two independent chromosomes. We will say a chromosome is a 1/2-pool.

Missing values
The pool-genotype at m SNP loci is $X = (X_1, X_2, \ldots, X_m)$, with $X_j \in \{0, 1, \ldots, 2K\}$.
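The pool-genotype defined above is simply the column-wise sum of the 0/1 haplotypes in the pool. As a small sketch (the function name `pool_genotype` is hypothetical, chosen here for illustration), the K=2 example from the slide can be recomputed as follows:

```python
def pool_genotype(haplotypes):
    """Count copies of allele 1 at each SNP across all chromosomes in a pool."""
    n_snps = len(haplotypes[0])
    if any(len(h) != n_snps for h in haplotypes):
        raise ValueError("all haplotypes must cover the same SNPs")
    return [sum(column) for column in zip(*haplotypes)]

# Two chromosomes per individual, five SNPs, alleles coded 0/1 (slide example).
individual_1 = [(1, 0, 1, 1, 0), (1, 1, 0, 0, 0)]
individual_2 = [(0, 0, 1, 0, 1), (1, 1, 0, 0, 1)]

# A K=2 pool mixes both individuals: four chromosomes in total.
pool = pool_genotype(individual_1 + individual_2)
print(pool)  # [3, 2, 2, 1, 2]

# A single individual is a K=1 pool of two chromosomes.
single = pool_genotype(individual_1)
print(single)  # [2, 1, 1, 1, 0]
```

Each entry lies between 0 and 2K, and the phase of the contributing chromosomes cannot be read back from the sums alone.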
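Completely missing: no information.
Partially missing: partial information, e.g., it is only known that $X_j \ge 1$.

Statistical Methods
Key Words: Asymptotic Variance, EM, MLE, Missing Data, Relative Efficiency.

Notations
For m SNPs, each position can take two possible alleles, denoted 1 and 0. In total there are $2^m$ possible haplotypes.
Haplotype frequencies: $h = (h_1, \ldots, h_{2^m})$.
For m=3:
           H1  H2  H3  H4  H5  H6  H7  H8
  SNP 1     0   1   0   1   0   1   0   1
  SNP 2     0   0   1   1   0   0   1   1
  SNP 3     0   0   0   0   1   1   1   1

Maximum Likelihood Estimate
Assumptions: HWE, random mating.
Likelihood:
$$L(X, h) = \prod_{i=1}^{n} \sum_{J \in \mathcal{J}_i} h(J),$$
where
$$\mathcal{J}_i = \{ J = (H_{j_1}, H_{j_2}, \ldots, H_{j_{2K}}) : J \text{ consistent with } x_i \}, \qquad h(J) = h_{j_1} h_{j_2} \cdots h_{j_{2K}}.$$
When K=1/2 (haplotypes observed directly), the likelihood is multinomial.

An Example
m=2, K=2, observation X=(2,1). The consistent haplotype configurations are
  J1:  1 1  (h4)         J2:  1 0  (h2)
       1 0  (h2)   or         1 0  (h2)
       0 0  (h1)              0 1  (h3)
       0 0  (h1)              0 0  (h1)
$$\text{Likelihood} \propto 2 h_1^2 h_2 h_4 + 2 h_1 h_2^2 h_3$$

EM algorithm
Initial value: $h_k = h_k^{(0)}$, $k = 1, 2, \ldots, 2^m$.
E-step: for each pool i and each $J \in \mathcal{J}_i$,
$$p_J^{(i)} = \Pr(J \mid \text{obs}, h) = \frac{h(J)}{\sum_{J' \in \mathcal{J}_i} h(J')}.$$
M-step:
$$h_k^{\text{new}} = \frac{1}{2nK} \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} c_J(k)\, p_J^{(i)},$$
where $c_J(k)$ = number of copies of haplotype k in configuration J.
Convergence check: if $|h_k^{\text{new}} - h_k| < \varepsilon$ for all k, END; otherwise set $h_k = h_k^{\text{new}}$ and return to the E-step.

Variance Estimate
Variance matrix for the estimated h:
$$\hat{\Sigma}_K = W \left( W' I_X(\hat{h})\, W \right)^{-1} W', \qquad \text{for } W = \begin{pmatrix} I_{2^m - 1} \\ -\mathbf{1}' \end{pmatrix},$$
and the (k,l) element of the matrix $I_X(h)$, $1 \le k, l \le 2^m$, is given by
$$I_X(k,l) = \sum_{i=1}^{n} \sum_{J_1 \in \mathcal{J}_i} \sum_{J_2 \in \mathcal{J}_i} p_{J_1}^{(i)} p_{J_2}^{(i)} c_{J_1}(k)\, c_{J_2}(l)\, I(k \in J_1, l \in J_2) / (h_k h_l) + \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} p_J^{(i)} c_J(k)\, I(k = l \in J) / h_k^2 - \sum_{i=1}^{n} \sum_{J \in \mathcal{J}_i} p_J^{(i)} c_J(k)\, c_J(l)\, I(k, l \in J) / (h_k h_l).$$

Asymptotic Variances
"Fisher information" matrix:
$$I_K(h) = \sum_{x} I_x(h)\, p(x), \qquad p(x) = \sum_{J \in \mathcal{J}_x} h(J).$$
Asymptotic variance of $\sqrt{n}\, \hat{h}$:
$$\Sigma_K = W \left( W' I_K(h)\, W \right)^{-1} W'.$$

Properties
The Fisher information can be represented as
$$I_K(h) = 2K\Lambda - \Lambda\, E_x\!\left( \operatorname{var}(\xi \mid x) \right) \Lambda = 2K\, \mathbf{1}\mathbf{1}' + \Lambda \operatorname{var}\!\left( E(\xi \mid x) \right) \Lambda,$$
where $\Lambda = \operatorname{diag}(1/h)$ and $\xi$ = the vector of haplotype counts in a pool, $\xi \sim \text{Multinomial}(2K, h)$ (a latent random vector).
$$I_K(h) \le 2K\Lambda, \qquad \Sigma_K \ge \frac{1}{2K}\, \Sigma_{1/2},$$
where $\Sigma_{1/2} = \operatorname{cov}(\xi) = \operatorname{diag}(h) - hh'$ for K=1/2.

Reformulation of the Problem
Let $\xi \sim \text{MN}(2K, h)$, and let A be the 0-1 matrix $A = (H_1, H_2, \ldots, H_{2^m})$.
The genotype X can be represented as $X = A\xi$ (compressed information).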
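Using the representation X = Aξ, the E- and M-steps above can be sketched by brute force. This is a minimal illustration under stated assumptions, not the implementation behind the slides: all function names are hypothetical, K must be a positive integer, there are no missing values, the start is uniform, and every count vector ξ is enumerated, which is only feasible for small m and K. Unordered count vectors are weighted by their multinomial coefficient, which is equivalent to summing h(J) over ordered configurations J.

```python
from math import factorial, prod

def haplotypes(m):
    """All 2^m haplotypes; SNP 1 varies fastest, as in the H1..H8 table."""
    return [tuple((k >> j) & 1 for j in range(m)) for k in range(2 ** m)]

def compositions(total, parts):
    """All tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def consistent_configs(x, K, haps):
    """Count vectors xi with sum 2K and A xi = x, plus their multiplicities."""
    configs = []
    for xi in compositions(2 * K, len(haps)):
        geno = tuple(sum(c * h[j] for c, h in zip(xi, haps))
                     for j in range(len(x)))
        if geno == tuple(x):
            mult = factorial(2 * K)
            for c in xi:
                mult //= factorial(c)  # multinomial coefficient
            configs.append((xi, mult))
    return configs

def em_haplotype_freqs(pools, K, m, tol=1e-8, max_iter=1000):
    """EM estimate of haplotype frequencies from complete pool-genotypes."""
    haps = haplotypes(m)
    configs = [consistent_configs(x, K, haps) for x in pools]
    h = [1.0 / len(haps)] * len(haps)  # uniform initial value (assumption)
    for _ in range(max_iter):
        expected = [0.0] * len(haps)
        for cfgs in configs:
            # E-step: posterior weight of each consistent configuration
            weights = [mult * prod(hk ** c for hk, c in zip(h, xi))
                       for xi, mult in cfgs]
            norm = sum(weights)
            for (xi, _), w in zip(cfgs, weights):
                for k, c in enumerate(xi):
                    expected[k] += c * w / norm
        # M-step: expected haplotype counts divided by 2nK chromosomes
        h_new = [e / (2 * K * len(pools)) for e in expected]
        if max(abs(a - b) for a, b in zip(h_new, h)) < tol:
            return h_new
        h = h_new
    return h

# The slide's example: m=2, K=2, X=(2,1) has exactly two consistent configurations.
example_configs = consistent_configs((2, 1), 2, haplotypes(2))
print(example_configs)

# Unambiguous K=1 pools: two individuals 11/11 and two individuals 00/00.
h_hat = em_haplotype_freqs([(2, 2), (0, 0), (0, 0), (2, 2)], K=1, m=2)
print(h_hat)
```

For the slide's example the two count vectors correspond to the terms in h1^2 h2 h4 and h1 h2^2 h3; their common multiplicity is a constant factor that cancels in the E-step. Handling partially missing pool-genotypes would mean relaxing the exact match `geno == tuple(x)` to the known constraints.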
From the incomplete observations X, make inference on the distribution/parameter h of the unobservable ξ.

Simulated and estimated variances
[Figure: two panels, variance versus D' (curves for K = 1, 2, 3, 4) and variance versus K (curves for D' = 0, 0.25, 0.5, 0.75).]
Variances decrease with D' and increase with K (n=120, pa=0.4, pb=0.5).

Relative efficiency of pooling
[Figure: regions labeled "efficient" and "inefficient".]

Relative Efficiency of Pooling
Asymptotic relative efficiency (ARE):
$$\mathrm{ARE}(K) = \sigma_1^2 / \sigma_K^2,$$
where $\sigma_K^2$ is the asymptotic variance for pool size K.
Relative efficiency (RE), for a fixed number of individuals n, with V = variance or MSE:
$$\mathrm{RE}(K) = K\, V_1 / V_K.$$

Asymptotic variances
Table: 2 SNPs, pa=0.4, pb=0.5
                                    K
  D'      1/2     1       2       3       4       5       6       7       8
  0.00    0.780   0.490   0.365   0.323   0.303   0.290   0.282   0.276   0.271
  0.25    0.720   0.462   0.338   0.296   0.275   0.263   0.254   0.248   0.244
  0.50    0.700   0.403   0.270   0.229   0.208   0.195   0.187   0.180   0.176
  0.75    0.650   0.344   0.196   0.151   0.130   0.118   0.109   0.104   0.099
  0.98    0.586   0.294   0.148   0.100   0.076   0.061   0.052   0.046   0.041

Asymptotic Relative Efficiency (ARE)
[Figure: relative efficiency versus K = 1, ..., 8 for D' = 0, 0.25, 0.5, 0.75, 0.98; pa=0.4, pb=0.5.]
Higher LD, higher efficiency.

Simulations (1,000 replicates)
2-locus (a/A and b/B): different choices of allele frequencies, LD coefficients, sample sizes and pool sizes:
  pa=0.4, pb=0.5 and pa=0.2, pb=0.3;
  D'=0.25, 0.5, 0.75;
  n=60, 120, 180;
  K=1, 2, ..., 6.

Simulations (1,000 replicates)
3-locus: based on real individual genotype data.
Infinite population: generate haplotypes according to the known haplotype frequencies, then pool 2 haplotypes to form individual genotypes, and pool K individuals to form pool-genotypes.
Finite population (pseudo-pooling): randomly pool every K individual genotypes to generate pool-genotypes.

MSE of haplotype estimates
[Figure: MSE versus K = 1, ..., 6 for different D'. Two-locus: a/A and b/B, pa=0.4, pb=0.5; n=180.]
MSE increases as K increases.
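The asymptotic relative efficiencies discussed for the two-SNP case can be recomputed directly from the asymptotic variance table (pa=0.4, pb=0.5). The sketch below assumes that ARE(K) is the ratio of the K=1 entry to the pool-size-K entry in the same row; the names `asymptotic_variance` and `are` are hypothetical, and only the K=1 to K=8 columns of the table are used.

```python
# Asymptotic variances copied from the table; list position K-1 holds pool size K.
asymptotic_variance = {
    0.00: [0.490, 0.365, 0.323, 0.303, 0.290, 0.282, 0.276, 0.271],
    0.25: [0.462, 0.338, 0.296, 0.275, 0.263, 0.254, 0.248, 0.244],
    0.50: [0.403, 0.270, 0.229, 0.208, 0.195, 0.187, 0.180, 0.176],
    0.75: [0.344, 0.196, 0.151, 0.130, 0.118, 0.109, 0.104, 0.099],
    0.98: [0.294, 0.148, 0.100, 0.076, 0.061, 0.052, 0.046, 0.041],
}

def are(d_prime, K):
    """Assumed ARE(K): variance at K=1 divided by variance at pool size K."""
    row = asymptotic_variance[d_prime]
    return row[0] / row[K - 1]

for d_prime in sorted(asymptotic_variance):
    ratios = [round(are(d_prime, K), 2) for K in range(1, 9)]
    print(f"D'={d_prime:.2f}: {ratios}")
```

For example, at D'=0.98 and K=8 the ratio is 0.294 / 0.041, about 7.2, while at D'=0 it is 0.490 / 0.271, about 1.8, which matches the observation that higher LD gives higher efficiency.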
For SNPs in higher LD, it is easier (smaller MSE) to estimate haplotype frequencies.

Relative efficiencies
[Figure: relative efficiency versus K = 1, ..., 6 for D' = 0.25, 0.5, 0.75. Two-locus: a/A and b/B, pa=0.4, pb=0.5; n=180.]
RE(K) increases with K, but seems to level off when $K \ge 4$.
The higher the LD, the higher the efficiency of pooling.
$V_6 \approx 2V_1$

Individual genotype data: 3-locus (data provided by Dr. Kumar)
135 unrelated individuals genotyped at 3 SNPs in the AGT gene.
All the individuals are normal Caucasians.
High LD: $D'_{12} = 0.91$, $D'_{23} = D'_{13} = 1$.
This data set was used for:
  Simulation according to the estimated h
  Pseudo-pooling simulation

Relative efficiency of pooling (3-locus)
[Figure: relative efficiency (RE) versus K = 1, ..., 6 for n = 60, 120, 180.]
Haplotypes are generated according to h=(0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106).

Pseudo-pooling (table)
Table: Haplotype frequency estimates of the pseudo-pooling experiment based on the Kumar data (n=120)

Influence of missing values on MSE and RE
[Figure: two panels, MSE versus K and relative efficiency versus K, K = 1, ..., 6, for missing rates 0, 0.03 and 0.05.]

Real Data
Pool-genotypes at 10 SNPs in the AGT gene; 15 pools, each with K=2 independent individuals.
Individual genotypes are not available.
All individuals are unrelated.
There are 2% completely missing values.

Haplotype frequency estimates (7 SNPs)

Case-control study
Association of haplotypes and diseases.
Test the difference in haplotype frequencies between the case and control groups.
Likelihood ratio test:
$$\mathrm{LRT} = 2\log L_{\text{case}} + 2\log L_{\text{control}} - 2\log L_{\text{case+control}}.$$

Summary
Pooling is efficient for m=1, 2, but not for m>2 when LD is low.
For m>2 with high LD, pooling is good.
The variance estimates are good if n/K is large (say, 30); otherwise, use the bootstrap.

Summary (cont'd)
Pools may have different pool sizes.
The algorithm allows for different types of missing values.
Can be applied to the case-control design for disease-haplotype association.
Algorithms for long haplotypes are needed.
Genotyping errors need to be considered.

References
Clark A (1990) Mol. Biol. Evol. 7, 111-122.
Collins FS et al. (2003) Nature 422, 835-847.
Excoffier L & Slatkin M (1995) Mol. Biol. Evol. 12, 921-927.
Ito et al. (2003) Am. J. Hum. Genet. 72, 384-398.
Wang S, Kidd KK & Zhao H (2003) Genet. Epidemiol. 24, 74-82.
Yang et al. (2003) PNAS 100, 7225-7230.

Acknowledgement
We thank Dr. A. Kumar at the New York Medical College in Valhalla for providing the individual SNP data.

Thank You!
