Statistical challenges in the design and analysis of whole genome association studies

Jenny Barrett
Division of Genetic Epidemiology
University of Leeds
May 2006

Outline
• Introduction
• Genetic association – direct and indirect
• Whole genome association (WGA) studies
• Statistical issues
  • Choice of genetic polymorphism (SNPs)
  • Study design
  • Quality control
  • Statistical analysis
  • Interpretation of results – multiple testing
• Conclusions/future prospects

The Human Genome
• Human genome is diploid, meaning we have two copies of each chromosome (one from each parent)
• Human haploid genome consists of 3 x 10^9 base pairs arranged along chromosomes
• There are several types of variation (polymorphism) in the genome, including base-pair repeat sequences that are highly variable between people
• Sites in the genome where many individuals differ at a single base are called single nucleotide polymorphisms (SNPs)
  ...AGCTA...
  ...AACTA...

Finding disease genes
Previously two main study design options for searching for disease genes:
• Linkage analysis (c. 400 highly variable markers spanning the genome; looking for regions of DNA shared by affected relatives within families)
• Association with candidate genes (polymorphisms within specific genes of biological interest, e.g. DNA repair genes for cancers. Are specific variants more common in cases than controls?)
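
The candidate-gene question above ("are specific variants more common in cases than controls?") can be illustrated with a small allele-count comparison. This is only a sketch: the function names and the genotype counts are hypothetical and do not come from the talk, and the G/A alleles simply follow the SNP example shown on the genome slide.

```python
def allele_counts(n_aa, n_ag, n_gg):
    """Return (number of A alleles, number of G alleles) from genotype counts."""
    return 2 * n_aa + n_ag, 2 * n_gg + n_ag


def allelic_odds_ratio(case_genotypes, control_genotypes):
    """Odds ratio for carrying allele A rather than G, cases versus controls.

    Each argument is a tuple of genotype counts (AA, AG, GG).
    """
    case_a, case_g = allele_counts(*case_genotypes)
    ctrl_a, ctrl_g = allele_counts(*control_genotypes)
    if case_g == 0 or ctrl_a == 0:
        raise ValueError("odds ratio undefined: a required allele count is zero")
    return (case_a * ctrl_g) / (case_g * ctrl_a)


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AG, GG), for illustration only.
    cases = (30, 50, 20)
    controls = (20, 50, 30)
    case_a, case_g = allele_counts(*cases)
    ctrl_a, ctrl_g = allele_counts(*controls)
    print("A allele frequency in cases:   ", case_a / (case_a + case_g))
    print("A allele frequency in controls:", ctrl_a / (ctrl_a + ctrl_g))
    print("Allelic odds ratio:", allelic_odds_ratio(cases, controls))
```

An odds ratio above 1 would suggest that allele A is more common among cases; whether the difference is more than chance still needs a formal test.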

Genetic association
Association between a genetic variant and disease can arise in two main ways:
• Direct association: the variant alters the function of the gene and hence increases the chances of the individual developing the disease (a causal variant)
• Indirect association: the variant is correlated with a causal genetic variant nearby on the same chromosome

Linkage disequilibrium
• When a genetic variant first arises through mutation, it is perfectly correlated with nearby variants
• Over many generations, this correlation will gradually become weaker
• When genetic variants are correlated in a population they are said to be in linkage disequilibrium (LD)
• Nearby loci are more likely to be in stronger LD, although the relationship is complex
Palmer and Cardon, Lancet 2005

HapMap
• Builds upon sequencing of the human genome (Human Genome Project)
• Launched in 2002 to study patterns of human sequence variation across the genome and create a public database
• Based on 269 individuals from 4 populations
• Results confirm that there are commonly long segments of strong LD in the genome

HapMap project and public databases

WGA studies
• Potential for greater power from association compared with linkage has long been recognised
• Association studies rely on LD rather than linkage; LD extends over much smaller regions
• Hence a huge number of markers is needed to cover the genome; this has only recently become technically feasible
• 10.4 million human SNPs are now listed in the public SNP database dbSNP
• Only a proportion of these SNPs needs to be genotyped – enough to capture most genetic variation through LD

“WGA” studies to date
• Myocardial infarction, 60k SNPs (Ozaki et al, 2002)
• Age-related macular degeneration, 116k SNPs and only 96 cases, 50 controls (Klein et al, 2005)
• Parkinson Disease, 200k SNPs, 443 discordant sib-pairs in phase 1, 300 case-control pairs in phase 2 (Maraganore et al, 2005)
• Obesity, 87k SNPs (Herbert et al, 2006)

Coronary Artery Disease (CAD)
Wellcome Trust Case Control Consortium studying 7 diseases: CAD, Type 1 diabetes, Type 2 diabetes, Rheumatoid arthritis, Inflammatory bowel disease, Bipolar disorder, Hypertension
CAD cases are drawn from the British Heart Foundation Family Heart Study of 2000 affected sibling pair families and the GRACE study of 1000 affected and unaffected sibling pairs
• National studies conducted in Leeds (and Leicester)

CAD study design
Two phase study
• Phase 1:
  • 1,000 patients with each disease from across GB
  • 3,000 controls (1,500 from the 1958 birth cohort, 1,500 blood donors from across GB)
• Phase 2:
  • Further 1,000 cases to be genotyped with 2 to 5% of SNPs selected on basis of results from all disease groups
  • All 2,000 cases will be compared for these SNPs against the 3,000 controls

Melanoma
On behalf of Genomel (international melanoma genetics consortium) we recently received European Commission funding for a WGA study of melanoma. Melanoma cases and controls will be drawn from samples already collected across the Genomel consortium.

Melanoma study design
Two (or more) phase study
• Phase 1:
  • 1,000 patients with melanoma from UK, mainland Europe and Australia
  • Cases preferentially selected for family history of disease, but excluding those with known mutations in melanoma gene CDKN2A in family
  • 1,000 population-based controls
• Phase 2 (under development):
  • Up to 4,000 cases and 4,000 controls will be studied for a subset of polymorphisms from phase 1
  • Polymorphisms in candidate genes will also be genotyped

Statistical issues: 1.
Choice of SNPs
Main criteria
• Call rate (missing data may not be missing completely at random)

Effect of missing data
• Failure to call genotypes is generally not independent of genotype, resulting in biased estimates of genotype frequencies using only called samples
• Differential bias between cases and controls is quite likely under some study designs
• Clayton et al (2005) observed general inflation of the test statistics for association in a study of 6,000 SNPs. Much of this was attributable to differential call rates between cases and controls

Statistical issues: 1. Choice of SNPs
Main criteria
• Call rate (missing data may not be missing completely at random)
• Accuracy (check using duplicates and families)
• Coverage (how well are ungenotyped SNPs covered through LD?)
• Cost and practicality

SNPs for CAD study
• 675,000 SNPs across the genome
  • 500k Affymetrix commercial chip
  • 175k Perlegen custom chip
• Affymetrix
  • 93-98% call rate quoted; 99.5-99.9% reproducibility
• Perlegen
  • Selected based on HapMap, to maximise capture of common variation
  • By adding these SNPs the coverage improves substantially

SNPs for melanoma study
• Considering 500k Affymetrix chip OR the Illumina 317k chip
• Illumina
  • Selected SNPs for good coverage
  • 99.9% call rate quoted; 99.9-100% reproducibility
• Pilot study currently underway in our consortium to compare call rate, reproducibility and evidence of Mendelian inconsistencies in 8 samples consisting of two nuclear families

2. Study design
• Two stage design for purposes of economy: SNPs showing no evidence of association in Phase 1 are dropped from Phase 2
• Analysis strategies:
  • Replication: Phase 1 is seen as hypothesis generating. Final analysis of Phase 2 adjusting for number of SNPs in Phase 2
  • Joint: Final analysis of whole data set adjusting for total number of SNPs
• Power is almost always greater if joint analysis is carried out (Skol et al, 2006)

Parameters determining power

Parameter                                      Melanoma study
Sample size                                    4000 cases, 4000 controls
Genetic model                                  GRR 1.3, multiplicative
Minor allele frequency                         0.01 to 0.5
% of individuals in Phase 1                    25
% of SNPs carried over to Phase 2              1 to 5
Number of SNPs (α)                             317,000 (1.6 x 10^-7)
Analysis strategy                              Joint or replication
LD between marker and disease polymorphisms
(coverage from SNPs)

[Figure: curves for One stage, Replication (5%) and Joint (5%) plotted against p (0 to 0.5); vertical axis 0 to 80. Multiplicative model with genotypic relative risk 1.3]
[Figure: curves for One stage, Replication (1%) and Joint (1%) plotted against p (0 to 0.5). Multiplicative model with genotypic relative risk 1.3]
[Figure: curves for One stage, Joint (5%) and Joint (1%) plotted against p (0 to 0.5); vertical axis 0 to 80. Multiplicative model with genotypic relative risk 1.3]

Choice of SNPs for Phase 2
• SNP selection based on evidence for disease association from phase 1 results
• Methods to date have assumed SNPs will be ranked in terms of test-statistic (P-values)
• Other factors to consider:
  • Effect size or minor allele frequency
  • Degree of LD with other “associated” SNPs
  • Prior expectations from external data (previous evidence of linkage/association, function of SNP)

3.
Quality control
• Call rate by SNP (in cases and controls)
• Call rate by sample
• Accuracy between duplicate samples
• Mendelian inconsistencies in families
• Hardy-Weinberg equilibrium in controls
Discard various SNPs and samples
Mark SNPs as “suspect”

Hardy-Weinberg Equilibrium
In a random mating population, genotype frequencies soon achieve equilibrium

Genotype    Frequency
AA          p^2
AC          2p(1-p)
CC          (1-p)^2

• Departures from HWE are tested using a χ^2 goodness-of-fit test or an exact test based on the distribution of the number of heterozygotes (AC) conditional on the observed number of copies of the minor allele
• One explanation of lack of equilibrium is genotype error

Computational issues
• Analysis needs to be well planned because of the time taken to do even simple analyses
• We have recently invested in specialist hardware and database software to manage the data and carry out simple analyses (BC Platforms in Finland)
• With 3000 subjects and 500,000 SNPs on each subject, HWE tests for all SNPs can be calculated in just over 2 hours; odds ratios and their confidence intervals in less than 6 hours
• Links from database to R where we can develop more complex customised analyses

4. Analysis of association
• For phase 1:
  • SNP-by-SNP analysis
  • Haplotype analysis (joint analysis of neighbouring correlated SNPs) may be considered too

Data for one SNP

            AA     AT     TT     Total
Cases       n12    n11    n10    n1*
Controls    n02    n01    n00    n0*
Total       n*2    n*1    n*0    n**

Pearson χ^2 test with 2 degrees of freedom (2 df)
Armitage test for trend in proportions (χ^2 with 1 df)
“Metastatistics” have been considered: e.g. both tests must pass some threshold

Final analysis
• SNP-by-SNP analysis
• Haplotype analysis
• Since distinct genes may not influence disease risk independently we plan to also develop more complex models
• Marchini et al (2005) suggest that analysis of all 2-locus models may increase power, despite penalties of multiple testing
• Alternative approach is to focus on specific hypotheses based on biological pathways
• Gene-gene interaction
• Gene-environment interaction (UV exposure)

5. Interpretation of results
• Non-trivial because of the large number of tests. Further complications are (i) the two-stage design and (ii) correlation between tests
• Calculation of empirical p-values is computationally too expensive
• We have assumed that, to achieve a type 1 error rate of 0.05, a significance level of 0.05/m is used at the final analysis, where m is the total number of markers
• Wang et al (2006) show that the nominal significance level for the final analysis should be slightly less stringent to allow for the fact that markers have passed stage 1

Correlations between tests
• Dudbridge and Koeleman (2004) developed procedures relying on less computationally expensive permutation that are based on a combined statistic (sum of k best –log P values) from each permutation
• Extreme value distribution is fitted to these statistics
• Observed statistic is compared with distribution under null

Cautionary tales
• “Best” results may be artefactual (Clayton et al, 2005)
• Need for replication on a different genotyping platform and eventually functional studies
• Parkinson’s Disease WGA study (Maraganore et al, 2005) found one SNP significant at P = 7.6 x 10^-6. Several research groups have since tried and failed to replicate this and other findings from this study

Conclusions and prospects
• Despite the challenges, there is optimism that WGA studies will represent a step forward in understanding disease aetiology
• WGAs will provide a resource for development and investigation of more complex disease models
• Current studies do not really cover the whole genome
• Eventually may be able to sequence cheaply – more statistical challenges!
• Rare variants not covered
• Power low to detect small effects

Acknowledgements
Genomel: Julia Newton-Bishop
WTCCC: Alistair Hall, Nilesh Samani, Stephen Ball
Genetic Epidemiology: Tim Bishop
Statistics group: Mark Iles
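
As a rough illustration of two calculations mentioned in the slides (the Hardy-Weinberg goodness-of-fit test listed under quality control, and the 0.05/m significance level assumed for the final analysis), here is a minimal Python sketch. The function names and the genotype counts are hypothetical and not from the talk; only the figure of 317,000 SNPs is taken from the melanoma study slide. It is a sketch under these assumptions, not the analysis pipeline actually used, and it covers the chi-squared version of the HWE test only, not the exact test.

```python
import math


def hwe_chi_square(n_aa, n_ac, n_cc):
    """Chi-squared goodness-of-fit test for Hardy-Weinberg equilibrium.

    Takes observed genotype counts (AA, AC, CC) and returns a pair:
    (chi-squared statistic, p-value on 1 degree of freedom).
    """
    n = n_aa + n_ac + n_cc
    if n == 0:
        raise ValueError("no genotypes observed")
    # Frequency of allele A estimated from the observed genotypes.
    p = (2 * n_aa + n_ac) / (2 * n)
    if p == 0.0 or p == 1.0:
        raise ValueError("SNP is monomorphic; the HWE test is undefined")
    # Expected counts under HWE: p^2, 2p(1-p), (1-p)^2, each times n.
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ac, n_cc)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 df, which equals
    # the two-sided tail of a standard normal at sqrt(chi2).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni_level(alpha, m):
    """Per-test significance level alpha/m for m tests."""
    if m <= 0:
        raise ValueError("m must be a positive number of tests")
    return alpha / m


if __name__ == "__main__":
    # Hypothetical control genotype counts with too few heterozygotes.
    chi2, p_value = hwe_chi_square(40, 20, 40)
    print("HWE chi-squared:", chi2, "p-value:", p_value)
    # 317,000 SNPs, as on the melanoma study slide.
    print("Per-SNP significance level:", bonferroni_level(0.05, 317_000))
```

With 317,000 SNPs the per-SNP level comes out at about 1.6 x 10^-7, which agrees with the value quoted on the melanoma power slide.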