An introduction to sample size and power calculations
Bhaswati Ganguli, Department of Statistics, University of Calcutta. 1st December, 2009.

An example
Sample 1: 99 64 91 115 101
Sample 2: 119 116 97 126 114
True difference in population means is 5.
Two Sample t-test: t = 2.1294, df = 8, p-value = 0.06586
Alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval = [-1.7, 43.2]

Now let's repeat this experiment 100 times
In 92 out of 100 repetitions, we conclude that there is no difference in population means.

Power for comparison of 2 means
mu1 = 110, mu2 = 115
sd1 = 20, sd2 = 20
n1 = 5, n2 = 5
alpha = 0.05
power = 0.059

Another example
Sample 1: 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1
Sample 2: 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0
True population Odds Ratio (OR) = 1.5
The 95% confidence interval for the OR is [0.06, 3.8]

Repeat the experiment 100 times
In 71 out of 100 repetitions, we conclude that the population OR is 1.

Some prerequisites
Parameter
Test of hypothesis
Power

Parameter
“..much confusion may arise when a word in common use is also given a technical meaning. Statistics abounds in such terms, including normal, random, variance, significant, etc.” Altman & Bland, BMJ 1999;318:1667 (19 June).
Variable: information recorded about a sample of individuals.
Parameter: a quantity that does not relate to actual measurements or attributes but defines a theoretical model.
Figure: in green, a histogram showing the distribution of measurements of serum albumin in 481 white men; in red, the density of the normal distribution which fits the data most closely.

Test of hypothesis
A rule for deciding, based on the observed sample, whether the population parameter assumes a certain specified value.

Tests of hypothesis
H0: The mean serum albumin among white males aged over 30 is 40.
Ha: The mean serum albumin among white males aged over 30 is 48.
H0: The proportion of low birth weight babies in rural India is 20%.
Ha: The proportion of low birth weight babies in rural India is 40%.
H0: The OR for osteoporosis among women as compared to men is 1.0.
Ha: The OR for osteoporosis among women as compared to men is 2.0.

Parameter = A single proportion
Health workers wish to determine whether the rate of neonatal tetanus is decreasing. What sample size is necessary to test the null hypothesis that the population proportion is 0.15 at the 0.05 level, if it is desired to have a 90% probability of detecting a decrease to a rate of 100 per thousand if that were the true proportion?
Prob[test correctly detects decrease | proportion is 0.1, Type I error = 0.05] = 0.9

$$n = \frac{\left\{1.645\sqrt{0.15(0.85)} + 1.282\sqrt{0.10(0.90)}\right\}^{2}}{(0.05)^{2}} = 377.90$$

Here 1.645 and 1.282 are the standard normal quantiles for a one-sided 5% test and 90% power, and 0.05 = 0.15 - 0.10 is the decrease to be detected. Hence we see that a total sample size of 378 live births would be necessary.

For more details
References: Dixon and Massey (1983), Lemeshow et al. (1990), Fleiss (1981), Lachin (1981).
Books containing sample size tables are available, e.g. Machin and Campbell (1987); Machin et al. (1997); Lemeshow et al. (1990).
Commercial and public domain software is available.

R Documentation for binom.confint
Nine methods are allowed for constructing the confidence interval(s):
Exact - Pearson-Klopper method (more commonly called Clopper-Pearson).
Asymptotic - using the Central Limit Theorem.
agresti-coull - Agresti-Coull method.
Wilson - Wilson method.
prop.test - equivalent to prop.test(x = x, n = n, conf.level = conf.level)$conf.int.
Bayes - see binom.bayes.
Logit - see binom.logit.
Cloglog - see binom.cloglog.
Probit - see binom.probit.
Profile - see binom.profile.

Parameter = Relative Risk
Two competing therapies for a particular cancer are to be evaluated in a multi-center clinical trial. Patients are randomized to either treatment A or B and are followed for recurrence of disease for five years following treatment.
How many patients should be studied in each of the two arms of the trial in order to have 90% power to reject H0: RR = 1 in favor of the alternative RR = 0.5, if the test is to be performed at the two-sided α = 0.05 level and it is assumed that the probability of recurrence in the placebo group = 0.35?

Parameter = Odds Ratio
The efficacy of BCG vaccine in preventing childhood tuberculosis is in doubt, and a study is designed to compare the immunization coverage rates in a group of tuberculosis cases with those in a group of controls. Available information indicates that roughly 30% of the controls are not vaccinated, and we wish to have an 80% chance of detecting whether the odds ratio is significantly different from 1 at the 5% level. If an odds ratio of 2 would be considered an important difference between the two groups, how large a sample should be included in each study group?

Additional Considerations
References: Dixon and Massey (1983), Lemeshow et al. (1990), Fleiss (1981), Lachin (1981).
Books containing sample size tables are available, e.g. Machin and Campbell (1987); Machin et al. (1997); Lemeshow et al. (1990).
Commercial and public domain software is available for sample size calculation.
Fine print:
May be based on the normal approximation or Fisher's exact test.
May require variance stabilisation.
May require continuity corrections for values near 0 or 1 (or for small sample sizes).
For a fixed total size, power will tend to be higher if sample sizes are equal.
Sample size calculations for the difference between two correlated proportions are based on the McNemar test.

Parameter = Difference in mean values
A two-group, randomized trial is planned in elderly females after hip fracture. The outcome variable will be change in hematocrit level during the study. The sample sizes in the two groups will be equal. The analysis will use a two-sided t test at the 5% level.
Pilot data suggest that the standard deviation for change will be about 2.0%. It would be of interest to detect a difference of 2.2% in the changes observed in placebo and treated groups. What sample size in each group would be required to achieve a power of 90%?

Issues
Unequal variances: when the standard deviations in the two groups are markedly unequal, the usual t test with pooled variances is no longer the appropriate test.
Transformations: e.g. square root, log, Box-Cox. Use if there is a pattern to the inequality (e.g. if groups with higher means have higher SDs).
If transformation does not solve the problem, it is possible that comparison of means is not the most appropriate method. If it is, a two-sample t-test appropriate for a Behrens-Fisher situation may be used.

Issues
If non-normality is an issue:
Plan a large study.
Consider transformations.
Use a non-parametric procedure instead, such as the two-sample Mann-Whitney/Wilcoxon rank test.

Logistic Regression with a single continuous risk factor
About 30% of patients with blocked arteries followed for a year will have renewed blockage (“restenosis”). A study is to be planned to assess the effect of serum cholesterol on the likelihood of restenosis.
Based on the prior results from a screening trial, mean serum cholesterol in middle-aged males is about 210 mg/dL; one standard deviation above the mean is approximately 250 mg/dL. In the screening study, the OR for the six-year death rate for these two cholesterol levels was about 1.5.
The study should be large enough to detect an effect of serum cholesterol on arterial restenosis of a size similar to that seen for death rate.

Logistic regression with a single continuous covariate
We plan to test the predictive effect of cholesterol level on the probability of restenosis using a 5% two-sided test, and want to have 90% power to detect an odds ratio of 1.5 for values of cholesterol of 250 mg/dL versus 210 mg/dL.
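One common way to size this kind of study is to treat cholesterol as roughly normal, with a common standard deviation, in the restenosis and no-restenosis groups, and then size a two-group comparison of means. Under that assumption the log odds ratio per standard deviation, ln(1.5), plays the role of the standardized difference in means. The sketch below uses a normal approximation; the function name and the rounding rule are illustrative choices, not taken from the slides.

```python
from math import ceil, log
from statistics import NormalDist


def two_group_sizes(delta, ratio, alpha=0.05, power=0.90):
    """Normal-approximation group sizes for a two-sided comparison of two means.

    delta: standardized difference |mu1 - mu2| / sigma
    ratio: size of the larger group divided by size of the smaller group
    Returns (n_smaller, n_larger), each rounded up.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    # Var(difference in means) = sigma^2 * (1 + 1/ratio) / n_smaller
    n_smaller = (1 + 1 / ratio) * z_sum ** 2 / delta ** 2
    return ceil(n_smaller), ceil(ratio * n_smaller)


# Assumption: OR of 1.5 per one standard deviation of cholesterol (250 vs 210 mg/dL)
delta = log(1.5)
# About 30% of patients are expected to have restenosis, 70% not
ratio = 0.7 / 0.3

n_restenosis, n_no_restenosis = two_group_sizes(delta, ratio)
print(f"delta = {delta:.3f}, ratio = {ratio:.3f}")
print(f"restenosis: {n_restenosis}, no restenosis: {n_no_restenosis}")
```

A normal approximation tends to give a slightly smaller total than a calculation based on the t distribution, so dedicated software may report a somewhat larger number.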
We set the effect size, δ = |μ1 − μ2|/σ = ln(1.5) ≈ 0.405. The ratio of sample sizes expected to be in the no-restenosis versus the restenosis groups, r, equals 0.7/0.3 = 2.333.

Variance Inflation Factor
Adjusting sample size for multiple risk factors and confounders.
Precise sample size calculations require precise quantitative information about the interdependence structure between the covariates.
We can, however, use a “variance inflation factor” to adjust the sample size for the single covariate case.

Variance Inflation Factor
If two other covariates with a squared multiple correlation with cholesterol of 0.15 are to be entered into the logistic regression, multiply the sample size obtained for a single covariate by the variance inflation factor 1/(1 − 0.15) = 1.18. This increases the required sample size to 365.

The design effect
In reality we use more complex survey designs such as cluster sampling.
New sample size = sample size under simple random sampling (SRS) × “design effect”
“Design effect” = 1 + d(n − 1), where
d = intraclass correlation for the statistic in question
n = the average size of the cluster

Measurement error and sample size
Freedman, L.S., Schatzkin, A. and Wax, Y. (1990). The impact of dietary measurement error on planning sample size required in a cohort study. AJE, 132, 1185-1195.
Dietary measurement error has two consequences relevant to epidemiologic studies: first, a proportion of subjects are misclassified into the wrong groups, and second, the distribution of reported intakes is wider than the distribution of true intakes. While the first effect has been dealt with by several other authors, the second effect has not received as much attention. Using a simple errors-in-measurement model, the authors investigate the implications of measurement error for the distribution of fat intake. They then show how the inference of a more narrow distribution of true intakes affects the calculation of sample size for a cohort study.
The authors give an example of the calculation for a cohort study investigating dietary fat and colorectal cancer. This shows that measurement error has a profound effect on sample size, requiring a six- to eightfold increase over the number required in the absence of error. If the correlation coefficient between reported and true intakes is 0.65, reliable detection of a relative risk of 1.36 between a true intake of greater than 47.5% calories from fat and less than 25% calories from fat would require approximately one million subjects.

Resource: Sample size calculator at biostat.hitchcock.org

Resources in R
Available from http://cran.r-project.org/
pwr: power and sample size calculations following Cohen (1988)
asypow: power calculations utilizing asymptotic likelihood ratio methods
bayescount: Bayesian power calculations for count data using MCMC
normalp: package for exponential power distributions
pamm: power analysis for random effects in mixed models
binomSamSize: confidence intervals and sample size determination for a binomial proportion under simple random sampling and pooled sampling
pairwiseCI: confidence intervals for two-sample comparisons
MBESS: sample size calculations for behavioural models, obtained by setting the width of the confidence intervals
epiR, epicalc, powerSurvEpi: sample size calculations for a variety of epidemiological designs
survey: analysis of complex surveys
Hmisc, TeachingDemos: sample size calculation and visual tools to illustrate associated concepts

Genetic power calculators
Purcell S, Cherny SS, Sham PC. (2003) Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics, 19(1):149-150.
Sample size calculator at http://pngu.mgh.harvard.edu/~purcell/gpc/

Practical Issues
For complex study designs or statistical methods, there may be no easily applied formulae or software. In that case:
Use simplifications of the design.
Use simulation.
Investigate whether the sample size is adequate for evaluation of secondary outcomes and for analyses of pre-defined subsets.
Sample size values obtained from software will need to be inflated to allow for dropout or loss to follow-up.
All power calculations should be accompanied by sensitivity analysis.

Prospective vs retrospective analysis
Prospective power analysis is exploratory in nature.
Retrospective analysis: after the study, we may be concerned that the statistical power of the test was low.
Question: should additional information (particularly the observed effect size and variance) be used to retrospectively calculate the power of the test?
Thomas, L. (1997) Retrospective power analysis. Conservation Biology, 11, 276-280.
Different methods may lead to different conclusions.
It is unfortunate that this kind of power analysis is readily available in statistical software packages.
Retrospective analyses are no substitute for the proper planning of research.

Why perform/report formal sample size calculations?
Small sample size:
Does not imply bias.
Will manifest itself as wide confidence intervals and lack of significance.
Sample size calculations are important because they:
Guarantee adequate precision.
Specify the primary endpoint, which safeguards against changing outcomes and claiming “significant” results.
Act as an alert for potential problems: Did the trial encounter recruitment difficulties? Did the trial stop early? Was a formal statistical stopping rule used?
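As an illustration of the simulation approach listed under Practical Issues, the sketch below estimates power for the opening example (means 110 and 115, common SD 20, five observations per group) by repeating the experiment many times and counting how often a pooled two-sample t test rejects at the 5% level. The function names, seed, and number of repetitions are arbitrary choices made for this sketch, and the critical value is hard-coded for 8 degrees of freedom, so it only applies to two groups of five.

```python
import random
from math import sqrt
from statistics import mean, variance

# Two-sided 5% critical value of Student's t with 8 degrees of freedom (n1 = n2 = 5)
T_CRIT_DF8 = 2.306


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var * (1 / nx + 1 / ny))


def simulated_power(mu1, mu2, sd, n, t_crit, reps=20000, seed=2009):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu1, sd) for _ in range(n)]
        y = [rng.gauss(mu2, sd) for _ in range(n)]
        if abs(pooled_t(x, y)) > t_crit:
            rejections += 1
    return rejections / reps


power_estimate = simulated_power(110, 115, 20, n=5, t_crit=T_CRIT_DF8)
print(f"Estimated power: {power_estimate:.3f}")
```

The estimate should sit only slightly above the 5% significance level, in line with the low power (0.059) quoted in the opening example. Increasing n (with the matching critical value) shows how quickly power improves, which is one simple form of sensitivity analysis.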