An introduction to sample size and power
Department of Statistics,
University of Calcutta.
1st December, 2009.
Sample 1: 99 64 91 115 101
Sample 2: 119 116 97 126 114
True difference in population means is 5
Two Sample t-test
t = 2.1294, df = 8, p-value = 0.06586
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval= [-1.7, 43.2]
Now let's repeat this experiment 100 times
In 92 out of 100
repetitions, we conclude
that there is no
difference in sample
Power for comparison of 2 means.
mu1 = 110
mu2 = 115
sd1 = 20
sd2 = 20
alpha = 0.05
power = 0.059
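The repeated experiment can be imitated with a small simulation. The sketch below is illustrative only: it assumes normally distributed data, 5 observations per group (the size of the two samples shown earlier), and a hard-coded two-sided 5% critical value of 2.306 for Student's t with 8 degrees of freedom. The function names and the seed are hypothetical choices, not part of the original slides.

```python
import random
import statistics

# Two-sided 5% critical value of Student's t with 8 degrees of freedom.
# Valid only for two groups of 5 observations each (assumption).
T_CRIT_DF8 = 2.306


def pooled_t(x, y):
    """Pooled-variance two-sample t statistic for mean(y) - mean(x)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    std_err = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(y) - statistics.mean(x)) / std_err


def simulate_rejection_rate(mu1=110, mu2=115, sd=20, n=5, reps=2000, seed=1):
    """Fraction of simulated experiments in which H0 (no difference) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu1, sd) for _ in range(n)]
        y = [rng.gauss(mu2, sd) for _ in range(n)]
        if abs(pooled_t(x, y)) > T_CRIT_DF8:
            rejections += 1
    return rejections / reps


rate = simulate_rejection_rate()
print(f"Estimated power: {rate:.3f}")
```

The estimated rejection rate is a Monte Carlo estimate of the power, so it varies with the seed and the number of repetitions. With samples this small it should stay low, of the same order as the power listed above.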
Sample 1: 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1
Sample 2: 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0
True population Odds Ratio (OR) = 1.5
The 95% confidence interval for the OR is [0.06, 3.8]
Repeat the experiment 100 times
In 71 out of 100
repetitions, we conclude that the
population OR is 1.
Test of hypothesis
“..much confusion may arise when a word in common
use is also given a technical meaning. Statistics abounds
in such terms, including normal, random, variance,
Altman & Bland; BMJ 1999;318:1667
(19 June).
Variable: Information recorded about a sample of
Parameter: does not relate to actual measurements or
attributes but to a quantity defining a theoretical model.
In green = Histogram showing distribution of
measurements of serum albumin in 481 white men.
In red = Density showing the normal distribution which
fits the data most closely.
Test of hypothesis
A rule for deciding, based on the observed sample,
whether the population parameter assumes a certain
Tests of hypothesis
H0: The mean serum albumin among white males aged over 30 is 40.
Ha: The mean serum albumin among white males aged over 30 is 48.
H0: The proportion of low birth weight babies in rural India is 20%.
Ha: The proportion of low birth weight babies in rural India is 40%.
H0: The OR for osteoporosis among women as compared to men is 1.0.
Ha: The OR for osteoporosis among women as compared to men is 2.0.
Parameter = a single proportion
Health workers wish to determine whether the rate of
neonatal tetanus is decreasing.
What sample size is necessary to test the null hypothesis
that the population proportion is 0.15 at the 0.05 level if
it is desired to have a 90% probability of detecting a
decrease to a rate of 100 per thousand if that were the
Prob[ test correctly detects decrease| proportion is 0.1,
Type I error = 0.05] = 0.9
$$n = \frac{\left\{1.645\sqrt{0.15(0.85)} + 1.282\sqrt{0.10(0.90)}\right\}^2}{(0.05)^2} \approx 377.9$$
Hence we see that a total sample size of 378 live births
would be necessary.
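The calculation above can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual normal-approximation formula for a one-sided test of a single proportion. The function name `n_one_proportion` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_one_proportion(p0, pa, alpha=0.05, power=0.90):
    """Sample size for a one-sided test of H0: P = p0 against Ha: P = pa
    (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # about 1.645 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)        # about 1.282 for power = 0.90
    numerator = (z_alpha * sqrt(p0 * (1 - p0))
                 + z_beta * sqrt(pa * (1 - pa))) ** 2
    return numerator / (p0 - pa) ** 2


n = n_one_proportion(p0=0.15, pa=0.10)
print(ceil(n))  # rounded up to the next whole live birth
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.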
For more details:
References: Dixon and Massey (1983),
Lemeshow et al. (1990),
Books containing sample size tables are available e.g.
Machin and Campbell (1987);
Machin et al. (1997);
Lemeshow et al. (1990).
Commercial and public domain software available.
R Documentation for binom.confint
The following methods are allowed for constructing the confidence interval(s):
Exact - Clopper-Pearson method.
Asymptotic - using the Central Limit Theorem.
agresti-coull - Agresti-Coull method.
Wilson - Wilson method.
prop.test - equivalent to prop.test(x = x, n = n, conf.level = conf.level)$conf.int.
Bayes - see binom.bayes.
Logit - see binom.logit.
Cloglog - see binom.cloglog.
Probit - see binom.probit.
Profile - see binom.profile.
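To make two of the listed methods concrete, the sketch below computes the asymptotic (Wald) and Wilson intervals by hand in Python. It is an illustration of the textbook formulas, not a reimplementation of `binom.confint`. The function names are hypothetical, and the example counts (8 successes out of 15) are simply the number of ones in Sample 1 of the binary example above.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(x, n, conf_level=0.95):
    """Asymptotic interval: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p = x / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def wilson_interval(x, n, conf_level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


print(wald_interval(8, 15))
print(wilson_interval(8, 15))
```

The Wald interval collapses to a single point when x = 0 or x = n, whereas the Wilson interval does not, which is one reason the methods can disagree for small samples or extreme proportions.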
Parameter = Relative Risk
Two competing therapies for a particular cancer are to
be evaluated in a multi-center clinical trial. Patients
are randomized to either treatment A or B and are
followed for recurrence of disease for five years following
How many patients should be studied in each of the two
arms of the trial in order to have 90% power to reject
H0 : RR = 1 in favor of the alternative RR = 0.5, if the
test is to be performed at the two-sided α = 0.05 level
and it is assumed that the probability of recurrence in
the placebo group is 0.35?
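One way to approach this question is the normal-approximation formula for comparing two independent proportions, with a pooled variance under the null hypothesis. The Python sketch below assumes that formula and equal allocation. The function name is hypothetical, and the second recurrence probability is derived as RR × 0.35.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided test of H0: p1 = p2
    (normal approximation, pooled variance under H0, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (null_term + alt_term) ** 2 / (p1 - p2) ** 2


p_reference = 0.35                  # assumed recurrence probability in the reference arm
relative_risk = 0.5                 # alternative hypothesis RR
p_other = relative_risk * p_reference

n_rr = n_per_group_two_proportions(p_reference, p_other)
print(ceil(n_rr))
```

Other formulas (for example with a continuity correction) give somewhat larger numbers, so the result is best read as a lower planning figure.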
The efficacy of BCG vaccine in preventing childhood
tuberculosis is in doubt and a study is designed to
compare the immunization coverage rates in a group of
tuberculosis cases compared to a group of controls.
Available information indicates that roughly 30% of the
controls are not vaccinated, and we wish to have an
80% chance of detecting whether the odds ratio
is significantly different from 1 at the 5% level.
If an odds ratio of 2 would be considered an important
difference between the two groups, how large a sample
should be included in each study group?
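A possible sketch of this calculation in Python: the exposure proportion among cases is derived from the odds ratio and the exposure proportion among controls, and the two proportions are then compared with a normal-approximation formula. "Exposure" here means not being vaccinated. The function names are hypothetical, and the pooled-variance form of the formula is an assumption. Textbook variants that use a different null variance term give somewhat different answers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def exposure_in_cases(odds_ratio, p_controls):
    """Exposure proportion among cases implied by the OR and the control proportion."""
    return odds_ratio * p_controls / (odds_ratio * p_controls + (1 - p_controls))


def n_per_group_case_control(odds_ratio, p_controls, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test of H0: OR = 1
    (normal approximation, pooled variance under H0, equal group sizes)."""
    p_cases = exposure_in_cases(odds_ratio, p_controls)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_cases + p_controls) / 2
    null_term = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * sqrt(p_cases * (1 - p_cases)
                             + p_controls * (1 - p_controls))
    return (null_term + alt_term) ** 2 / (p_cases - p_controls) ** 2


p_cases = exposure_in_cases(odds_ratio=2.0, p_controls=0.30)
n_or = n_per_group_case_control(odds_ratio=2.0, p_controls=0.30)
print(round(p_cases, 3), ceil(n_or))
```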
References: Dixon and Massey (1983), Lemeshow et al.
(1990), Fleiss (1981), Lachin (1981).
Books containing sample size tables are available e.g. Machin
and Campbell 1987; Machin et al. 1997; Lemeshow et al.
Commercial and public domain software is available for sample
May be based on normal approximation or Fisher's exact test
May require variance stabilisation,
May require continuity corrections for values near 0 or 1 (or for
small sample sizes),
For a fixed total size, power will tend to be higher if sample sizes
are equal.
Sample size calculations for the difference between two correlated
proportions are based on the McNemar test.
Parameter = Difference in mean values
A two-group, randomized trial is planned in elderly females
after hip fracture.
The outcome variable will be change in hematocrit level during
The sample sizes in the two groups will be equal.
A 5% level two-sided t test.
Pilot data suggests that the standard deviation for change
will be about 2.0%
It would be of interest to detect a difference of 2.2% in the
changes observed in placebo and treated groups.
What sample size in each group would be required to achieve a
power of 90% ?
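A rough answer can be obtained from the normal-approximation formula n = 2(z_{1-α/2} + z_{1-β})² σ² / δ² per group. The Python sketch below assumes that formula and a common standard deviation. The function name is hypothetical. Because it uses normal rather than t quantiles, it slightly understates the sample size an exact t-test calculation would give.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(delta, sd, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided comparison of two means
    (normal approximation, common standard deviation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2


n_means = n_per_group_two_means(delta=2.2, sd=2.0)
print(ceil(n_means))
```

In practice one or two subjects per group are often added to this figure to allow for the t distribution, before any inflation for dropout.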
Unequal variances: When the standard deviations in the
two groups are markedly unequal, the usual t test with
pooled variances is no longer the appropriate test.
Eg square root, log, Box-Cox
Use if there is a pattern to the inequality
(eg if groups with higher means have higher sds)
If transformation does not solve the problem, it is
possible that comparison of means is not the most
If it is, a two-sample t-test appropriate for a
Behrens-Fisher situation may be used.
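One common choice for the Behrens-Fisher situation is Welch's t-test, which avoids pooling the variances and uses the Satterthwaite approximation for the degrees of freedom. The Python sketch below computes only the statistic and the approximate degrees of freedom. Naming Welch's test is an assumption about which procedure is meant, and the two samples from the opening example are reused purely as input data.

```python
import statistics


def welch_t(x, y):
    """Welch's t statistic for mean(y) - mean(x), with Satterthwaite degrees of freedom."""
    vx = statistics.variance(x) / len(x)   # squared standard error of mean(x)
    vy = statistics.variance(y) / len(y)   # squared standard error of mean(y)
    t_stat = (statistics.mean(y) - statistics.mean(x)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t_stat, df


sample_1 = [99, 64, 91, 115, 101]
sample_2 = [119, 116, 97, 126, 114]
t_stat, df = welch_t(sample_1, sample_2)
print(round(t_stat, 2), round(df, 2))
```

The degrees of freedom are generally not an integer and are never larger than n1 + n2 - 2, so the test is somewhat more conservative than the pooled version when the variances differ.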
If non-normality is an issue,
Plan a large study
Use a non-parametric procedure instead, such as the
two-sample Mann-Whitney/Wilcoxon rank test.
Logistic Regression with a single
continuous risk factor
About 30% of patients with blocked arteries followed for a
year will have renewed blockage = “restenosis”.
A study is to be planned to assess the effect of serum
cholesterol on the likelihood of restenosis.
Based on the prior results from a screening trial, mean
serum cholesterol in middle-aged males is about 210
One standard deviation above the mean is approximately
In the screening study, the OR for the six-year death rate
for these two cholesterol levels was about 1.5. The study
should be large enough to detect an effect of serum
cholesterol on arterial restenosis of a size similar to that
seen for death rate.
Logistic regression with a single continuous risk factor
We plan to conduct the test of the predictive effect of
cholesterol level on the probability of restenosis using a
5% two-sided test and want to have 90% power to
detect an odds ratio of 1.5 for values of cholesterol of
250 mg/dL versus 210 mg/dL.
We set the effect size, δ =|μ1 − μ2|/σ = 0.405.
The ratio of sample sizes expected to be in the
no-restenosis versus the restenosis groups, r, equals
0.7/0.3 = 2.333.
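With an effect size and an allocation ratio, the problem reduces to a two-group comparison of means with unequal group sizes. The Python sketch below assumes the normal-approximation formula for that comparison, with the effect size taken as ln(1.5) ≈ 0.405. The function name is hypothetical, and software based on the t distribution will give a slightly larger total.

```python
from math import log
from statistics import NormalDist


def total_n_unequal_groups(effect_size, ratio, alpha=0.05, power=0.90):
    """Total sample size for a two-sided two-group comparison of means with
    standardized effect size `effect_size` and group-size ratio `ratio`
    (larger group / smaller group), using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_smaller = (1 + 1 / ratio) * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return n_smaller * (1 + ratio)


effect_size = log(1.5)      # log odds ratio per standard deviation of cholesterol
ratio = 0.7 / 0.3           # no-restenosis versus restenosis
total_n = total_n_unequal_groups(effect_size, ratio)
print(round(total_n, 1))
```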
Variance Inflation Factor
Adjusting sample size for multiple risk factors
Precise sample size calculations require precise
quantitative information about the
interdependence structure between the
We can, however, use a “variance inflation
factor” to adjust the sample size for the single
Variance Inflation Factor
If two other covariates with a squared multiple
correlation with cholesterol of 0.15 are to be entered into
the logistic regression
Multiply the sample size obtained for a single covariate
by the variance inflation factor 1/(1 − 0.15) = 1.18, to
increase the required sample size to 365.
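The adjustment is a one-line calculation. In the Python sketch below the function names are hypothetical, and the single-covariate sample size of 310 is an assumption inferred from the figures above (365 / 1.18 ≈ 310), not a value stated in the source.

```python
from math import ceil


def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - R^2), where R^2 is the squared multiple correlation of the
    covariate of interest with the other covariates in the model."""
    return 1 / (1 - r_squared)


def inflate_sample_size(n_single_covariate, r_squared):
    """Sample size after multiplying by the variance inflation factor, rounded up."""
    return ceil(n_single_covariate * variance_inflation_factor(r_squared))


n_single = 310  # assumed single-covariate sample size (inferred, see lead-in)
print(round(variance_inflation_factor(0.15), 2))
print(inflate_sample_size(n_single, 0.15))
```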
The design effect
In reality we use more complex survey designs such as cluster
New sample size = sample size under SRS × “Design effect”
“Design effect” = 1 + d(n − 1),
where d = intraclass correlation for the statistic in question
and n = the average size of the cluster
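As a small numerical illustration of the design effect, the Python sketch below inflates a simple-random-sampling sample size for a clustered design. The intraclass correlation of 0.05 and the cluster size of 20 are hypothetical values chosen only for the example, the starting figure of 378 is borrowed from the single-proportion example, and the function names are not from the source.

```python
from math import ceil


def design_effect(intraclass_corr, avg_cluster_size):
    """Design effect = 1 + d * (n - 1) for clusters of average size n."""
    return 1 + intraclass_corr * (avg_cluster_size - 1)


def clustered_sample_size(n_srs, intraclass_corr, avg_cluster_size):
    """Sample size under cluster sampling: SRS sample size times the design effect."""
    return ceil(n_srs * design_effect(intraclass_corr, avg_cluster_size))


deff = design_effect(intraclass_corr=0.05, avg_cluster_size=20)   # hypothetical inputs
n_cluster = clustered_sample_size(378, 0.05, 20)
print(round(deff, 2), n_cluster)
```

Even a small intraclass correlation can nearly double the required sample size when clusters are moderately large, which is why the design effect should be estimated from comparable surveys whenever possible.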
Measurement error and sample size
THE IMPACT OF DIETARY MEASUREMENT ERROR ON PLANNING
SAMPLE SIZE REQUIRED IN A COHORT STUDY
FREEDMAN, L.S., SCHATZKIN, A. and WAX, Y. (1990), AJE, 132, 1185-1195.
Dietary measurement error has two consequences relevant to epidemiologic studies: first,
a proportion of subjects are misclassified into the wrong groups, and second, the
distribution of reported intakes is wider than the distribution of true intakes. While the
first effect has been dealt with by several other authors, the second effect has not
received as much attention. Using a simple errors-in-measurement model, the authors
investigate the implications of measurement error for the distribution of fat intake. They
then show how the inference of a more narrow distribution of true intakes affects the
calculation of sample size for a cohort study. The authors give an example of the
calculation for a cohort study investigating dietary fat and colorectal cancer. This shows
that measurement error has a profound effect on sample size, requiring a six- to
eightfold increase over the number required in the absence of error, if the
correlation coefficient between reported and true intakes is 0.65. Reliable detection of a
relative risk of 1.36 between a true intake of greater than 47.5% calories from fat and less
than 25% calories from fat would require approximately one million subjects.
Resource: Sample size calculator at biostat.hitchcock.org
Resources in R
Available from http://cran.r-project.org/
pwr: power and sample size calculations following Cohen (1988).
asypow: power utilizing asymptotic Likelihood Ratio Methods
bayescount: Bayesian power calculations for count distributions data
normalp: Package for exponential power distributions
pamm: Power analysis for random effects in mixed models
binomSamSize: Confidence intervals and sample size determination
for a binomial proportion under simple random sampling and pooled
pairwiseCI: Confidence intervals for two sample comparison
MBESS: sample size calculations for behavioural models obtained by
setting the width of the confidence intervals
epiR, epicalc, powerSurvEpi: sample size calculations for a variety of
survey: Analysis of complex surveys
Hmisc, TeachingDemos: Sample size calculation and visual tools to
illustrate associated concepts
Genetic power calculators
Purcell S, Cherny SS, Sham PC. (2003) Genetic Power
Calculator: design of linkage and association genetic
mapping studies of complex traits. Bioinformatics,
Sample size calculator at
For complex study designs or statistical methods, there
may be no easily applied formulae or software.
Use simplifications of the design
Investigate whether the sample size is adequate for
evaluation of secondary outcomes
analyses of pre-defined subsets.
Sample size values obtained from software will need to
be inflated to allow for dropout or loss to follow up.
All power calculations should be accompanied by
Prospective vs retrospective analysis
Prospective power analysis is exploratory in nature.
Retrospective analysis = After the study, we may be concerned
that the statistical power of the test was low
Question: Should additional information (particularly the
observed effect size and variance) be used to retrospectively
calculate the power of the test?
Thomas, L. (1997) Retrospective power analysis. Conservation Biology, 11, 276–280.
Different methods may lead to different conclusions.
It is unfortunate that this kind of power analysis is readily available
in statistical software packages.
Retrospective analyses are no substitute for the proper
planning of research.
Why perform/ report formal sample size
Small sample size
Does not imply bias
Will manifest itself as large confidence intervals and lack of
Sample size calculations are important
Guarantees adequate precision
First, they specify the primary endpoint
Safeguards against changing outcomes and claiming “significant”
An alert for potential problems.
Did the trial encounter recruitment difficulties?
Did the trial stop early?
Was a formal statistical stopping rule used?