Inference, Confidence Intervals,
Effect sizes, and Power
[Dilbert cartoon about a rigged poll: evidence of hacking into email accounts to fix the results; only six people voted; "In some countries you are only allowed to vote after a handful of rich people have paid lobbyists and created adverts that insult rodentia intellect."]
• What is inference?
• What are confidence intervals?
– How to make and report confidence intervals.
– A glance at bootstrapping (more in a couple of weeks)
• Touch upon hypothesis testing (more next week)
• Effect sizes (and this will continue)
• What is power?
– How to calculate and report power.
Interval versus Point Reporting
Inference: Point Estimates
• When we calculate the mean of a sample, we use that as an
estimate of the population mean μ.
• The Plug-in principle.
– Requires we believe the sample is representative of the population.
– Requires that the sample statistic is an unbiased (or at least
good) estimate of the population parameter.
• Some estimates are biased. The sample range underestimates the population range.
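To see that bias concretely, a small simulation sketch (the uniform population and the sample size of 20 are my own choices for illustration, not from the slides):

```r
# The plug-in principle works well for the mean, but the sample range is a
# biased (too small) estimate of the population range.
set.seed(42)
pop <- runif(100000, 0, 10)     # population range is essentially 10
ranges <- replicate(1000, {
  s <- sample(pop, 20)          # a small sample
  max(s) - min(s)               # the sample range
})
mean(ranges)                    # on average, clearly below 10
```

Each sample's range can never exceed the population's, so the average sample range sits below the population range.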
100,000 more Iraqi dead post invasion:
Roberts et al. (2004)
– Timing??? (Thursday before US election)
Geneva Convention says an occupying force has responsibilities. A US
General says "we don't do body counts". The authors argue it can be
done (they did it in 4 weeks with 7 people) and is necessary.
• Travelling was important to minimize! (GPS)
• 33 clusters of 30 houses. They chose the nearest 30 houses in each cluster, which
is probably not good.
• Their power analysis seems to assume non-clustered sample.
[Figure: deaths by cause, before and after the invasion. Sorry the labels are small; the point is there is more red and dark blue after the invasion. Violent deaths up.]
• But should be viewed in light of many methodological limitations.
– Authors discuss these
• The 100,000 estimate is 98,000 with a 95% CI from 8,000 to 194,000 (without
Falluja; with Falluja the lower bound of the confidence interval is lower).
• This band includes most other estimates.
• Ethical problems?
What does 8,000 to 194,000 mean?
It does NOT mean that
– There is a 95% probability the number of deaths is between those values.
It means that
– If you repeated the survey a billion times, and made lots of
assumptions, 95% of the time the true value would be in that range.
The philosophers say we can be 95% “confident”
– Whatever that means in this context???
(A billion: US and most English-speaking places (UK changed in 1974) -> 1,000,000,000.
Others (most of Europe, South America, Cuba, Mexico, etc.) -> 1,000,000,000,000.)
Constructing Confidence Intervals
• Population μ (pronounced mu)
• Estimate with sample mean ( ), the plug-in principle
• But with sampling error. Estimating the region which will usually include
the population mean.
CI 95% = x̄ ± t0.05 × se   (lots of assumptions)
• Need to know df. df = n - 1 for this test. Here 94 - 1 = 93.
• t0.05 is usually about 2, but you need to look it up in a t table.
Example: Newton's (1998) Hostility Data
• The mean on arrival for the 94 prisoners was 28.3 with a standard
deviation of 8.0.
• df = n-1 so 93, or about 90 for the t table
CI 95% = x̄ ± t0.05 × se
CI 95% = 28.3 ± 1.99 × 0.825 = 28.3 ± 1.6, or 26.7 to 29.9
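The slide's arithmetic can be checked in R; qt() gives the exact critical value the t table is approximating:

```r
# Newton's (1998) hostility data as reported above: mean 28.3, sd 8.0, n 94
m <- 28.3; s <- 8.0; n <- 94
se <- s / sqrt(n)                  # about 0.825
tcrit <- qt(0.975, df = n - 1)     # about 1.99 (df = 93)
c(m - tcrit * se, m + tcrit * se)  # roughly 26.7 to 29.9
```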
[Figure: t distribution with df = 93 and 2.5% in each tail]
What does having 95% CI of 28.3 ± 1.6 mean?
We expect that about 95% of confidence intervals made this way will
contain the population mean (μ).
This allows us to be fairly "confident" that the confidence interval we
calculate contains the population mean.
It is not that there is a 95% probability that μ is within the interval.
This is a tricky concept.
Confidence intervals are a fundamental tool for the frequentist
statistician. In the long run, you should be right (i.e., μ within the
interval) about 95% of the time.
(This is a tricky concept and will be revisited)
Plotting the precision of the estimate (confidence intervals) and the spread of
the distribution (standard deviations).
Both are in units of the original variable (here in years).
95% CI: x̄ ± t0.05 × se
[Figure: means with 95% CI bars and with sd bars; x axis: Years in prison, 0 to 30]
Examining the Difference between
Two Means for the Same Person
CI 95% = x̄diff ± t0.05 × se_diff, where x̄diff = mean(x1i − x2i) and se_diff = sd_diff/√n
• Difference in means ± t0.05 times standard error
• Standard error of the difference, using an estimate of the standard deviation of
the differences.
• Assumptions include that the difference is normally distributed (not the
individual scores ... for most tests the assumptions are about the
residuals). This came up in last week's journal question.
• Just calculate a variable for the difference, and perform the calculations as
you did before.
Brewed Awakenings: http://mybrewedawakening.com/
Data from 10 people's coffee preferences.
FRESHi  INSTANTi  DIFFi  DIFFi − x̄diff  (DIFFi − x̄diff)²
  5        3        2         1              1
  4        3        1         0              0
  6        5        1         0              0
  3        4       -1        -2              4
  4        4        0        -1              1
  5        3        2         1              1
  6        3        3         2              4
  3        3        0        -1              1
  5        3        2         1              1
  4        4        0        -1              1
Sum     45       35       10        0             14
Mean    4.5      3.5      1.0       0         sd = 1.25
(the DIFF variable is the one assumed normal)
CI 95% = 1.0 ± 2.26 × (1.25/√10) = 1.0 ± 0.89, or 0.11 to 1.89. Does this allow us to say anything else?
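Entering the table's ratings into R (the vectors just transcribe the FRESH and INSTANT columns), t.test(..., paired = TRUE) reproduces the hand calculation:

```r
fresh   <- c(5, 4, 6, 3, 4, 5, 6, 3, 5, 4)
instant <- c(3, 3, 5, 4, 4, 3, 3, 3, 3, 4)
t.test(fresh, instant, paired = TRUE)$conf.int  # about 0.11 to 1.89
# equivalently: t.test(fresh - instant)$conf.int
```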
Width of 95% CI (for n = 20):
for sd = 10, width = 9.33
for sd = 5, width = 4.66
for sd = 2, width = 1.87
for sd = 1, width = 0.93
[Figure: width of the 95% CI plotted against sample size, 0 to 100]
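The widths above can be reproduced with a one-line function (a sketch; small discrepancies with the slide's numbers come from which df the slide used):

```r
# Width of a 95% CI for a mean: 2 * t-critical * sd / sqrt(n)
ci_width <- function(sd, n) 2 * qt(0.975, n - 1) * sd / sqrt(n)
sapply(c(10, 5, 2, 1), ci_width, n = 20)  # about 9.36  4.68  1.87  0.94
```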
Confidence intervals for differences between groups
CI 95% = (x̄1 − x̄2) ± t0.05 × se_diff
pooled var = [(n1 − 1)var1 + (n2 − 1)var2] / [(n1 − 1) + (n2 − 1)]   (one of several ways to pool)
CI 95% = (x̄1 − x̄2) ± t0.05 × √(pooled var × (1/n1 + 1/n2))
So, more of a pain to calculate.
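A sketch with made-up data: in R, t.test(..., var.equal = TRUE) does the pooled-variance calculation above for you (the default, Welch's test, pools differently):

```r
# Two hypothetical groups (sizes and means are invented for illustration)
set.seed(1)
g1 <- rnorm(15, mean = 10, sd = 2)
g2 <- rnorm(12, mean = 8,  sd = 2)
t.test(g1, g2, var.equal = TRUE)$conf.int  # 95% CI for the difference in means
```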
How big is an effect?
• APA and all other science organizations stress the importance of
saying how large an effect is (when one is found).
• Difference in two means. Raw value. Useful.
• Correlation. Standardized. Also useful.
• Difference in means divided by some measure of spread.
Standardized. Also useful.
• In the coffee example: the standard deviation of the liking ratings, or of the differences.
• Lots of effect size measures for different situations. Many can be
transformed into a correlation-like measure, so many people like these.
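For the coffee data, one standardized effect size (a sketch; dividing the mean difference by the sd of the differences is one of the several reasonable choices the slide mentions):

```r
fresh   <- c(5, 4, 6, 3, 4, 5, 6, 3, 5, 4)
instant <- c(3, 3, 5, 4, 4, 3, 3, 3, 3, 4)
d <- mean(fresh - instant) / sd(fresh - instant)  # 1.0 / 1.25
d  # about 0.80, "large" by Cohen's guidelines
```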
They arrived (60.7 ± 6.9 (stat.) ± 7.4 (sys.)) ns faster than light!
(v − c)/c = (2.48 ± 0.28 (stat.) ± 0.30 (sys.)) × 10^-5
(c = 186,282 miles per second)
Calculating confidence intervals
In SPSS: Explore, and often as an option elsewhere. Similar in R (or as a
function), but often you just get the standard error (one reason it
helps to know the formula).
Lots of procedures print the confidence intervals or have
printing them as an option.
There used to be a single useful page that did lots. See
http://www.stat.tamu.edu/~jhardin/applets/ for several pages.
Maybe you can write R functions to do these?
How To Number 1
• Mathematics (which is what is built into SPSS)
• Computation - the bootstrap
Hypothesis Testing: The quest for p
• If p < .05 we are happy.
• Not a good philosophy of science, but how a lot of psychology (and other
disciplines) has been done.
“The almost universal reliance on merely refuting the null
hypothesis is a terrible mistake, is basically unsound, poor
scientific strategy, and one of the worst things that ever
happened in the history of psychology” (Meehl, 1978, p. 817).
• If H0 is true, 5% of the time we would reject it. This is called a Type 1 error.
• H0 is always false, so not really sure what the point of it is (more next week).
Power: 1 - β
                          State of the World
Decision             H0 true            H0 false
don't reject H0      correct            Type 2 error
reject H0            Type 1 error       correct (power)
Probability of making a Type 2 error is conditional on the effect being a
certain size. Denoted β.
1- β is power. Convention to aim for is 80%.
Need to know the size of effect that you want to detect. Most use past
research (recommended) or Cohen's guidelines (this is wrong!).
A few ways to do it
Simulation. Set up a model of the smallest effect you want to detect,
and use "sample"
General stats programs. SPSS/PASW has an add-on (and syntax),
R has a few functions.
Cohen's tables. Discussed later.
G*Power (or other specialist programs)
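A minimal sketch of the simulation approach (the scenario, a d = 0.5 effect with 64 people per group, is my choice for illustration):

```r
# Simulate many experiments at the smallest effect worth detecting, and
# count how often p < .05: that proportion estimates the power.
set.seed(7)
pvals <- replicate(2000, {
  g1 <- rnorm(64, mean = 0.5)  # sd = 1, so the true effect is d = 0.5
  g2 <- rnorm(64, mean = 0)
  t.test(g1, g2)$p.value
})
mean(pvals < .05)  # estimated power, close to 80%
```

The slides use "sample" to resample existing data; rnorm here plays the same role of generating data under an assumed model.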
G*Power (Erdfelder, Faul, & Buchner, 1996, and later versions)
Lots of software out there.
In R: power.t.test, fpower, ...
People have written SPSS
syntax for power,
but it is not easy to use.
• For a t test, a medium-sized effect is d = 0.50.
• Small is d = 0.20 and large is d = 0.80.
• If the minimum difference worth detecting is 0.50, you need
128 people in your sample to give you an 80% probability of
detecting this difference (p < .05).
• For a small effect size you need 784.
• Many surveys have shown that often the power is too low!
Medium: 64 people in each group. 128 total.
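These numbers can be checked with power.t.test in base R (totals differ slightly from the slide because of rounding and table granularity; n here is per group):

```r
power.t.test(delta = 0.5, sd = 1, power = 0.8)  # n = 63.8 -> 64 per group, 128 total
power.t.test(delta = 0.2, sd = 1, power = 0.8)  # n = 393.4 -> 394 per group
```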
Is it really that easy?
Yes and no
The computations are a little tricky but looking up in the
tables is easy.
Understanding what to do, and getting adequate sample
sizes, is sometimes difficult.
Thom Baguley’s (2004) critique: Positives
• Avoiding low power
• Avoiding excessive power
• Efficient planning
Thom Baguley's (2004) critique: Negatives
• Used retrospectively because SPSS prints something called
power ("fundamentally flawed").
• Standardization and automation
• Ignoring things other than n which affect power
• Treating the effect size as the expected effect size, not the
minimum worth detecting
• Should we be rejecting interval hypotheses rather than point hypotheses?
Journal: Why Dilbert was doomed to fail, assuming all Ratbert is claiming is
that he is different from chance.
• Confidence intervals give you all the information in a p value, and more.
• Still, an odd thing.
• Power is one thing to take into account when deciding on the sample size.
• Do not blindly use Cohen’s conventions.
Last Week's Journal
• Take one of your peers' research statements. Generate a
causal hypothesis of interest and an associative
hypothesis of interest.
• Create a variable that is the average (i.e., the mean) of
two normally distributed variables. Is the average of two
normally distributed variables itself normally distributed?
The sum of Normal variables is normal
The amalgamation is not (in general)
• In one sentence answer the following: Why do we
calculate the mean value for some attribute for our sample?
• Find out how many participants you need if you want to be able to detect
an r of .05 with 80% chance, with alpha = .05. And write down the
number. How about for r = .55?
– Some do with Cohen's tables, some with
G*Power. Talk with peers.
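One way to get these numbers without G*Power: a small base-R function using the Fisher z approximation (the same approximation pwr::pwr.r.test uses; the function name n_for_r is mine):

```r
# n needed to detect a correlation r with the given power and alpha,
# via the Fisher z transformation: n = (z_a + z_b)^2 / atanh(r)^2 + 3
n_for_r <- function(r, power = 0.8, alpha = 0.05) {
  za <- qnorm(1 - alpha / 2)
  zb <- qnorm(power)
  ceiling((za + zb)^2 / atanh(r)^2 + 3)
}
n_for_r(0.05)  # thousands of people for a tiny correlation
n_for_r(0.55)  # a couple of dozen for a large one
```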
• Play with the G*Power plots.
• Why was Dilbert, in the first frame, bound to fail?
• Suppose you have 25 variables distributed Chi-square with three degrees
of freedom and 200 people. var1 <- rchisq(200,3) makes one variable.
Look at hist(var1). Is it skewed? Add up 25 of these variables. Is this sum
skewed? Look at hist(var1 + ... + var25).
(there is a reason we are doing this, and the code is in two pages ... try
first without looking, then look)
[Cartoon: "They don't want any more statistics now. They want to go."]
(but there is a hint for the journal on the next slide)
# Here it is for 25
library(e1071)            # This is for the skewness function
par(mfrow = c(1, 2))      # This makes 2 graphs on 1 screen
x <- rchisq(200, 3)
hist(x)                   # one chi-square(3) variable: clearly skewed
skewness(x)
shapiro.test(x)           # This tests normality
# Shapiro is from FIU
for (i in 2:25) x <- x + rchisq(200, 3)
hist(x)                   # the sum of 25: much closer to normal
skewness(x)
shapiro.test(x)
# Try with sum of 100 variables