Often a set of data is collected, or an experiment carried out, not simply with a view to
summarising the results and estimating suitable parameters but rather in order to test an idea. This
idea or hypothesis may arise by purely theoretical reasoning, or it may be suggested by the results
of earlier experiments.

brief overview:
The way statistics is used to set up our hypothesis to test is a little strange. First we start with
what is called the “Null hypothesis.” This is the assumption that there is no effect of e.g.
experimental treatment, difference in conditions etc. We test this against an alternative
hypothesis: that is the hypothesis we are attempting to support with our data. Generally we hope
that our data shows sufficient differences from the expectations of the null hypothesis to reject it
and so accept our alternative hypothesis. For example, under the null hypothesis we expect no effect
of a drug upon heart rate. Our data show an increase. If that increase is sufficiently large then we
may conclude that the null hypothesis was wrong: there is an effect of this drug, which causes an
increase in heart rate.
(It is not always the case that we hope for a difference; one may hope that there is no difference.
E.g. a tobacco company may wish to show that smoking their cigarettes does not cause an increase
in a certain type of cancer. Rather than hope to reject the null hypothesis, we may hope to be able
to “fail to reject” the null hypothesis.)
         We then use a statistical test to calculate the probability of observing a difference as large
as that obtained or larger given the null hypothesis is true. If the probability is less than some
specified level then we reject the null hypothesis and accept the alternative.

Null hypothesis
The notation commonly used to represent the null hypothesis is Ho, and that of the alternative
hypothesis Ha (or H1). However, you do not often see these explicitly written in scientific papers.
You do sometimes see “the hypothesis we wish to test is…..” However, during your research it is
very useful to state the null hypothesis as you would see it in statistical textbooks.

Start by assuming there is no effect. What then would you expect? You would write something like:

Ho: µd - µp = 0                           where “d” and “p” represent drug and placebo

[more often written as above rather than Ho: µd = µp but they are the same]
[note that the hypotheses are stated in terms of population parameters, not sample statistics.]

may be something like: Ho: ρ = 0              [no correlation]
may be something like: Ho: µ = 2 vs. Ha: µ ≠ 2

If the null hypothesis is rejected then we need an alternative hypothesis to fall back on. (Your
expectations or hypothesis being tested.) This dichotomy is denoted:

Ho: µd - µp = 0 vs. Ha: µd - µp ≠ 0

or it might be something like:

Ho: µd - µp = 0 vs. Ha: µd < µp

Which of these you choose depends upon your expectations. For instance, if you were developing
a drug that reduces heart rate (a beta blocker) then you may hope that µd < µp. Alternatively, you
may wish to show that a change, any change, in heart rate as a side effect of the drug is not
present, in which case Ha: µd - µp ≠ 0 is more appropriate.
[We will get back to these differences.]

Level of significance
The next thing to do is set our level of significance. This is generally 5% or 0.05 for historical
reasons but is essentially arbitrary. [However, there are reasons why we might not want to use
0.05, e.g. the Bonferroni correction, which we will come back to later.]
0.05 = 1/20. This convention comes from one sentence in one paper, but was undoubtedly based upon
experience. [Can I get a quote on this?]

What is the level of significance? This is the probability we are willing to accept of rejecting the
null hypothesis when it is in fact true. 0.05 equals a 1/20 chance. There will always be natural
variation, and so there may be some differences between what we would expect if the null
hypothesis were true and the data we observed. For instance, if we toss a coin 100 times we would
expect 50 heads and 50 tails. If we got 51 heads and 49 tails we would rightly assume that this is
probably not an effect of bias but simply natural variation.
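
The coin example can be checked numerically. The following is a minimal sketch, assuming an exact
two-sided binomial test in which "as extreme or more extreme" means every outcome no more likely
than the one observed. The function name and the second case (65 heads) are illustrative
assumptions, not part of the notes.

```python
from math import comb

def binomial_two_sided_p(heads: int, tosses: int, p_heads: float = 0.5) -> float:
    """Exact two-sided p-value under Ho: the coin lands heads with probability p_heads.

    Sums the probability of every outcome that is no more likely than the
    observed one (one common definition of "as extreme or more extreme").
    """
    probs = [
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(tosses + 1)
    ]
    observed = probs[heads]
    # Small relative tolerance so floating-point ties count as equally extreme.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)

p_51 = binomial_two_sided_p(51, 100)  # the 51 heads / 49 tails case from the text
p_65 = binomial_two_sided_p(65, 100)  # hypothetical, much more lopsided result

print(f"51 heads out of 100: p = {p_51:.3f}")
print(f"65 heads out of 100: p = {p_65:.4f}")
```

With 51 heads the p-value is far above 0.05, in line with the intuition that this is natural
variation; the hypothetical 65 heads gives a p-value well below 0.05.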
         The level of significance sets how stringent we will be about any differences. If it is very
important that we do not make a mistake and reject the null hypothesis when it is true then we set
a very low level of significance, e.g. perhaps we will only accept 1/1000 chance of being wrong.

The level of significance is our preset maximum (critical) level of rejection. It is usually denoted
as α.

We then conduct a statistical test, which gives us a p-value. What is a p-value?

The p-value represents the probability of observing a difference as large as that obtained or larger
given the null hypothesis is true.

The probability, computed assuming that Ho is true, that the test statistic would take a value as
extreme or more extreme than that actually observed is called the p-value of the test. The smaller
the p-value, the stronger the evidence against Ho provided by the data.

Example: our level of significance is 0.05. We are only willing to accept a 1/20 chance of
rejecting the null hypothesis when it is true. Our test gives us a p-value of 0.01. This means that,
if the null hypothesis were true, there would be only a 1/100 chance of seeing a difference between
the observed and expected data at least as large as ours BY CHANCE ALONE. This chance, 1/100, is
below our critical level of 1/20 and so we are confident in our decision to reject the null
hypothesis and accept the alternative.

We have some difference in results, X. This gives us a p-value of 0.02. We are confident that
there is enough evidence to reject the null hypothesis. However, had we had a more extreme
difference, X+c, then we would be even more confident in rejecting the null hypothesis.

“As extreme or more extreme” simply means that if we can reject the null hypothesis with a given
degree of difference, then any greater difference we might see will automatically cause the null
hypothesis to be rejected.

Statistical significance
If the p-value is as small or smaller than α, we say that the data are statistically significant at
level α.
(The term “significant” was introduced by English statistician Francis Y Edgeworth in 1885 as
meaning “corresponds to a real difference in fact.”)

If α = 0.05 and p <= 0.05, we reject the null hypothesis and accept the alternative hypothesis.

If α = 0.05 and p > 0.05, we fail to reject the null hypothesis (loosely, we “accept” the null
hypothesis).

A large p means that a difference as large as the one you saw would be quite likely to arise by
chance if the null hypothesis were true (so we fail to reject the null hypothesis).
A low p means that a difference as large as the one you saw would be unlikely to arise by chance
alone, and thus the data give good evidence that there really is an effect of the experimental
treatment (and so we reject the null hypothesis in favour of the alternative hypothesis).

Terminology                      Probability                z
significant                      5%              *          1.96
highly significant               1%              **         2.58
very highly significant          0.1%            ***        3.29

*   = 0.05 >= p > 0.01
**  = 0.01 >= p > 0.001
*** = p <= 0.001
Usually the study will specify what the number of asterisks refers to.
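
The asterisk convention can be written as a small lookup. This is only a sketch, assuming the
thresholds from the table above (5%, 1%, 0.1%) and treating a p-value equal to a threshold as
significant at that level. The function name and the "ns" label for non-significant results are
assumptions made for illustration.

```python
def significance_stars(p: float) -> str:
    """Map a p-value to the asterisk notation used in the table above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.001:
        return "***"   # very highly significant
    if p <= 0.01:
        return "**"    # highly significant
    if p <= 0.05:
        return "*"     # significant
    return "ns"        # not significant (label assumed here)

for p in (0.20, 0.03, 0.004, 0.0002):
    print(f"p = {p:<7} {significance_stars(p)}")
```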

Concept of p-value is confusing—bit of a double negative: fail to reject null hypothesis.
Technically we should not say accept the alternative hypothesis because there may be other
alternative hypotheses that fit the data better than the null hypothesis.

This is a bit like the Popperian method of falsification (Logic of Scientific Discovery). We cannot
prove something is true but we can certainly prove it is not true. E.g. we cannot prove all swans
are white: we would need to present evidence for every possible swan. However, the presence of a
single black swan is sufficient to disprove the theory. Example 2: we cannot prove the dodo is
extinct. We could only conclusively show the opposite, that a dodo survives. It is the same with
stats. It is difficult to prove that your hypothesis is the best possible hypothesis. However, you
can provide strong evidence against the null hypothesis, i.e. that it is not the case that there is
no effect.

Testing our hypothesis:
1) Specify null hypothesis (Ho) and alternative hypothesis (Ha). The test is designed to assess
    the strength of evidence against Ho. Ha is the statement we will accept if the evidence
    enables us to reject Ho.

2) (optional) Specify the significance level α. This states how much evidence against Ho we
   will regard as decisive. Normally this will be 5%
3) Calculate the value of the test statistic on which the test will be based. This is a statistic that
   measures how well the data conform to Ho.
4) Find the p-value for the observed data. This is the probability calculated assuming that Ho is
   true, that the test statistic will weigh against Ho at least as strongly as it does for the observed
   data. If the p-value is less than or equal to α, the test result is statistically significant at level α.
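
The four steps above can be sketched for the earlier example Ho: µ = 2 vs. Ha: µ ≠ 2. This is a
minimal illustration under stated assumptions: the measurements are made up, and the population
standard deviation is assumed known so that a z-test applies (with an unknown standard deviation a
t-test would be used instead).

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical measurements, invented for this illustration.
sample = [2.3, 1.9, 2.8, 2.6, 2.1, 2.9, 2.4, 2.7, 2.2, 2.5]

# Step 1: Ho: mu = 2 vs. Ha: mu != 2 (two-tailed).
mu_0 = 2.0
sigma = 0.5  # assumed known population standard deviation

# Step 2: significance level.
alpha = 0.05

# Step 3: test statistic, the distance of the sample mean from mu_0
# measured in standard errors.
standard_error = sigma / sqrt(len(sample))
z = (mean(sample) - mu_0) / standard_error

# Step 4: p-value, the probability under Ho of a z at least this far from 0
# in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value <= alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject Ho: {reject_null}")
```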

Type I and type II error.
The p-value is not a measure of the size of an effect. The significance level α is the risk you take
of rejecting the null hypothesis when it is in fact true: how much you are willing to risk being
wrong.
With α = 0.05 there is a 1/20 chance that you might reject the null hypothesis when it is true. This
error is known as a type I error.

                              reject Ho           fail to reject Ho
null hypothesis true          type I (α)          correct decision
null hypothesis false         correct decision    type II (β)

[1-β = power of the test, but we will return to this later.]
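
The 1/20 type I error rate can be illustrated by simulation. The sketch below assumes a two-tailed
z-test on samples drawn from N(0,1), so the null hypothesis (mean 0) is true in every trial and
every rejection is a type I error. The function name, sample size, number of trials and seed are
arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_rejection_rate(alpha: float = 0.05, n: int = 20,
                         trials: int = 5000, seed: int = 1) -> float:
    """Fraction of trials in which a true Ho (mean = 0, sigma = 1) is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * sqrt(n)  # (mean - 0) / (1 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # Ho is true here, so this is a type I error
    return rejections / trials

rate = false_rejection_rate()
print(f"observed type I error rate: {rate:.3f} (alpha = 0.05)")
```

The observed rate should come out close to α, and lowering `alpha` lowers it accordingly.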

There are some situations when you might want to reduce type I errors, and therefore use a smaller
α, e.g. 0.01. For instance, you might want to make a decision about whether to invest 500 million
marks in developing a promising new drug. You would not want to make the mistake of rejecting the
null hypothesis, implying that the drug does work, when the results you got were by chance.

You can easily reduce type I errors by lowering α.

You can reduce type II errors by increasing the sample size, which reduces the variance of the
sample mean.
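
The effect of sample size on type II errors can be sketched for a two-tailed z-test. The true
effect size, the standard deviation and the sample sizes below are assumptions chosen purely for
illustration; the formula is the standard power calculation for a z-test with known standard
deviation.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided_z(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting Ho when the true mean differs from it by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # The test statistic is centred on this value when the effect is real;
    # a larger n shrinks the standard error and so increases the shift.
    shift = effect / (sigma / sqrt(n))
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

sample_sizes = (10, 40, 160)
powers = [power_two_sided_z(effect=0.3, sigma=1.0, n=n) for n in sample_sizes]
for n, power in zip(sample_sizes, powers):
    print(f"n = {n:>3}: power = {power:.3f}, type II error (beta) = {1 - power:.3f}")
```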

One tailed and two tailed tests
Earlier I stated that the steps are:
1) set α
2) calculate test statistic
3) look up p-value

The probability listed in the tables does not necessarily correspond to α. It depends upon the
alternative hypothesis, namely whether the test is one-tailed or two-tailed.

if Ha: µd - µp ≠ 0 then we are not worried which direction the difference is in, simply that there is
some difference.

if Ha: µd > µp then we are explicitly stating which direction the expected difference is in; a large
difference in the opposite direction will cause us to fail to reject the null hypothesis in the same
way as a small difference.

The former is a two-tailed test. We test the test statistic at α/2 in the tables.
The latter is a one-tailed test. We test the test statistic at α in the tables.

This is partly why you need to be very clear about setting down your null and alternative
hypotheses as it can alter the results at the final stage, looking up p-values.

For instance, in Minitab when using a 2-sample t-test (a test to look at the difference in means)
you must specify whether it is one- or two-tailed by choosing the right option:
less than
not equal [default]
greater than
for the alternative hypothesis.
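
How the choice of alternative changes the p-value can be sketched with a z statistic. This is not
Minitab's code: the function name is an assumption, the option strings simply mirror the three
choices listed above, and a standard normal test statistic is assumed rather than a t statistic.

```python
from statistics import NormalDist

def p_value_from_z(z: float, alternative: str = "not equal") -> float:
    """p-value for a standard normal test statistic under the chosen alternative."""
    std_normal = NormalDist()
    if alternative == "less than":      # Ha: difference < 0, lower tail only
        return std_normal.cdf(z)
    if alternative == "greater than":   # Ha: difference > 0, upper tail only
        return 1 - std_normal.cdf(z)
    if alternative == "not equal":      # Ha: difference != 0, both tails
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

z_observed = 1.80  # hypothetical test statistic
p_two_tailed = p_value_from_z(z_observed, "not equal")
p_greater = p_value_from_z(z_observed, "greater than")
p_less = p_value_from_z(z_observed, "less than")

print(f"not equal:    p = {p_two_tailed:.4f}")
print(f"greater than: p = {p_greater:.4f}")
print(f"less than:    p = {p_less:.4f}")
```

For this hypothetical z, the one-tailed "greater than" test is significant at α = 0.05 while the
two-tailed test is not, which is why the alternative hypothesis must be fixed before looking at
the data.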

Z scores
We can formulate all the above examples in terms of z scores, which ultimately is how statistical
packages will calculate them.
For 2-tailed tests: the null hypothesis is that the given value of z has been picked at random from
N(0,1). Test: inspect the value of z, and if it is less than –1.96 or greater than 1.96 (i.e. if
|z| > 1.96) reject the null hypothesis at the 5% level of significance. If |z| > 2.58, reject the
null hypothesis at the 1% level, and if |z| > 3.29 reject the null hypothesis at the 0.1% level.
Values for single-tailed tests: 95%: 1.645; 99%: 2.326; 99.9%: 3.09 (i.e. the 5%, 1% and 0.1% levels).
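
The critical values quoted above can be reproduced from the standard normal distribution. A short
sketch using Python's standard library; the helper name `critical_z` is an assumption for
illustration.

```python
from statistics import NormalDist

def critical_z(alpha: float, two_tailed: bool = True) -> float:
    """Value of z beyond which Ho is rejected at significance level alpha."""
    # A two-tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha:<5}  two-tailed: {critical_z(alpha):.3f}"
          f"  one-tailed: {critical_z(alpha, two_tailed=False):.3f}")
```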
