# Quick Review of Hypothesis Testing


In this lecture we will quickly review the following:

- The basic one-sample t-test as an example
- The decision procedure
- Type I and Type II error
- OC curves and sample size selection
- Practical vs. statistical significance
- The relationship between confidence intervals and hypothesis tests
A hypothesis test has two basic components:

1. Hypotheses:
   Null hypothesis H0
   Alternative hypothesis H1 (essentially "not H0")
   e.g.
   H0: μ = 10
   H1: μ ≠ 10
2. Decision criteria

It works like this:

Sample data --> criteria --> reject or fail to reject H0
There are four possible outcomes from a Hyp. test:

1. Fail to reject H0 when H0 is true (we made the
correct decision)

2. Reject H0 when H0 is indeed false (right again)

3. Reject H0 when H0 is true (a Type I error)

4. Fail to reject H0 when H0 is false (a Type II error)
Define:

α = Pr{we make a Type I error}
  = Pr{reject H0 | H0 true}

β = Pr{we make a Type II error}
  = Pr{fail to reject H0 | H0 is false}

How good a test is, is determined by these two probabilities.
Example: Recall the one-sample t-test

Assumptions:

1. The population is NID(μ, σ²)
2. μ and σ are unknown population parameters

H0: μ = μ0, where μ0 is some specified constant
H1: μ ≠ μ0
Decision criteria:

Compute the T statistic from the data:

T0 = (Xbar - μ0) / (S / Sqrt(n))

If |T0| > Tc then reject H0

(Tc is a “critical” value from a table)
Why does this work?

1. The test statistic "measures something significant"
about H0. If the sample average is far from the
hypothesized value, H0 is likely to be false.

2. We know the distribution of T0 when H0 is true.

In this example, the test statistic T0 will be close to zero
when H0 is true.
Conversely, T0 will be far from zero when H0 is false.
Thus it measures something significant about H0.

If T0 is far enough away from zero, we can reject H0.
But how far away from zero is far enough to reject H0?

That's why we need the distribution of T0 when H0 is
true.

[picture of T distribution]
For example, when H0 is true, and we have n=11
observations (so 10 degrees of freedom), then

Pr{-3.169 < T0 < +3.169} = 0.99      (from t-tables)

Thus I set my criteria to be: "reject H0 if |T0| > 3.169"

Then I only have a 0.01 probability of a Type I error.
Thus to summarize the test procedure:

Take a sample of n observations X1, X2, ..., Xn

Compute the sample average

Compute the sample standard deviation

Compute T0

If |T0| > t(α/2, n-1) then reject H0
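The whole procedure above can be sketched as a small Python function (a minimal illustration, not part of the original notes: the sample data below are made up, and the critical value t(α/2, n-1) is passed in from a t-table, as in the lecture):

```python
import math
import statistics

def one_sample_t_test(data, mu0, t_crit):
    """Two-sided one-sample t-test; t_crit is t(alpha/2, n-1) from a table."""
    n = len(data)
    xbar = statistics.mean(data)                 # sample average
    s = statistics.stdev(data)                   # sample standard deviation
    t0 = (xbar - mu0) / (s / math.sqrt(n))       # the T statistic
    return t0, abs(t0) > t_crit                  # (statistic, reject H0?)

# Made-up sample of n = 11 observations; t(0.005, 10) = 3.169 as in the notes
sample = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7, 10.4, 10.2, 9.9]
t0, reject = one_sample_t_test(sample, mu0=10, t_crit=3.169)
```

Here the sample average is close to 10, so |T0| stays well below the critical value and we fail to reject H0.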
"P-values" of tests

We can actually report results 2 ways:

1. State α ahead of time, and report whether we reject H0 or not.

2. After analysis, state the value of α which is on
the border between reject and do not reject.

This is the significance level, or "P-value" of the test.
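The exact P-value comes from the t distribution. As an illustration (not from the notes; the data are made up), it can be approximated with only the Python standard library by simulating the test statistic under H0, which works because the null distribution of T0 does not depend on the unknown σ:

```python
import math
import random
import statistics

def t_stat(data, mu0):
    n = len(data)
    return (statistics.mean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))

# Made-up observed data; hypothesized mean mu0 = 10
observed = [10.4, 10.1, 9.9, 10.6, 10.3, 10.2, 10.5, 10.0, 10.4, 10.1, 10.3]
t_obs = abs(t_stat(observed, mu0=10))

# Simulate samples from the null: NID(mu0, sigma^2).  Any sigma works,
# since the null distribution of T0 is free of sigma.
random.seed(1)
n_sims = 5000
exceed = sum(
    abs(t_stat([random.gauss(10, 1) for _ in range(len(observed))], 10)) >= t_obs
    for _ in range(n_sims)
)
p_value = exceed / n_sims    # estimated two-sided P-value
```

The P-value is the fraction of null-simulated statistics at least as extreme as the one we observed.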
But what about Type II error and

β = Pr{fail to reject H0 given H0 is false}?

A. We can always trade off α and β:

e.g. reject H0 if |T0| > 0 (always reject)

Then β = 0 (never fail to reject a false H0)
and α = 1 (always reject a true H0)

Conversely......
B. β depends on "how false" H0 is

Example: H0: μ = 10

Pr{fail to reject H0 | μ = 10.0001} is different from

Pr{fail to reject H0 | μ = 999999.0}

C. For a constant α, there is only one way to
decrease β. Who can guess what it is?
Given the true value of μ (say μ1), the distribution of the
test statistic T0 is also known. (It is called the
"noncentral t distribution".)

We can get β probabilities from it. Someone did, and put
the results in the back of many stats books.

Thus, given μ0, μ1, σ, α, and n, we can find β.
More directly, let δ = |μ1 - μ0| / σ

Given δ, n, and α, we can find β using the OC curve.

Note that σ is unknown, and therefore must be
estimated from the data by S.
Example of using OC curves:

Suppose σ = 1, n = 10, μ1 = 1, μ0 = 2

Note that H0: μ = μ0 is false, since μ1 = 1, not 2

Find β, given that we have set α to 0.05
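Instead of reading β off an OC curve, it can also be estimated by simulation. A minimal sketch of this example (not from the notes; the critical value t(0.025, 9) = 2.262 is taken from a table):

```python
import math
import random
import statistics

# Settings from the example: sigma = 1, n = 10, mu1 = 1, mu0 = 2, alpha = 0.05
mu0, mu1, sigma, n = 2.0, 1.0, 1.0, 10
t_crit = 2.262                       # t(0.025, 9) from a table

random.seed(2)
n_sims = 5000
fail_to_reject = 0
for _ in range(n_sims):
    # Draw a sample from the TRUE distribution, where H0 is false
    data = [random.gauss(mu1, sigma) for _ in range(n)]
    t0 = (statistics.mean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))
    if abs(t0) <= t_crit:            # two-sided test fails to reject H0
        fail_to_reject += 1

beta_hat = fail_to_reject / n_sims   # estimate of beta = Pr{fail to reject | H0 false}
```

Each replication applies the α = 0.05 test to data whose true mean is μ1; the fraction of failures to reject estimates β.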
Example 2:

Suppose |μ1 - μ0| = 1 represents a big difference
from a practical point of view.

You decide that you want to test the hypothesis
H0: μ = μ0 and run a risk of Type I error of 0.05.

Also, you decide that you want to detect whether the true mean
differs from μ0 by 1 or more, with high probability.

Specifically, you want β = 0.1. How big should n be?
Summary of determining sample size:

1. Decide on an appropriate value of α

2. Have at least an estimate of σ

3. Specify a practically significant difference from
μ0 that you want to be able to detect.

4. Specify the value 1 - β with which you want to
detect it.
IMPORTANT POINT

There is a big difference between:

- Statistically significant
- Practically significant

For example, you can state that the means of two
populations are different:

H0: mean pop1 = mean pop2 --> reject H0
If you take very large samples, even very small
differences in means can be detected.

Indeed, the true population means could be
100.00001 and 100.0002 (in σ units if you insist)

If you took a big enough sample, you could
reject H0 and run only a small risk of (Type I) error.
For example suppose we are interested in the mean IQ
scores for the two populations "men" and "women".....

This is a classic misuse of statistics: Showing statistical
significance and implying practical significance.
The relationship between confidence intervals and
Hyp Tests:

For every hypothesis test there is a CI, and vice versa:

In a previous example we found that for n=11:

Pr{ -3.169 < T0 < 3.169 } = 0.99

With a little algebra, we can change this to a CI as
follows:

Pr{ -3.169 ≤ (Xbar - μ)/(S/Sqrt(n)) ≤ 3.169 } = 0.99

⇒ Pr{ -3.169·S/Sqrt(n) ≤ Xbar - μ ≤ 3.169·S/Sqrt(n) } = 0.99

⇒ Pr{ -Xbar - 3.169·S/Sqrt(n) ≤ -μ ≤ -Xbar + 3.169·S/Sqrt(n) } = 0.99

⇒ Pr{ Xbar - 3.169·S/Sqrt(n) ≤ μ ≤ Xbar + 3.169·S/Sqrt(n) } = 0.99
Conversely, suppose we wanted to test

H0: μ = 10 at α = 0.01

We could first form a 99% CI on μ.

If the CI contains 10, then the hypothesis test will fail to
reject H0.
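In code, with made-up sample statistics (a minimal sketch, not part of the notes; t(0.005, 10) = 3.169 as above):

```python
import math

# Made-up sample statistics for n = 11 observations
xbar, s, n = 10.5, 1.2, 11
t_crit = 3.169                                 # t(0.005, 10) from a table

half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)    # 99% CI on mu

# The alpha = 0.01 test of H0: mu = 10 fails to reject
# exactly when 10 lies inside this interval.
contains_10 = ci[0] <= 10 <= ci[1]
```

With these numbers the interval contains 10, so the corresponding test fails to reject H0: μ = 10.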
Examples of T- tests

Single Sample Two-sided T-Test

Tennis balls must have the correct "bounciness".
Specifications dictate that when dropped from 200cm
onto concrete, the ball should rebound to a height of
150cm.
15 "Brand X" balls are obtained and tested. The results
are:

Xbar = 152.18 cm           S² = 16.63

H0: μ = 150
H1: μ ≠ 150
We decide that α = 0.05 is an appropriate Type I error
risk.

From the t-tables, Tc = t(α/2, n-1) = t(0.025, 14) = 2.145

Next we compute T0 = (152.18 - 150)/Sqrt(16.63/15) =
2.0704

Since |2.0704| < 2.145, we cannot reject H0: μ = 150
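The arithmetic above can be checked directly (a sketch using the numbers from the example):

```python
import math

# Tennis-ball example: n = 15, Xbar = 152.18, S^2 = 16.63, mu0 = 150
xbar, s2, n, mu0 = 152.18, 16.63, 15, 150
t0 = (xbar - mu0) / math.sqrt(s2 / n)
t_crit = 2.145                       # t(0.025, 14) from the table
reject = abs(t0) > t_crit            # False: we cannot reject H0
```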
Does this mean H0 is true?

NO!

What it means is that the true unknown mean μ is
sufficiently close to 150 that our experiment was not able
to tell the difference!
Suppose that if the true mean is 2 or more cm away from
150cm, then the balls are inadequate for sale. We want
to know the following:

Suppose the true mean was 2 cm away from 150. What
is the probability that our experiment will reject H0?

Our measure of "how false" H0 is:
δ = |μ - μ0| / σ = 2/Sqrt(16.63) = 0.49042

Note: σ is estimated by Sqrt(S²) here
From the OC curve, Pr{Accept H0} = β ≈ 0.6

Our test is insufficient for the desired purpose!
What to do? We need a larger sample size. How large?

Suppose we want to detect a difference of 2 cm with
probability 1 - β = 0.9. What sample size do we need?

From the OC Curves for the two-sided T test, we need 40
to 50 observations.
Note that accepting H0, a.k.a. failing to reject H0, is the
"weak conclusion". It does not mean that H0 is true.
Rather, it means "we do not have enough data to reject H0."

Rejecting H0 is the strong conclusion. If we reject H0
then we have proven (statistically) that the mean is not
150. Proven in this case means that there is less than a
0.05 probability that we are wrong.
One-sided Tests

Suppose that all we care about is that the ball rebounds at
least 150cm.

We would like to prove (statistically) that brand X meets
this criterion. We thus set up a one-sided hypothesis test.

H0: μ = 150
H1: μ > 150

Since rejecting H0 in favor of H1 is the strong conclusion,
a rejection would give us strong evidence that the criterion is met.
The T-statistic for this test is the same:

T0 = (Xbar - μ0)/Sqrt(S²/n)
= (152.18 - 150)/Sqrt(16.63/15) = 2.0704

Note that if T0 is negative, then Xbar < 150.

Surely we do not want to reject H0 in favor of H1!

Only if T0 is sufficiently large will we reject H0.
The one-sided criterion in this case is: reject H0 if

T0 > Tc = t(α, n-1) = t(0.05, 14) = 1.761

Since 2.0704 > 1.761, we can say that μ > 150, and run a 0.05
risk of being wrong.
Two Sample Tests: The Pooled T test

Now suppose we have two populations and we want to
see if they have the same mean.

Example: Exxon claims their gas gives better gas
mileage than Shell's gas. We want to try and prove it.
We get 40 drivers, and let 20 use Exxon gas and 20 use
Shell gas, each recording their gas mileage.
We set up a one-sided hypothesis test in order to prove
the claim:

H0: μE = μS
H1: μE > μS
The results are as follows:

XbarE = 25.3     XbarS = 21.4

S²E = 15.1       S²S = 14.8

Assuming that the two populations have the same
variance σ², the two-sample T statistic is:

T0 = (XbarE - XbarS) / [Sp·Sqrt(1/nE + 1/nS)]

where nE and nS are the numbers of drivers using Exxon
and Shell,

and S²p = [(nE - 1)·S²E + (nS - 1)·S²S] / (nE + nS - 2)

S²p is an estimate of the variance σ². It is basically a
weighted average of the two sample variances.

To be confident that H1: μE > μS is true, clearly T0 needs
to be large enough. Our decision criterion here is:

Reject H0 if T0 > Tc = t(α, nE + nS - 2) = t(0.05, 38) ≈ 1.684
Calculating T0 yields:

S²p = [(20 - 1)(15.1) + (20 - 1)(14.8)] / (20 + 20 - 2)

    = (286.9 + 281.2)/38 = 14.95;  Sp = Sqrt(14.95) = 3.866

T0 = (25.3 - 21.4) / [3.866·Sqrt(1/20 + 1/20)] = 3.19
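The same calculation in code (a sketch using the numbers above):

```python
import math

# Pooled t-test arithmetic from the gas-mileage example
xbar_e, xbar_s = 25.3, 21.4
s2_e, s2_s = 15.1, 14.8
n_e, n_s = 20, 20

# Pooled variance: weighted average of the two sample variances
s2_p = ((n_e - 1) * s2_e + (n_s - 1) * s2_s) / (n_e + n_s - 2)
sp = math.sqrt(s2_p)
t0 = (xbar_e - xbar_s) / (sp * math.sqrt(1 / n_e + 1 / n_s))
reject = t0 > 1.684                  # t(0.05, 38) from a table
```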
Since 3.19 > 1.684, we reject H0 and conclude Exxon
does indeed offer better gas mileage.

Notes: This test assumes the populations have equal
variance, and then pools S² from each sample to estimate
the variance σ². Thus the name "pooled t-test".

If we cannot assume equal variance, there are other
approximate methods.

This test assumes all (40) observations are independent.
Also observe the decision criteria:
Reject H0 if T0 > Tc = t(α, nE + nS - 2) = t(0.05, 38) ≈ 1.684

The value nE + nS - 2 is called the "degrees of freedom"
of the test.

Note from the t-tables that as the degrees of freedom
increase, the critical value Tc decreases, thus making the
strong conclusion (reject H0) more easily reached.
Paired T-test.

Consider the previous example. We had 40 people with
40 cars split into 2 groups of 20. Perhaps a better idea is
to take 20 people (and their cars) and have each person
try both Exxon and Shell gas.

We would get 20 observations on each type of gas.
Unfortunately the observations are not independent, and
the pooled t-test cannot be applied.
Here is what we can do: For each subject i, i=1,…,20

Let di = the difference in mileage between Exxon and
Shell for that customer.
We now have 20 observations on di.

We set up the following hypothesis test:

H0: μd = 0
H1: μd > 0
If the mean difference is >0 then Exxon offers better
mileage!

Also note that we can apply the one sample t-test here on
the di.

Reject H0 if T0 > Tc = t(α, n-1), where n = 20 in this example

It seems reasonable that we would get a smaller variance
using this particular design. That is, we expect:

S²d < S²p, because we are comparing apples with apples.

However, we are losing degrees of freedom (from 38 to
19).

Use the paired t-test if you expect a significant reduction in
variance; otherwise use the pooled t-test.
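A sketch of the paired t-test (not from the notes: the per-driver differences below are made up for illustration, and t(0.05, 19) = 1.729 is taken from a table):

```python
import math
import statistics

# Made-up mileage differences d_i = (Exxon - Shell) for n = 20 drivers
d = [1.2, 0.8, -0.3, 1.5, 0.9, 0.4, 1.1, 0.2, 0.7, 1.3,
     0.6, -0.1, 0.9, 1.0, 0.5, 1.4, 0.3, 0.8, 1.1, 0.6]

n = len(d)
dbar = statistics.mean(d)
s_d = statistics.stdev(d)            # standard deviation of the differences
t0 = dbar / (s_d / math.sqrt(n))     # one-sample t on the d_i, with mu0 = 0
t_crit = 1.729                       # t(0.05, 19) from a table
reject = t0 > t_crit                 # reject H0: mu_d = 0 in favor of mu_d > 0
```

Because each driver serves as their own control, the variance of the differences is typically much smaller than the pooled variance, at the cost of half the degrees of freedom.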
