# Hypothesis Testing

```					    Econ 413
Parks
Hypothesis Testing
Page 2 of 20                               Econ 413                         Hypothesis Testing

Hypothesis Testing

A statistical hypothesis is a set of assumptions about a model of observed data.
Example 1: The number of heads of 11 coin flips are random and distributed as a
binomial with probability 0.5 and n=11 (recall the binomial distribution has two
parameters, the probability of success and the number of trials).
Example 2: Income is distributed as a normal random variable with mean μ and
variance σ².
Example 3: Y = a + b*X + ε and the seven classical assumptions are true.
Example 1 completely specifies the distribution of the data (number of heads). It
is called a simple hypothesis. Examples 2 and 3 have unknown parameters and so do
not completely specify the distribution of the data. They are called complex hypotheses.
A statistical hypothesis test is a decision about a statistical hypothesis. The
decision is to accept or reject one hypothesis versus another. In order to make a
statistical hypothesis test, one needs to specify two hypotheses: the maintained and the
alternate. Either or both can be simple or complex.
Most books call one hypothesis the null hypothesis. I have three reasons to use
the word maintained rather than null:
1.        Null is defined as amounting to nothing, having no value, and being 0
(among other definitions). Often the labeling of the null hypothesis is
H0 and I suppose that null hypothesis was used in preference to zero
hypothesis or naught hypothesis. Still null is not a good descriptive
word.
2.        You have learned things about the null hypothesis which may or may
not be true. Using maintained hypothesis starts us off on a neutral path.
3.        Maintained hypothesis may, I hope, remind you that the maintained
hypothesis usually has many assumptions. Some books do use
maintained hypothesis to include all but one assumption and then use
null for the ‘last’ assumption. I find that distinction confusing.
A statistical hypothesis test specifies a critical region – a set of numbers. If the
observed data is in the critical region, then reject the maintained hypothesis. If the
observed data is NOT in the critical region, then accept the maintained hypothesis.
Example 1 test: Let the critical region be 0, 1, 2, 9, 10, 11 heads. If you flip the
coin 11 times, reject the maintained hypothesis that the number of heads is a binomial
distribution with probability .5 of heads and n=11 if you observe 0, 1, 2, 9, 10, or 11 heads.
Accepting the maintained hypothesis does not prove it to be true and rejecting the
maintained hypothesis does not prove it to be false. Similarly, accepting the alternative
hypothesis does not prove it to be true and rejecting the alternative does not prove it to be
false. A statistical test can prove nothing.
I believe that many authors use 'fail to reject' so that students will not think that
something was proved with a statistical hypothesis. But the only meaning that 'fail to
reject' can have in statistical hypothesis testing is accept. The outcome of a statistical
hypothesis test is BINARY – only two outcomes. The wording 'fail to reject'
connotatively conveys something different than 'accept' because in English we often use a
double negative to convey something other than a binary outcome.
There are only two outcomes of a statistical test. The data is either in the critical
region or it is not in the critical region. If the data is NOT in the critical region, you
accept the maintained. You reject the alternative. Reject the alternative must mean
accept the maintained. Fail to reject the maintained must mean accept the maintained.
If 'fail to reject' had any real meaning other than accept, then 'fail to accept'
would also have a different meaning. Now you would have four outcomes: accept the
maintained, reject the maintained, fail to reject the maintained, fail to accept the
maintained. But there are only two outcomes of a statistical test: the data is either in the
critical region or it is not in the critical region. The only outcomes are to accept the
maintained (reject the alternative) or accept the alternative (reject the maintained). Fail to
reject must mean accept and fail to accept must mean reject.
There is a second reason not to use 'failed to reject'. The wording taints the
discussion with a connotation that you are trying to reject (and failed). Whether you want
to accept or reject a statistical hypothesis is outside of the discussion of statistical
hypotheses. Want is a normative concept. I never use 'failed to reject' or 'failed to accept'
(except in moments of brain failure). You should strive to avoid them also. I never want to accept or
reject a hypothesis unless someone is paying me money, reputation, or other reward
(which then makes me want). You will not want to accept or reject a hypothesis in this
course. Your grade does not depend on whether the hypothesis is accepted or rejected,
but rather on what you do with that acceptance or rejection.
For some statistical hypotheses, there are many tests. Homoscedasticity, non
serially correlated errors and even normality of a random variable are each a hypothesis
with many different tests. In some sense, saying ‘fail to reject’ means that the current test
accepted the maintained but some other test might reject the maintained. Then ‘fail to
reject’ is not about a hypothesis test, but about many hypothesis tests. But in such a case
there are many critical regions (as many as there are tests). ACCEPT or REJECT is
about one single critical region. We will discuss distinguishing among hypothesis tests,
but we will not use ‘fail to reject’. I never use fail to reject and never use fail to
accept.
If accepting a hypothesis does not prove the hypothesis, then what does it do?
Acceptance allows one to proceed as if the hypothesis were true. But there are two
outcomes: Accepting a true hypothesis and accepting a false hypothesis. Accepting a
true hypothesis would be a 'correct' decision while rejecting a true hypothesis would be
an 'incorrect' decision. Rather than call it incorrect, we call it an error.
Type I and Type II errors
There are two types of errors because there are two hypotheses (maintained and
alternative):
1.         Type I error: Reject a true maintained hypothesis = accept a false
alternative hypothesis. Recall by definition, reject the maintained
(accept the alternative) occurs if and only if observed data is in the
critical region.
2.        Type II error: Reject a true alternative = accept a false maintained
hypothesis. Recall by definition, accept the maintained (reject the
alternative) occurs if and only if the observed data is NOT in the critical
region.
In classical statistical hypothesis testing a hypothesis is true or false. Hypotheses
do not have a probability of being true or false. But there is a probability of making a
Type I error and a probability of making a Type II error.
The probability of making a Type I error is the probability that the data is in the
critical region conditional upon assuming the maintained hypothesis.
The probability of making a Type II error is the probability that the data is NOT
in the critical region conditional upon assuming the alternative hypothesis.
Example 1: The critical region is 0, 1, 2, 9, 10, 11. The probability of 0, 1, 2, 9,
10, or 11 heads occurring given the number of heads is a Binomial (0.5, 11) is 0.0005 +
0.0054 + 0.0269 + 0.0269 + 0.0054 + 0.0005 = 0.0654. These probabilities are obtained
from using the Excel function BINOMDIST – e.g., for two heads I used
=BINOMDIST(2,11,0.5,FALSE) . The probability of a Type I error for this critical region is
0.0654. It is the probability that we observe 0, 1, 2, 9, 10, or 11 heads in 11 flips
assuming the flips are a binomial distribution with n=11 and p=0.5. If we observe 0, 1, 2,
9, 10, or 11 heads we reject the maintained hypothesis and accept the alternative
hypothesis. If we observe 3, 4, 5, 6, 7, or 8 heads we accept the maintained hypothesis
and reject the alternative hypothesis.
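A quick check of this arithmetic, assuming only the binomial formula with n=11 and p=0.5:
$$P(X = k) = \binom{11}{k} (0.5)^{11} = \binom{11}{k} / 2048$$
so P(X=0) = 1/2048 ≈ 0.0005, P(X=1) = 11/2048 ≈ 0.0054 and P(X=2) = 55/2048 ≈ 0.0269. By
symmetry the upper tail 9, 10, 11 has the same total, and
$$P(\text{Type I}) = 2 \times (1 + 11 + 55) / 2048 = 134 / 2048 \approx 0.0654$$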
What is the alternative hypothesis? In this example 1, and in most real situations,
the alternative hypothesis is the negation of the maintained hypothesis. In this example 1,
the alternative hypothesis is the number of heads was not generated according to a
binomial distribution with probability 0.5 and n=11. We know that there were 11 trials.
We do not know whether it was a binomial distribution and we do not know whether the
probability was 0.5.
One alternative is the data was generated by a different distribution. For example,
the data could have been generated by flipping the coin until 2 heads were observed and
it took 11 trials. That distribution (flipping until a certain number of successes is
observed) is called the negative binomial.
Another more common alternative hypothesis in example 1 is that the distribution
is binomial but the probability is not 0.5.
Unless the alternative hypothesis is specified we cannot know what the
probability of a Type II error (reject a true alternative) is. In most real life cases, the
alternative is complex and hence the probability of Type II error is unknown.
Sometimes we can calculate the probability of a Type II error. In example 1, if
we specify the alternative is a binomial distribution, then we can calculate the probability
of Type II error for each probability from 0 to 1. The following table does this for seven
different critical regions.

Critical regions (CR), with the probability of a Type I error for each:

CR   Critical region    Acceptance region       P(Type I)
1    0,1,2,9,10,11      3,4,5,6,7,8             0.065
2    0,1,10,11          2,3,4,5,6,7,8,9         0.012
3    0,1,2              3,4,5,6,7,8,9,10,11     0.033
4    0,1,2,3            4,5,6,7,8,9,10,11       0.113
5    8,9,10,11          0,1,2,3,4,5,6,7         0.113
6    9,10,11            0,1,2,3,4,5,6,7,8       0.033
7    1,3,7,9            0,2,4,5,6,8,10,11       0.274

P(Type II) for various alternative hypotheses (binomial probability of heads):

Alternative   CR1     CR2     CR3     CR4     CR5     CR6     CR7
0.1           0.090   0.303   0.090   0.019   1.000   1.000   0.545
0.2           0.383   0.678   0.383   0.161   1.000   1.000   0.541
0.3           0.687   0.887   0.687   0.430   0.996   0.999   0.632
0.4           0.875   0.969   0.881   0.704   0.971   0.994   0.721
0.5           0.935   0.988   0.967   0.887   0.887   0.967   0.726
0.6           0.875   0.969   0.994   0.971   0.704   0.881   0.651
0.7           0.687   0.887   0.999   0.996   0.430   0.687   0.576
0.8           0.383   0.678   1.000   1.000   0.161   0.383   0.594
0.9           0.090   0.303   1.000   1.000   0.019   0.090   0.771

[Figure: Probability of Type II error for two critical regions (CR1, P(Type I) = 0.065;
CR2, P(Type I) = 0.012) and various binomial parameters.]

The last seven columns are different critical regions. The probability of a Type I
error is the same for critical regions 3 and 6 and the same for critical regions 4 and 5 due
to the symmetry of the example. Comparing critical regions 1 and 2 we note that critical
region 1 has a larger Type I error and for each value of the alternative binomial, a smaller
Type II error than critical region 2 (the graph makes the comparison easy).
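To see where one entry of the table comes from, here is a sketch of the calculation for
critical region 1 when the alternative probability of heads is 0.1. A Type II error occurs
when the number of heads X lands in the acceptance region 3, 4, 5, 6, 7, 8:
$$P(\text{Type II} \mid p = 0.1) = P(3 \le X \le 8) = 1 - P(X \le 2) - P(X \ge 9)$$
$$P(X \le 2) = (0.9)^{11} + 11 (0.1)(0.9)^{10} + 55 (0.1)^2 (0.9)^{9} \approx 0.3138 + 0.3835 + 0.2131 = 0.9104$$
P(X ≥ 9) is smaller than 0.0000001, so P(Type II) ≈ 1 - 0.9104 ≈ 0.090, the entry for critical
region 1 at 0.1. In Excel the same number should come from
=BINOMDIST(8,11,0.1,TRUE)-BINOMDIST(2,11,0.1,TRUE), using the cumulative form of the
function mentioned earlier.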
There is a trade off between the probability of Type I and probability of Type II
for most common hypothesis tests. You can decrease the probability of a Type I error but
at the expense of increasing the probability of a Type II error.
A theoretical result (the Neyman-Pearson lemma) is: for testing a simple hypothesis against a
simple hypothesis, there exists a critical region that has the lowest possible probability of a
Type II error for a fixed probability of Type I error. That is a beautiful result. It allows us to choose one test
given a simple versus simple situation. Unfortunately, in econometrics, both the
maintained and the alternative are usually complex hypotheses and we have no such easy
result.
The size (or significance level) of a statistical test is the probability of a Type I
error. The power of a test is 1 minus the probability of a Type II error. For a given size,
we want a statistical test with highest power we can obtain. For most tests we encounter,
we specify a size, we obtain a critical region and theoretical results indicate what
alternatives have high power and what alternatives do not. For most tests we encounter,
we never know the probability of a Type II error. We rely on prior research to tell us
what tests are powerful against what alternatives.
Both the size (sometimes called the significance level) and the power are
probabilities of the critical region. The difference is the assumption made to compute the
probability. For the size, the probability is computed assuming the maintained
hypothesis. For the power, the probability is computed assuming some alternative
hypothesis.
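In symbols (a sketch; X for the observed number of heads in Example 1 and CR for a critical
region are notation introduced here only for illustration):
$$\text{size} = P(X \in CR \mid \text{maintained hypothesis})$$
$$\text{power}(p) = P(X \in CR \mid \text{alternative } p) = 1 - P(\text{Type II} \mid p)$$
For example, for the critical region {0,1,2} and the alternative p = 0.3,
$$\text{power}(0.3) = (0.7)^{11} + 11 (0.3)(0.7)^{10} + 55 (0.3)^2 (0.7)^{9} \approx 0.0198 + 0.0932 + 0.1998 \approx 0.313$$
which is 1 minus the 0.687 probability of Type II error listed above for that region at 0.3.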
The following table shows the powers for the critical regions we used above.

CR   Critical region    Acceptance region       P(Type I)
1    0,1,2,9,10,11      3,4,5,6,7,8             0.065
2    0,1,10,11          2,3,4,5,6,7,8,9         0.012
3    0,1,2              3,4,5,6,7,8,9,10,11     0.033
4    0,1,2,3            4,5,6,7,8,9,10,11       0.113
5    8,9,10,11          0,1,2,3,4,5,6,7         0.113
6    9,10,11            0,1,2,3,4,5,6,7,8       0.033
7    1,3,7,9            0,2,4,5,6,8,10,11       0.274

Power for various alternative hypotheses (binomial probability of heads):

Alternative   CR1     CR2     CR3     CR4     CR5     CR6     CR7
0.1           0.910   0.697   0.910   0.981   0.000   0.000   0.455
0.2           0.617   0.322   0.617   0.839   0.000   0.000   0.459
0.3           0.313   0.113   0.313   0.570   0.004   0.001   0.368
0.4           0.125   0.031   0.119   0.296   0.029   0.006   0.279
0.5           0.065   0.012   0.033   0.113   0.113   0.033   0.274
0.6           0.125   0.031   0.006   0.029   0.296   0.119   0.349
0.7           0.313   0.113   0.001   0.004   0.570   0.313   0.424
0.8           0.617   0.322   0.000   0.000   0.839   0.617   0.406
0.9           0.910   0.697   0.000   0.000   0.981   0.910   0.229

We can graph the power of the test used in Example 1 just as we graphed the
probability of Type II error.

The graph shows two power curves with the same size, namely CR3={0,1,2} and
CR6={9,10,11}. Note that CR3 is more powerful for alternative probabilities of heads less
than .5 and CR6 is more powerful for alternative probabilities of heads greater than .5.

Test CR2={0,1,10,11} has a smaller size (0.012) than CR3 or CR6 (.033). CR2 is less
powerful for some alternatives and more powerful for other alternatives than either CR3
or CR6. For alternatives .45 to .55, CR2 is less powerful than either CR3 or CR6.

[Figure: power curves for CR2 (P(Type I)=.012), CR3 and CR6 (P(Type I)=.033), and CR7
(P(Type I)=.274); horizontal axis from 0 to 1, vertical axis from 0.000 to 1.000.]

To illustrate that power decreases as size decreases, compare the graph of
CR1={0,1,2,9,10,11} and CR2={0,1,10,11}. The size of CR1 is .065 and the size of CR2
is .012. For every alternative probability of heads, CR1 has greater power but its size is
also greater. This shows the trade off between size (we want smaller size) and power (we
want greater power). Smaller size comes with smaller power for a given test.

[Figure: power curves for CR1 (P(Type I)=.065) and CR2 (P(Type I)=0.012); horizontal axis
from 0 to 1, vertical axis from 0.000 to 1.000.]

The important concepts are:
1.         Size is the probability of Type I error (rejecting a true maintained).
Power is 1 minus probability of Type II error. A type II error is
rejecting a true alternative (accepting a false maintained).
2.         Every test is powerful for some alternative hypotheses and not powerful
for other alternative hypotheses.
3.         Reducing the size of the test reduces the power of the test against any
particular alternative (illustrated by CR1 and CR2, or CR3 and CR4, or
CR5 and CR6).
4.         For some statistical hypothesis tests, there are two tests one of which
has higher power for some alternatives and the other has higher power
for the remaining alternatives. CR3 and CR6 have the same size. CR3
is more powerful for alternative probability of heads less than .5 and
CR6 is more powerful for alternative probability of heads greater than
.5. CR4 and CR5 have larger size than CR3 and CR6 but have a
similar comparison for alternatives less or greater than .5.
The critical regions CR3, CR4, CR5 and CR6 are often called one sided. The
critical regions contain only small or only large number of heads. Such one sided critical
regions are powerful for only large or only small alternative probability of heads. For
example, CR3={0,1,2} is more powerful for the alternative hypotheses of small
probabilities of heads while CR6={9,10,11} is more powerful for the alternative
hypotheses of large probabilities of heads.
Summary of POWER. Understanding POWER explains why we would use more
than one test. For example, the Ramsey test may have 1,2,3,4,… terms. Why use more
than just 2 terms? To increase the power of the specification test, albeit at the cost of
changing the size (since doing 1,2,3,4… terms means that you are doing sequential statistical
tests, not just one test). A Ramsey test with 2 terms will be more powerful for some alternative
hypotheses, while a Ramsey test with 4 terms will be more powerful for some other
alternatives. Explaining which statistical test(s) to use is our only use of POWER.

REGRESSION TESTS – THE T TEST

For a regression, we might wish to test whether some independent variable has a
statistically significant effect on the dependent variable: Income on Consumption,
rebounds on percent win, number of competitors on sales, gender on wages, high school
rank on financial aid, etc. We usually test statistical significance by testing whether the
coefficient of the variable is equal to 0.
In OLS regression, with all 7 classical assumptions true, and the additional
assumption that the corresponding coefficient is ZERO, the reported T-statistic for a
coefficient is an observation of a random variable that has a T-distribution. The T-
distribution is due to W.S. Gosset, who worked for Guinness brewery and wrote under
the name Student – hence often the T is called Student's T distribution.
The maintained hypothesis does not specify the remaining coefficients nor the
variance of the equation σ² – they can be any value. The maintained hypothesis is
complex and the alternative hypothesis is complex.
Recall that with the 7 classical assumptions in a simple, one variable regression,
the OLS estimator is distributed as

$$\hat{\beta}_1 \sim \mathrm{Normal}\left(\beta_1,\ \frac{\sigma^2}{\sum x_i^2}\right)$$

We derive a random variable from this fact which has a T-distribution:

$$\frac{(\hat{\beta}_1 - \beta_1) / \sigma}{\sqrt{\left(\hat{\sigma}^2 / \sum x_i^2\right) / \sigma^2}} \sim T(n - K - 1)$$

Note that the σ on the top cancels with the square root of the σ² on the bottom and
the only unknown in the formula is β1.
The T-distribution has one parameter, unfortunately called Degrees of Freedom
(DOF). A rough guide for the value of the DOF parameter is to subtract the number of
coefficient estimates from n. In a simple regression there are two coefficient estimates:
the intercept and the coefficient of the single variable. The DOF is n-2. In a K variable
regression, there are K+1 coefficients to estimate (β0, β1, β2, β3, …, βK) and the DOF is
n-(K+1) = n-K-1.
The display above has a T-distribution if the maintained hypothesis (all 7
assumptions) is true. It does not necessarily have a T-distribution if there are any
violations of any of the 7 assumptions.

The quantity

$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / \sum x_i^2}}$$

cannot be reported by a statistics program because β1 is unknown. The reported
T-statistic is

$$\frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / \sum x_i^2}}$$

It will have a T-distribution if all 7 assumptions are true and additionally β1=0.
The T-test is to determine a critical region for the T-statistic – values of the T-
statistic for which you REJECT the maintained hypothesis that all 7 classical assumptions
are true and β1=0. T-tests can have one sided or two sided critical regions. To determine
the critical region, you must choose a size for the test – the probability of a Type I error –
the probability that you reject a true maintained hypothesis. What size you choose is
your own choice. It is common to have sizes of 0.01, 0.05 or 0.10. In fact in reporting
regression results, generally one reports whether the observed t-statistic is in a 10%, or
5% or 1% critical region. Note that if the observed t-statistic is in the 1% region, it is
certainly in the 5% and 10%.
In most cases, we report significance rather than stating ‘we reject the maintained
hypothesis at the 5% level’. We state ‘the coefficient is statistically significant at 5%’.
The meaning is the same – namely that the reported T-statistic is in the 5% critical
region. You would report significant at 1% and it is understood that it is also significant
at 5% and 10%.
Below is a plot of the density of a T-distribution for 10 degrees of freedom. For
10 degrees of freedom there is 2.5% of the distribution below – 2.228 and 2.5% of the
distribution above 2.228. I found 2.228 on page 585 (Critical values of the t-distribution)
in row 10 observations and column 2.5% one sided. The blue area shows a two sided
critical region 5% test. (see http://www.stat.tamu.edu/~west/applets/tdemo.html for one
sided areas).

If the T-distribution for 31 degrees of freedom was used then the probability that
you would observe a random variable (with a T-distribution) below -2.0395134464 is
0.025 and similarly above +2.0395134464 is 0.025. So there is a 5% chance that you
would observe a T-random variable below -2.0395134464 or above +2.0395134464. If
you sampled 1,000,000 T-random variables with 31 degrees of freedom, then
approximately 25,000 would be below -2.0395134464 and approximately 25,000 would
be above +2.0395134464.
See http://calculators.stat.ucla.edu/cdf/student/studentcalc.php
(http://calculators.stat.ucla.edu/ reported being down as of 9-17-2012).
Also see
http://surfstat.anu.edu.au/surfstat-home/tables/t.php
http://www.tutor-pages.com/Statistics-Calculator/statistics_tables.html
http://bcs.whfreeman.com/ips4e/cat_010/applets/statsig_ips.html significance
http://bcs.whfreeman.com/ips4e/cat_010/applets/power_ips.html power
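The critical values quoted above can also be computed in Excel, in the same way BINOMDIST
was used earlier. This is a sketch that assumes the TINV(probability, degrees of freedom)
worksheet function, which takes the total probability in both tails:
    =TINV(0.05,10)    about 2.228   (two sided 5%, 10 degrees of freedom)
    =TINV(0.05,31)    about 2.0395  (two sided 5%, 31 degrees of freedom)
For a one sided critical value, double the size before passing it: =TINV(0.10,31) gives the
one sided 5% value for 31 degrees of freedom, about 1.696.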
Example of T-test: Gender discrimination
To be more explicit, consider a gender discrimination case. The plaintiff contends
that males are discriminated against while the defense contends not. Below is a (partial)
estimation output in the case:
Variable         Coefficient   Std. Error t-Statistic   Prob.
GENDER           -3.848931     1.863662 -2.065251       0.0473

The variable GENDER is 1 for males, and 0 for females. The negative coefficient
means that if the individual is male (GENDER=1) then the dependent variable is
estimated to be 3.848931 less than if the individual is female.
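The reported t-Statistic in this output is the coefficient divided by its standard error,
which is the reported T-statistic described earlier (the estimate over its estimated standard
deviation). A quick check of the arithmetic:
$$t = \frac{-3.848931}{1.863662} \approx -2.06525$$
which agrees with the reported -2.065251.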
The reported Prob. of 0.0473 is the size of a critical region [-∞,-2.065251]
[+2.065251,+∞] which uses the reported T-statistic to determine the critical region. The
reported Prob. value is often called a p-value. There is 0.02365 (=.0473/2) probability
that you would observe a T-random variable below the reported -2.065251 and 0.02365
probability that you would observe a T-random variable above the reported +2.065251
for the T-distribution with 31 degrees of freedom.
If we test the hypothesis that GENDER has no effect at 5%, we REJECT the
maintained hypothesis – the conjunction of all 7 assumptions plus β=0. Why? Because
the reported p-value is less than 0.05. The critical region for a 5% test is below -2.0395
and above 2.0395. The reported T-statistic is in the critical region. For a 1% test, we
accept the maintained hypothesis. For a 1% test, the critical region is [-∞,-2.744]
[+2.744,+∞] . Our observed T-statistic is not in the critical region.
One can always use the reported p-value to test. If your size selection is greater
than the p-value, REJECT, and if your size selection is less than the p-value, ACCEPT.
Commonly, the p-value is reported. If it is small, the reader will know that the
maintained is rejected and if the p-value is large, the reader will know the maintained is
accepted.
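Applied to the GENDER output above, with its p-value of 0.0473, the rule works out as follows
(a worked sketch for the three common sizes):
    size 0.10:  0.10 > 0.0473, REJECT
    size 0.05:  0.05 > 0.0473, REJECT
    size 0.01:  0.01 < 0.0473, ACCEPT
In Excel, the two sided p-value should be available from =TDIST(2.065251,31,2), assuming the
TDIST(x, degrees of freedom, tails) worksheet function, which takes the absolute value of the
T-statistic. With the 31 degrees of freedom stated above it should be close to the reported
0.0473.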
The t-test for a coefficient = 0 is claimed to be powerful against alternative
hypotheses in which the classical 7 assumptions are true but the particular coefficient is
not 0. But we do not know its power unless we specify a particular value for the
coefficient. If the alternative is a large absolute value of the coefficient (say 1,000) then
the power is greater than if the alternative is a smaller absolute value of a coefficient (say
10). We also know that increases in size increase power and smaller sizes have less
power. A 1% test has less power for any specific alternative than does a 5% test.
One sided tests:
The critical region [-∞,-2.065251] [+2.065251,+∞] is two sided. We could have
specified a one sided critical region for the 5% test. There are two one sided critical
regions:
MINUS=[-∞,-1.696] and
PLUS=[1.696,∞] .
A one sided test must specify which side. For our example, the reported
T-statistic = -2.065251 is in the critical region MINUS and is not in the critical region
PLUS. The maintained hypothesis is identical for the two: all 7 classical assumptions
and β=0.
The critical region PLUS is more powerful for alternatives with β>0 and the
critical region MINUS is more powerful for alternatives with β<0.
Often, the one sided tests are phrased with the maintained hypothesis having one
side and the alternative the other side. E.g., Using HM and HA for the maintained and
alternative hypothesis:
Test 1. HM: β≥0 versus HA: β<0. Use the critical region MINUS.
Test 2. HM: β≤0 versus HA: β>0. Use the critical region PLUS.
Note that using the critical region MINUS implies that any positive T-statistic will
accept the maintained hypothesis. Using the critical region PLUS implies that any
negative T-statistic will accept the maintained hypothesis.
One sided tests are specified where the a priori evidence or theory predicts a
positive or a negative coefficient. For example, testing the marginal propensity to
consume would be one sided HM: β≤0 versus HA: β>0 while testing the slope of the
demand curve would be one sided HM: β≥0 versus HA: β<0. Theory tells us that the
marginal propensity to consume is positive. Test 2 (critical region PLUS) has a
maintained of no effect or negative effect – 0 or negative would reject the Keynesian
theory. For a demand curve, theory and a huge amount of prior evidence indicates that
demand slopes downward. We use test 1 (critical region MINUS) so that if we estimate a
positive slope, we accept the maintained (there is no relationship between price and
quantity plus the classical 7 assumptions).
For a one sided test, you divide the reported p-value by 2 to report the
significance level for the one sided test. Recall the two sided p-value is the percent of the
distribution below the reported t-Statistic plus the percent of the distribution above the
reported t-Statistic. Two sides! Dividing by two yields one side.
Variable        Coefficient   Std. Error t-Statistic   Prob.
GENDER          -3.848931     1.863662 -2.065251       0.0473

The reported p-value is 0.0473. For the one sided test, the significance is
0.0473/2 = 0.02365 (the percent of the distribution in one tail, e.g., below the reported
t-Statistic). For one sided test 1, you reject the maintained hypothesis (β≥0) at 3% but not
at 2%. For one sided test 2, you accept the maintained hypothesis at any significance level.
Why use a one sided over a two sided test? The power will be greater for a one
sided test of the same size than for a two sided test. The left graph illustrates a one sided
test size = 0.05 while the right illustrates a two sided size=0.05. The one sided power is
0.113 versus the two sided power 0.0078.

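[Figure: left graph, one sided test with size 0.05; right graph, two sided test with size 0.05;
critical regions are labeled CR.]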

Notice that the power for a one sided alternative in this case is 0.113 while for the
two sided alternative it is 0.00708 – about 20 times smaller.
The further the alternative is from the maintained, the smaller the difference in
power between one sided and two sided tests. Above, the alternative was 0.1. Below, the
alternative is 1.2 and the power for both tests is 1.0, at least to 4 significant digits.

[Figure: the same one sided and two sided tests with the alternative at 1.2; critical regions marked CR.]

The main points are:
1.         If a priori theory or evidence indicates a sign of the coefficient, use a
one sided test because it is more powerful.
2.         The further away the alternative is from the maintained, the greater the
power. One sided and two sided tests obtain essentially identical power for
alternatives far from the maintained.

STATISTICAL SIGNIFICANCE
For most regression coefficients, economists state whether the coefficient is
statistically significant at some level (10%, 5%, 1%). SIGNIFICANT means
DIFFERENT FROM ZERO, i.e., the maintained hypothesis that β=0 is rejected. They
(economists) do not say 'REJECT the coefficient is 0'. They say the coefficient is
statistically significant! In many reports, the coefficients are labeled with a '*', '**', or
'***' to indicate significance at 10%, 5%, or 1%.
For some analyses, such as a gender discrimination case, rather than state
significance, economists may revert to the 'accept/reject' language. For example,
Variable          Coefficient   Std. Error t-Statistic   Prob.
GENDER            -3.848931     1.863662 -2.065251       0.0473
we would say that gender discrimination is accepted at the 5% level or that we reject no
gender discrimination at the 5% level. There is no difference in the language: 'the
coefficient is statistically significant at 5%' is identical to 'rejecting the coefficient is 0 at
5%' or 'accepting the coefficient is not 0 at 5%'.
Note that when you REJECT the maintained hypothesis, you are REJECTing the
conjunction of the 7 assumptions PLUS the assumption that the particular coefficient is
ZERO. The alternative is a very large place – assumption 1, 2, 3, 4, 5, 6 or 7 could be
untrue while the particular coefficient is ZERO or 1,2,3,4,5,6, and 7 may be true while
the particular coefficient is not ZERO or every assumption may be untrue.

F-TEST
The coefficient F-test (in Eviews the Wald coefficient restriction test) tests
whether multiple coefficients are SIMULTANEOUSLY equal to 0 (or some other value).
The reported F-statistic in regression output and its p-value test whether all of the
coefficients in the regression, except the intercept, are equal to 0 simultaneously.
Sometimes, if we reject the maintained (all 7 classical and all coefficients are 0) we say
the regression is significant. The coefficient F-test is powerful for the alternative that all
7 classical assumptions are true and some or all of the coefficients are not 0.
A useful formula for the F-test uses R^2 (see formula 5.14 on page 155):

\hat{F} = \frac{ESS/K}{RSS/(n-(K+1))} = \frac{ESS/TSS}{(TSS-ESS)/TSS} \cdot \frac{n-(K+1)}{K} = \frac{R^2}{1-R^2} \cdot \frac{n-K-1}{K}

For the financial aid example,
R-squared              0.764613           Mean dependent var            11676.26
Adjusted R-squared     0.749262           S.D. dependent var            5365.233
S.E. of regression     2686.575           Akaike info criterion         18.70654
Sum squared resid      3.32E+08           Schwarz criterion             18.85950
Log likelihood         -463.6635          F-statistic                   49.80764
Durbin-Watson stat     2.301406           Prob(F-statistic)             0.000000

(0.764613 / 0.235387) * (46 / 3) = 3.248323 * 15.33333 = 49.80762, which is not exactly the
reported 49.80764 due to rounding errors. The R^2 at maximum precision is
0.764613059688, and
(0.764613059688 / 0.235386940312) * (46 / 3) = 3.248324052 * 15.33333 = 49.80764, which is the
reported value.
Testing a subset of the coefficients equal to 0 also uses an F-statistic. That a
subset of the coefficients are equal to 0 is a restriction (or set of restrictions) on the
equation. Consider a house price equation from chapter 11:

Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Variable          Coefficient  Std. Error  t-Statistic  Prob.
c(1)  C           153.4732     32.72537    4.689731     0
c(2)  AGE         -0.41988     0.267725    -1.56833     0.1261
c(3)  BATH        -10.7504     15.00847    -0.71629     0.4787
c(4)  BED         -1.37532     9.37739     -0.14666     0.8843
c(5)  CA          -2.58036     13.65486    -0.18897     0.8512
c(6)  N           -30.698      5.525107    -5.5561      0
c(7)  S           0.10713      0.020243    5.292093     0
c(8)  SP          -10.1114     11.82685    -0.85495     0.3986
c(9)  Y           0.004618     0.001569    2.944022     0.0058
R-squared            0.915694   Mean dependent var     242.3023
Adjusted R-squared   0.895858   S.D. dependent var     79.2415
S.E. of regression   25.5721    Akaike info criterion  9.504646
Sum squared resid    22233.7    Schwarz criterion      9.873269
Log likelihood       -195.35    F-statistic            46.16177
Durbin-Watson stat   1.502265   Prob(F-statistic)      0

As AGE, BATH, BED, CA (central air) and SP are all statistically insignificant
at the 10% level individually, we test that their coefficients are all equal to 0
simultaneously or jointly. This is done in Eviews by clicking View in the regression
output window and selecting COEFFICIENT TESTS and then Wald Coefficient restrictions. In
Eviews, you have to enter the restrictions by forming an equation with C(i) where i is the
number of the coefficient. In this case, c(2)=0,c(3)=0,c(4)=0,c(5)=0,c(8)=0.

Regression excluding AGE, BATH, BED, CA and SP (the restricted equation):

Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Variable  Coefficient  Std. Error  t-Statistic  Prob.
C         117.4655     19.98546    5.877548     0
N         -29.1998     5.139596    -5.68134     0
S         0.102644     0.009319    11.01483     0
Y         0.004117     0.00144     2.858676     0.0068
R-squared            0.904789   Mean dependent var     242.3023
Adjusted R-squared   0.897465   S.D. dependent var     79.2415
S.E. of regression   25.37404   Akaike info criterion  9.393739
Sum squared resid    25109.84   Schwarz criterion      9.557571
Log likelihood       -197.965   F-statistic            123.5382
Durbin-Watson stat   1.67979    Prob(F-statistic)      0

The maintained hypothesis is that all 7 classical assumptions are true, and the
coefficients of AGE, BATH, BED, CA and SP are 0 while the other coefficients are
allowed to be any value. The Wald test F-Statistic has a reported p-value of 0.505181,
which is larger than .10 (or .05 or .01), and we accept the maintained hypothesis.
In some cases it is easier to use the R^2 formula to calculate the F-Statistic – for
example when there are 49 coefficients to test equal to 0. The R^2 formula for testing a
subset of coefficients equal to 0 is

\hat{F} = \frac{R_U^2 - R_R^2}{1 - R_U^2} \cdot \frac{n-K-1}{r}
where r is the number of coefficients being tested, R_U^2 is the R^2 of the equation with all
the variables (unrestricted) and R_R^2 is the R^2 of the equation excluding the variables
whose coefficients are to be tested (the equation is restricted to have some coefficients
equal to 0). In the house price equation, R_U^2 = .915694. The regression without the
variables (the restricted equation, reported above) has R_R^2 = .904789.
The number of restrictions (variables to be tested) is 5, and n-K-1 is 43-8-1 = 34.
The calculated F-statistic is 0.879581524446663. We can look up the p-value in a
statistical calculator http://rockem.stat.sc.edu/prototype/calculators/index.php3?dist=F

[Figure: the blue area is the critical region for a 50.52% test.]

Why use the R2 formula? It may avoid much typing and possible typing mistakes.
Why use the Eviews WALD test? It is easy and avoids looking up the p-value for an F-
statistic.
We do not calculate the power of the test. We know that the power is greater for
alternative coefficient values which are farther from the maintained β=0 and the power is
greater for greater sizes. We do not know its numerical value. Most empirical work will
never report the power.
The Eviews reported Chi-square test is an asymptotic test, but it can be used for
estimations other than OLS, and in OLS when the errors are not assumed normal. Its
defects are that it is an asymptotic test (it is exactly correct only with infinite data)
and that it is less powerful than the F-statistic when the 7 classical assumptions are
true. Its advantage is that it does not rely on normality.

RAMSEY RESET TEST
Our first econometric test is the RAMSEY RESET (page 198 in the text). This
test is not used very much in the econometric literature. Few packages calculate it
directly, and correcting the model when the Ramsey rejects is difficult. It is a
specification test and if your model fails the Ramsey test, you have a problem.
The maintained hypothesis (as in almost all tests that we consider) is the 7
classical assumptions – nothing more. RESET stands for REgression Specification
Error Test. The RAMSEY RESET is powerful against omitted variables (assumption 1),
incorrect functional form (assumption 1), and correlation between X and the error of the
equation (assumption 3). We use the Ramsey RESET test on all equations that we
estimate. If we reject the maintained, we are rejecting one or more of the classical 7
assumptions. As the Ramsey is powerful for failures of assumptions 1 and 3, our
suspicion is that assumption 1 is not true – functional form.
Unfortunately, as with many econometric tests, the Ramsey RESET is an
asymptotic test, unlike the t-test and F-test for coefficient restrictions which are exact
tests. The way that the RESET test is constructed, one does not know its exact finite
sample distribution. What is known is that IF the data were infinite, the test would have
the claimed distribution under the maintained hypothesis.
The RAMSEY test is calculated via a two stage regression. The first regression is
standard and the second is often called auxiliary. The first regression regresses the
dependent on the independents. The predicteds from that regression are calculated, and
then the auxiliary regression regresses the dependent on all the independents and the
square, cube, fourth power, etc. of the predicted dependent variable.
The first regression obtains
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i} + \hat{\beta}_4 X_{4i}
The auxiliary regression is
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 \hat{Y}_i^2 + \beta_6 \hat{Y}_i^3 + ... + \varepsilon_i
If all 7 assumptions are true (in particular assumption 1), then β5 and β6 are
theoretically 0. If we have the correct specification, the predicteds squared, cubed,
etc. should have no explanatory value in the regression. Their coefficients are 0. The
Ramsey RESET test jointly tests whether the coefficients on the predicted squared,
cubed, etc. are equal to 0. However, the predicted squared, cubed, etc. are not true
independent variables. They are observations of random variables. That means that the
standard F-test for coefficients jointly equal to 0 is not correct: the F-statistic, even if all
7 assumptions are true, does not follow an F-distribution except in the limit of infinite data. The
report of the Ramsey test has a test statistic called the Log Likelihood Ratio statistic. It is
an asymptotic test and the reported statistic is a Chi-square statistic. No matter for
Eviews – we can simply view the p-values reported for the test.

The test requires YOU to indicate the number of fitted items. One fitted item
includes the predicted squared, two fitted items includes the predicted squared and cubed,
three fitted items includes the predicted squared, cubed, and to the fourth power, and so
on. The suggestion in the literature is to use 2, 3, and/or 4 fitted items. If Ramsey rejects
at 2 items, stop. If Ramsey accepts at 2 fitted items, try 3 and then 4. Accept the
maintained only if Ramsey does not reject at 2, 3, or 4 fitted items.
With more fitted items (a higher degree polynomial in the fitted), there is a greater
'chance' of detecting some misspecification. That is, more fitted items means greater
power of the test. However, more fitted items may reduce the power for a given size of
the test because in order to achieve that same size, the critical region is smaller. This is a
technical issue which has little intuitive explanation. What is done (when it is done) is to
reject the maintained if any Ramsey with 2, 3, or 4 fitted items rejects.
In the house price example,

Ramsey RESET Test:
2 fitted terms
F-statistic            5.892869   Probability  0.006623
Log likelihood ratio   13.48361   Probability  0.001181
3 fitted terms
F-statistic            4.00073    Probability  0.016165
Log likelihood ratio   14.07235   Probability  0.002808
4 fitted terms
F-statistic            3.093637   Probability  0.030273
Log likelihood ratio   14.85007   Probability  0.005022

Note that we do NOT look at the F-statistic value or its probability. Eviews
reports that F-statistic but it is NOT distributed as an F-distribution, even asymptotically.
The Log likelihood ratio statistic is distributed as a Chi-square statistic asymptotically.
The maintained hypothesis is that all 7 classical assumptions are true. The Ramsey
rejects the maintained (powerfully) with violations of functional form, omitted variables,
and right hand side variables correlated with the error of the equation. It can also reject
because the data set is not infinite.
The literature indicates that a rather large size should be used for the Ramsey test. It
is better to reject a true maintained than to accept a false maintained. Thus the
econometrician is willing to have higher type I error to gain more power (hence lower
type II error). If the Ramsey rejects, the regression coefficients (and all the associated
statistics) are suspect. If the Ramsey errantly accepts the maintained, you would be led to
incorrectly use the coefficients of the regression.
If the Ramsey rejects, the first suspicion is incorrect functional form. The
econometrician then tries other functional forms. There is a caution: too much 'fitting'
can make a bad regression. The chance of getting an acceptance rises with the number of
regressions you run. Run enough regressions and you might errantly accept the
maintained. Econometrics is an art in that the individual researcher has to use her own
judgment.
Finally, few empirical papers use the Ramsey test. My suspicion is that most (at
least many) regressions published will obtain a Ramsey rejection. Researchers do not do
the Ramsey because when it fails, it is hard to find a cure. With other econometric tests,
there are solutions to the misspecification. For this class, you WILL do Ramsey tests.
To continue with the house price example (house11),
Dependent Variable: P
Method: Least Squares
Sample: 1 43
Included observations: 43
Variable  Coefficient  Std. Error  t-Statistic  Prob.
C         51.54109     25.08319    2.054806     0.0468
1/N       99.17609     15.71945    6.309134     0
S         0.031709     0.032921    0.96317      0.3416
Y         0.004617     0.001375    3.357931     0.0018
S^2       1.94E-05     8.79E-06    2.202611     0.0338
R-squared            0.915341   Mean dependent var     242.3023
Adjusted R-squared   0.90643    S.D. dependent var     79.2415
S.E. of regression   24.23937   Akaike info criterion  9.322778
Sum squared resid    22326.8    Schwarz criterion      9.527569
Log likelihood       -195.44    F-statistic            102.7153
Durbin-Watson stat   2.033318   Prob(F-statistic)      0
Ramsey test for 2, 3 and 4 fitted items
Log likelihood ratio   1.537558   Probability  0.463579
Log likelihood ratio   3.121154   Probability  0.37332
Log likelihood ratio   3.422794   Probability  0.489714
I changed the specification so that N, the quality of the neighborhood, entered as a
reciprocal. That alone still resulted in a Ramsey rejection. Adding S^2 (square feet of
the house squared) obtained a Ramsey acceptance. Whether this is THE correct
specification I do not know, but I can argue that the Ramsey accepted.
See Ramsey, J. B. and A. Alexander (1984) “The Econometric Approach to
Business-Cycle Analysis Reconsidered,” Journal of Macroeconomics, 6, 347–356, for a
full account of the power of the test.

```
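
The arithmetic in the one sided test and F-test sections above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative and do not come from the notes or from Eviews, and the sample size n = 50 for the financial aid example is inferred from the 46/3 factor in the worked calculation (n - K - 1 = 46 with K = 3), not stated in the notes.

```python
# Sketch only: function names are illustrative, not from the notes or Eviews.


def one_sided_p_value(two_sided_p, estimate_in_critical_direction=True):
    """Convert a reported two sided t-test p-value to a one sided p-value.

    If the estimate lies on the side of the critical region, the one sided
    p-value is half the two sided value; otherwise it is 1 minus that half.
    """
    half = two_sided_p / 2.0
    return half if estimate_in_critical_direction else 1.0 - half


def overall_f(r2, n, k):
    """F-statistic that all K slope coefficients are 0, computed from R^2."""
    return (r2 / (1.0 - r2)) * ((n - k - 1) / k)


def subset_f(r2_unrestricted, r2_restricted, n, k, r):
    """F-statistic that r of the K slope coefficients are 0.

    Uses the R^2 of the unrestricted and restricted regressions; k is the
    number of slope coefficients in the unrestricted regression.
    """
    numerator = r2_unrestricted - r2_restricted
    denominator = 1.0 - r2_unrestricted
    return (numerator / denominator) * ((n - k - 1) / r)


# GENDER coefficient: reported two sided p-value 0.0473, negative estimate.
p_minus = one_sided_p_value(0.0473)          # test 1, critical region MINUS
p_plus = one_sided_p_value(0.0473, False)    # test 2, critical region PLUS

# Financial aid example: R^2 = 0.764613, K = 3, n = 50 (assumed, see lead-in).
f_aid = overall_f(0.764613, n=50, k=3)

# House price example: 5 coefficients tested, K = 8, n = 43.
f_house = subset_f(0.915694, 0.904789, n=43, k=8, r=5)

print(p_minus, p_plus, f_aid, f_house)
```

With the rounded R^2 values from the tables, the two F-statistics come out at about 49.8076 and 0.8796, in line with the hand calculations in the notes. The p-value for an F-statistic still has to be looked up separately, for example in the statistical calculator mentioned in the notes.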