Module 08 - STAT 101 by nuhman10


									                     STAT 101, Module 8:
  Statistical Testing, null hypotheses, test statistics, p-values
                       (Book: chapter 10)

   At the end of the last module we talked about values of μ that are
    compatible with the data in terms of their X̄ and s. In this module we
    expand on these thoughts and develop the vocabulary and
    argumentation methods of statistical testing.
   Example 1: A manufacturer of consumer electronics would like to
    know how many households intend to purchase a computer next year.
    Management hopes that the proportion is greater than 10% in order to
    justify sales projections. Does the survey that shows 14% willingness
    to purchase support their claim? Could the good news be the result of
    chance?
     Since 10% is a crucial threshold it makes sense to use p = 0.10 as
     what we call a null hypothesis and check whether the data from their
     survey is compatible with this assumption.
   Example 2: Suppose you sample 25 students from the Penn class of
     2006 and observe that their average SAT is 1380 with an SD of 125.
     An admissions officer claims the average SAT is at least 1420. You
     are surprised at the inconsistency, but then again, this assertion might
     be compatible with the data. One could use μ = 1420 as a null
     hypothesis and check how compatible it is with the numbers from the
     sample.
   Example 3: It is claimed that a coin is fair. A sample of 144 tosses
    results in 64 heads. Is this compatible with the assumption of a fair
    coin? The natural null hypothesis is p=0.5 which defines fairness.
   Example 4: At the end of Module 7 we mentioned elections. Again, a
    rate of 0.50 of favorable likely voters is critical to the claim of being
    ahead in the polls. Therefore p = 0.50 is a natural null hypothesis.
   Example 5: It is known that a standard surgery requires a mean
    hospital stay of 5.4 days. A new and less invasive type of surgery is
     said to require only 3.3 days in the hospital on average. How sure are
     we that the new method actually does require fewer hospital days?
     It would appear natural in this case to play “devil’s advocate” and
     check whether the null hypothesis of a population mean of 5.4
     hospital days is actually compatible with the data for the new method
     that seem to have a sample average of 3.3. If it is compatible with the
     data, maybe the case for the new method is not strong enough, or one
     needs more data.
     Note that because of the long experience with the old type of surgery
     one can assume N ≈ ∞, hence 5.4 can be seen as a population mean.
     For the new type of surgery there will be much less experience and N
     will be small at this exploratory stage, hence 3.3 should be interpreted
     as X̄.

The Components of Statistical Testing
   Statistical testing can be used in several ways:
        o Statistical testing can be a Socratic game: Allow someone to
          make an assertion, and play along till it leads to an apparent
          absurdity… or not. Example: the admission officer’s assertion
          about SAT averages of students admitted by Penn.
        o Statistical testing can be a devil’s advocate game: Assume an
          undesirable scenario, and try to show it probably isn’t so.
          Example: the assumption that the new type of surgery is no
          better than the old type.
        o Statistical testing can be used to check whether a norm is
          likely to be satisfied. Example: examining fairness of a coin.
           Note that we tend to cast statements in vague terms: “probably”, “likely”.
           The reason is that statistical testing quantifies uncertainty about
           conclusions. Science never deals in absolute certainties, although some
           conclusions can reach certainty beyond reasonable doubt.

   Null hypotheses: A null hypothesis is an assumption about a
    population quantity. Note the “population” part. Null hypotheses
    are never about actually observed statistics computed from data.
Population values are the targets estimated by sample values, and the
sample values are used for inference about population values.
The two fundamental methods of statistical inference are:
1) confidence intervals and
2) statistical tests.
We consider only population means μ and population proportions p
(=probabilities), and the only type of assumption we will consider is
that μ or p take on a specific value of interest.
What these hypothesized values are depends on the context: If it is
about testing fairness of a coin or commanding a majority in the polls,
the natural null hypothesis is p=0.5. If the business plan asks for a
minimum demand of 10% of households, then p=0.1 is the natural
null hypothesis. If a new type of surgery is asserted to shorten
hospital stays, then the devil’s advocate says the mean reduction is
zero.
       One could also consider null hypotheses about population standard
       deviations σ, and this is done, but it is much less important. Below we
       will consider differences in population means and population proportions
       between groups, and finally null hypotheses about population slopes in
       regression.
Notation for null hypotheses:
       H0: μ = μ0          and          H0: p = p0
where μ0 and p0 are the assumed population values.
In the case of the new type of surgery, we could let μ be the
population mean of hospital days with the new procedure, so the null
hypothesis of no improvement over the old type of surgery is that both
types have the same population average:
       H0: μ = 5.4
When it comes to testing fairness of coins or claims to majorities in
polls, the null hypothesis is:
       H0: p = 0.5
   o Reminder: H0: X̄ = 5.4 is completely mistaken. The quantity X̄ will lend
     evidence about μ, but it cannot be the subject of a null hypothesis.
   o Why “null” hypothesis? There is another type called “alternative
     hypothesis”, hence “null” is opposed to “alternative”. The alternative
     hypothesis is essentially “not the null hypothesis”. There are subtleties
         about alternative hypotheses that we will not discuss here (two-sided
         versus one-sided alternatives: Ha: μ ≠ μ0 and Ha: μ > μ0).

 Test Statistics: A test statistic computed from data provides evidence
  for or against the null hypothesis. It is not too difficult for us to
  devise a test statistic for the above null hypotheses. Similar to
  confidence intervals, the ideas center on the deep fact that means vary
  across datasets, that their variation can be quantified by the standard
  error σ(X̄), and that σ(X̄) can be estimated from a single dataset by
  the standard error estimate stderr = s(X)/√N.
   Let us have another look at the graphs at the end of Module 7:
To play the game of testing a null hypothesis, we assume that it is true
and that the data have the hypothesized population mean μ0. We then
check how extreme the estimate X̄ of μ0 is in light of the distribution
of X̄:
   o In the first graph above, X̄ is less than two standard errors away
     from μ0. This is counted as compatible with the null hypothesis
     that μ has this particular value.
   o In the second figure, X̄ is more than two standard errors away.
     One judges this X̄ to be too unlikely under the null hypothesis
     and hence incompatible with it.
In light of the CLT, a good test statistic would be a Z-score formed
under the null hypothesis:
$$ z = \frac{\bar{X} - \mu_0}{\sigma(\bar{X})} $$
If z is more extreme than ±2 (that is, > +2 or < –2), we will say:
we reject H0. What we really mean is: H0 (the assumption that μ =
μ0) is not very compatible with the data.
An obvious problem is that while μ0 is specified by the null
hypothesis, the standard deviation σ(X) of the data and hence the
standard error σ(X̄) = σ(X)/√N are not specified and hence need to
be estimated. The result is what is called the t-statistic:
$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} $$

where stderr(X̄) = s(X)/√N is the standard error estimate as usual.
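The formula can be sketched in a few lines of Python. This is only an illustration: the function name `t_statistic` and its argument names are made up here, and the numbers are the summary values from Example 2 (sample mean 1380, s = 125, N = 25, hypothesized mean 1420).

```python
import math

def t_statistic(xbar, s, n, mu0):
    """Distance of the sample mean from mu0, measured in standard errors."""
    stderr = s / math.sqrt(n)          # standard error estimate s / sqrt(N)
    return (xbar - mu0) / stderr

# Summary numbers from Example 2 (Penn SAT scores).
t = t_statistic(xbar=1380, s=125, n=25, mu0=1420)
print(t)   # -1.6
```

A negative sign simply means that the sample mean lies to the left of the hypothesized value.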

   o We can think of the t-statistic as a change of units in X̄: make
     μ0 the new origin of the scale and make stderr the new unit.
     If t = 1.5, then X̄ is 1.5 stderr to the right of μ0. Therefore, |t|
     measures the distance of X̄ from μ0 in multiples of stderr.
   o |t| is a measure of evidence against the null hypothesis: if |t|
     > 2, we “reject the null hypothesis” (although see what
     follows on exact cut-offs).
 Null Distribution: The probability distribution of the test statistic t
  assuming H0: μ = μ0 is called the null distribution. Note it is a
  hypothetical distribution, literally. It is used to judge what values of
  t and hence of X̄ should be considered as giving evidence for or
  against μ0. Large values |t| will count as evidence against μ0.
   Now that we have replaced the denominator σ(X̄) of z with the
   quantity stderr which is no longer a constant but a random variable,
   the probability distribution of the resulting t has changed: If the
   observations themselves are normal, the random variable z is normal,
   but the random variable t is no longer exactly normal. It has what is
   called “Student’s t-distribution” (recall the story of “Student” aka Gosset at
   the Guinness Brewery in 1908). The t-distribution becomes very nearly
   normal for large N, but for N <60, the cut-off value, which should be
   the 97.5% quantile, is greater than 2 and grows as N gets smaller.
   Here is one more time the table from Module 7, where we included
   N=∞, which is the normal distribution:
          N:          10       15         20          30         40
          t0.975 :   2.26     2.14       2.09        2.05       2.02

          N:          50       60         75         100         ∞
          t0.975 :   2.01     2.00       1.99        1.98       1.96

   Using these “exact” cut-offs, we say we reject H0 when |t| > t0.975. The
   union of the two intervals (–∞, –t0.975) and (t0.975, +∞) is called the
   rejection region. The interval (–t0.975, t0.975) is called the “non-
   rejection region”.
          Purists are against using the term “acceptance region”, hence it’s “non-
          rejection region”. Nicer terminology would use the words “incompatible”
          and “compatible”, which is what μ0 and X̄ are depending on where t falls.
   In the next graph below, the part of the axis with the gray area is the
   rejection region, the part in between is the non-rejection region.

   The t-statistic is always reported in null hypothesis testing. When you
   see it, check it against the rough cut-offs ±2, but be aware that JMP
   and all other software use the t-quantiles as in the above table; they
   are exact if the observations are normally distributed. If the data are
   not normally distributed (as for discrete and skewed distributions),
   even the t-distribution is only an approximation. Visually, the t-
   distribution is indistinguishable from the normal distribution, except
   when N is extremely small. The following figure shows the t-density
   function for N=20.

 Significance Levels: The choice of boundaries at the 2.5% and 97.5%
  quantiles of the null distribution amounts to a test at the significance
  level α =5%, or simply at the 5% level. The significance level α is
  the tail probability that defines the cut-off values, approximately ±2.
  In the figures above, the gray areas denote the 5% tail probability α,
  divided into two areas of α/2 = 2.5% each.
The choice of 5% is a convention that can be changed. The
significance level of 5% is the most frequent choice, but when the
evidence against the null hypothesis is required to be more stringent
in order to reject it, one chooses a significance level of 1% or even
lower. In this case, the quantiles for the t-distribution are as follows:

      N:             10        15        20        30       40
      t0.995:      3.25      2.98      2.86      2.76     2.71

      N:             50        60        75       100       ∞
      t0.995:      2.68      2.66      2.64      2.63     2.58

It appears that cut-offs ±2⅔ are a good and conservative choice for
testing at the 1% significance level. Again, all software, including
JMP, uses the “exact” quantiles of the t-distribution.
In general, for a given significance level α, one uses the (1–α/2)-
quantile of the t-distribution as a cut-off. That is:
        Reject H0 at the significance level α   ⇔   |t| > t1–α/2
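For readers who want to see where such cut-offs come from, here is a rough sketch that uses only the Python standard library. It is not how JMP computes quantiles: the density is integrated numerically with Simpson's rule and the quantile is found by bisection. The function names are invented for this illustration, and df = N – 1 is assumed (the convention visible in the JMP output later in this module, where N = 144 gives df = 143).

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def t_cutoff(alpha, n):
    """Approximate (1 - alpha/2)-quantile of the t-distribution, df = n - 1."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 50.0                 # assumes the cut-off lies below 50
    for _ in range(50):                # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, n - 1) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for alpha in (0.05, 0.01):
    print(alpha, round(t_cutoff(alpha, 20), 2))
```

For N = 20 the two printed cut-offs should agree with the tabulated values 2.09 and 2.86 up to rounding.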

   o The lower the significance level α, the larger is the non-
     rejection region, and the less likely is rejection of the null
     hypothesis.
   o It is possible that we can reject at the 5% level, but not at the
     1% significance level. This is the case if |t| is between 2 and 2.6:
     2 < |t| < 2.6 means rejection at the 5% level, but not at the 1% level.
   o The significance level α is also called the “Probability of a
     Type 1 Error”. A Type 1 Error is the rejection of the null
     hypothesis when it is in fact true. But the probability that this
     happens is exactly α, by construction:
                P( rejection of H0 at the level α | H0 is true )
                = P( |t| > t1–α/2 | H0 is true ) = α
         See the red box above. So far we have discussed α as a tail
         probability, which it is: under the null hypothesis, α is the
         probability of seeing a value of the t-statistic more extreme
         than the cut-off t1–α/2.
         (There is a notion of Type 2 Error, which is not-rejecting H0 when H0 is in
         fact false. This is a more difficult concept and we will not explain it.)
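The statement P( |t| > t1–α/2 | H0 is true ) = α can be checked by simulation. The sketch below makes several assumptions of its own: samples of size N = 20 are drawn from a normal population for which H0 holds by construction, the cut-off 2.09 is the tabulated value for N = 20, and the function name, the seed, and the number of trials are arbitrary choices.

```python
import math
import random

def type1_error_rate(n=20, cutoff=2.09, trials=20000, seed=1):
    """Fraction of samples, drawn with H0 true, for which |t| > cutoff."""
    rng = random.Random(seed)
    mu0 = 0.0                          # H0: mu = mu0 is true by construction
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, 1.0) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - mu0) / (s / math.sqrt(n))
        if abs(t) > cutoff:
            rejections += 1            # a Type 1 Error: H0 rejected although true
    return rejections / trials

rate = type1_error_rate()
print(rate)
```

The printed rate should land near 0.05, with some simulation noise. Passing the 1% cut-off for N = 20 (`cutoff=2.86`) should give a rate near 0.01 instead.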

 P-Values: The p-value is the achieved significance level. The idea
  behind the p-value starts with the following question:
         What would be the significance level for which the observed X̄
         and t would be exactly on the cut-off?
   That significance level is the p-value. The situation is depicted here:
  o The p-value is a measure of evidence in favor of H0: μ = μ0.
     If the p-value falls below α, we say there is insufficient
     evidence in favor of H0: μ = μ0, hence:
     Reject H0 at the significance level α   ⇔   p-value < α
  o The p-value is a random variable, even though it is calculated
    as a hypothetical probability assuming H0: μ = μ0 is true.
  o The p-value is a transformation of |t| to the 0-1 range:
            p-value = 1   ⇔   μ0 = X̄
            p-value = 0   ⇔   | μ0 – X̄ | = ∞
      o The p-value is the hypothetical probability of observing a value
        of t more extreme than the one in hand. If this hypothetical
        probability is small, it means the value of t in hand is extreme
        under H0. Hence we reject H0.
      o Why p-values are so popular: They allow testing a null
        hypothesis at all conceivable significance levels. Once we
        know the p-value, we know how to answer if someone asks for
        a test at the 5% level, at the 1% level, at the 0.5% level… The
        answer is always: if the p-value is below the significance level
        α, we reject H0 at the significance level α.
      o We see, therefore, that a p-value of 0.02 allows us to reject at the
        5% level, but not at the 1% level.
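As an illustration of how a two-sided p-value is obtained from t, the sketch below integrates the t-density numerically with Simpson's rule, using only the Python standard library. The helper names are invented here, df = N – 1 is assumed, and statistical software uses more refined numerical methods. The example value t = –1.6 with N = 25 is taken from Example 2.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """P(|T| > |t|) under H0, where T has a t-distribution with df degrees of freedom."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):          # Simpson's rule on [0, |t|]
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3   # probability of the interval (-|t|, |t|)
    return 1.0 - central_area

p = two_sided_p_value(-1.6, df=24)     # Example 2: N = 25, so df = 24
for alpha in (0.05, 0.01):
    print(alpha, "reject H0" if p < alpha else "do not reject H0")
```

Once `p` is known, the decision at any significance level is a single comparison, which is the point of the rule p-value < α.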

 Confused? That’s ok. Here are handy rules for real life:
      o Reject H0 at the 5% significance level if the p-value is below
        0.05. This never fails.
      o If the t-statistic is < –2 or > +2, expect rejection, but in border-
        line cases where the t-statistic is very near +2 or –2, recall that
        the cut-offs ±2 are not exact, hence trust the p-value.
      o Keep in mind that statistical testing is a “what if” game. It
        starts with “what if μ = μ0?” and checks what the consequences
        are in light of the data. Rejection of μ = μ0 means that this
        assumption is not compatible with the data.

 Confidence Intervals with Coverage Probability 1-α:
   Logically equivalent to rejection at the 5% significance level is μ0
   falling outside the “exact” 95% CI (provided by the software). The rough
   CI = (X̄ ± 2 stderr) is usually correct but may fail in borderline cases
   when |X̄ – μ0| ≈ 2 stderr. The “exact” CI with coverage probability
   1–α is:
         CI1–α = ( X̄ – t1–α/2 · stderr, X̄ + t1–α/2 · stderr )
   Therefore, a rough 99% confidence interval is X̄ ± 2⅔ · stderr.
  The general connection between α-level testing and (1–α)-CIs is:
          Reject H0 at the significance level α   ⇔   μ0 ∉ CI1–α
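The equivalence can be illustrated with a small sketch. It assumes the rough cut-off 2 in place of an exact t-quantile and uses the Example 2 numbers (sample mean 1380, stderr 25). The function names and the trial values of μ0 other than 1420 are made up for the illustration.

```python
def confidence_interval(xbar, stderr, cutoff=2.0):
    """Interval xbar +/- cutoff * stderr."""
    return (xbar - cutoff * stderr, xbar + cutoff * stderr)

def rejects(xbar, stderr, mu0, cutoff=2.0):
    """True if |t| exceeds the cut-off, i.e. H0: mu = mu0 is rejected."""
    return abs((xbar - mu0) / stderr) > cutoff

low, high = confidence_interval(1380, 25)
print(low, high)

# Each hypothesized value is rejected exactly when it lies outside the CI.
for mu0 in (1300, 1340, 1420, 1440):
    outside = not (low <= mu0 <= high)
    print(mu0, rejects(1380, 25, mu0), outside)
```

The last two columns agree for every μ0 as long as the same cut-off is used for the test and for the interval.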

 Testing Means in JMP:
         Analyze > Distribution > (select Y,Columns) > OK
         (click tiny red triangle icon, next to variable name) Test Mean
         > (enter the values μ0 or p0 to be tested in the upper field) > OK

  Here is Example 3, the problem of testing fairness of a coin (H0:p=.5)
  where 64 heads in 144 flips were observed (Sim Dice and Coin
         Moments
         Mean                 0.4444444
         Std Dev              0.4986384
         Std Err Mean         0.0415532
         upper 95% Mean       0.5265823
         lower 95% Mean       0.3623066
         N                          144

         Test Mean=value
         Hypothesized Value         0.5
         Actual Estimate        0.44444
         df                         143
         Std Dev                0.49864

         t Test
         Test Statistic         -1.3370
         Prob > |t|              0.1834
         Prob > t                0.9083
         Prob < t                0.0917

         [Figure: null distribution, horizontal axis from .40 to .60]

  JMP gives you a picture of the null distribution with the area of the p-
  value colored in blue. Note that it is centered at the hypothesized
  population mean 0.5, shown also in the numeric output. We see the
  mean or proportion twice: among moments and below the
  hypothesized value.
  o Our two-sided p-value is written as “Prob > |t|”. Its value is
    0.1834. Since it is not below 0.05, we do not reject the null
    hypothesis.
         Our p-value is followed by two one-sided p-values for which we have no
         use; they are associated with one-sided alternative hypotheses.
  o The “Test Statistic” is the t-statistic (it can be the z-statistic if the
    standard deviation is known). Its value –1.337 is between ±2,
    hence again no rejection.
   o The CI (0.362, 0.527) contains the hypothesized value 0.5, hence
     yet again no rejection.
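The moments and the test statistic in this output can be reproduced by hand. The sketch below assumes the data are coded as 64 ones (heads) and 80 zeros (tails) and uses the sample standard deviation with divisor N – 1; the variable names are arbitrary.

```python
import math
import statistics

flips = [1] * 64 + [0] * 80            # 64 heads in 144 flips, coded 1/0
n = len(flips)

mean = statistics.fmean(flips)         # sample proportion of heads
sd = statistics.stdev(flips)           # sample standard deviation (divisor N - 1)
stderr = sd / math.sqrt(n)             # standard error estimate
t = (mean - 0.5) / stderr              # t-statistic under H0: p = 0.5

print(round(mean, 7), round(sd, 7), round(stderr, 7), round(t, 4))
```

These four numbers should match the Mean, Std Dev, Std Err Mean, and Test Statistic lines of the JMP output up to rounding.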

 Example 1: Recall the manufacturer’s target is a take rate in excess
  of 10%, and the survey says the rate of self-declared intent of purchase is
  14% of the households. Since 10% is the critical border line, we take
  H0: p=0.10 as the null hypothesis, and the question to be answered is
  whether the observed proportion p̂ = 0.14 lends evidence against H0.
  To proceed, we need one more piece of information: the sample size,
  which happens to be N= 500. At the end of Module 7 we saw that the
  standard error estimate for the proportion is
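$$ \mathrm{stderr}(\hat{p}) = \left( \frac{\hat{p}\,(1-\hat{p})}{N} \right)^{1/2} = \left( \frac{0.14 \cdot 0.86}{500} \right)^{1/2} = 0.0155 $$
   hence the test statistic is
$$ \frac{\hat{p} - p_0}{\mathrm{stderr}(\hat{p})} = \frac{0.14 - 0.10}{0.0155} = 2.58 . $$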
   Now this is fortunate: 2.58 is greater than 2. Hence we can reject the
   assumption that the true population proportion is 10%.
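The arithmetic of this example can be checked with a short sketch. The function name is invented here, and the standard error is computed from the observed proportion exactly as in the formula above.

```python
import math

def proportion_test_statistic(p_hat, p0, n):
    """Standard error estimate of p_hat and the resulting test statistic."""
    stderr = math.sqrt(p_hat * (1 - p_hat) / n)
    return stderr, (p_hat - p0) / stderr

stderr, stat = proportion_test_statistic(p_hat=0.14, p0=0.10, n=500)
print(round(stderr, 4), round(stat, 2))   # 0.0155 2.58
```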

 Example 2: The null assumption in the Penn student SAT problem is
  H0: μ = 1420, the assertion made by the admission official. He/she
  may have made the assertion based on the complete census of Penn
  students; we wouldn’t know, it’s just a very specific assertion. Our
  evidence is rather scant: a random sample of N=25 students with a
  sample mean X̄ = 1380 and a standard deviation s = 125. Hence the
  standard error estimate is s/√N = 125/5 = 25. The test statistic is
$$ t = \frac{\bar{X} - \mu_0}{\mathrm{stderr}(\bar{X})} = \frac{1380 - 1420}{25} = -1.6 $$
   The absolute value 1.6 is clearly below 2, hence the assertion that the
   population mean of SAT scores is 1420 is compatible with the data.
   A problem is of course that the sample is so small. With a larger sample
   we’d have a better chance to refute the admission official.
 Example 4: Can a candidate with 56% of likely voters in his/her favor
  brag that he/she has a majority? We need to know the sample size of
  the survey. If it is N=961, then stderr = (.56 · .44 / 961)^½ = 0.016, and
  the test statistic is (.56 – .50)/.016 = 3.75. The p-value would be
  0.0001874, which is smaller than all conventional significance levels,
  so the claim of a majority is a pretty sure thing. It means that if the truth is still p=0.50,
  then one would find a value as extreme as 56% in fewer than 2 out of
  10,000 surveys of size N=961.
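A quick check of these numbers, as a sketch only: the p-value below uses the normal approximation from Python's `statistics.NormalDist` instead of the t-distribution, so it comes out slightly smaller than the 0.0001874 quoted above. The variable names are arbitrary.

```python
import math
from statistics import NormalDist

p_hat, p0, n = 0.56, 0.50, 961

stderr = math.sqrt(p_hat * (1 - p_hat) / n)   # about 0.016
z = (p_hat - p0) / stderr                     # about 3.75

# Two-sided tail probability under the standard normal distribution.
p_value_normal = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(stderr, 3), round(z, 2), p_value_normal)
```

Either way the tail probability is below 2 in 10,000, consistent with the conclusion in the text.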
