Hypothesis Tests with Means
Sampling Distribution of the Mean

• The distribution of the means of an infinite number of samples of size n
• This could be with respect to a single mean, a difference between independent means, etc.
• Note: This forms the basis for generating critical values, p-values, etc.
Central Limit Theorem

• The sampling distribution of the mean approaches normal, and this tendency increases with N
• The mean of the sampling distribution (μX̄) equals μ, and the variance of the sampling distribution (σ²X̄) equals σ²/N
Extensions of the C.L.T.

• Regardless of the shape of the original raw score population, as N increases the distribution of sample means will approach normal
• For a sample size of 30 or greater, even an extremely skewed raw score population will generate a normally distributed sampling distribution of the mean
Z-test for a Single Mean (σ Known)

• z = (X − μ) / σ was used to test the relative position of a score (X) within a distribution of scores with mean μ and standard deviation σ
• z = (X̄ − μ) / σX̄ is used to test the relative position of a mean within a sampling distribution of means with mean μ and standard deviation σX̄ = σ/√n
Z-test for a Single Mean
Example
• Dr. Brown is interested in whether a group of students who have followed strict vegetarian diets (n = 25) since early childhood will have significantly different IQs than the general population (α = .01)
• Ho: μv = 100
• Ha: μv ≠ 100
Z-test for a Single Mean
Example
• X̄ = 91
• z = (X̄ − μ) / σX̄ = (91 − 100) / (15 / √25) = −3
• The probability of obtaining a test statistic more extreme than z = −3 is .0013
• With α = .01 we can conclude that the vegetarian diet group scored significantly lower than the general population in IQ (i.e., .0013 < .005)
  ◦ Thought: What would a good effect size be?
How can I do a One-Sample z test in R?

iq <- c(95, 105, 105, 108, 85, 87, 83, 80, 120, 72,
        107, 94, 90, 67, 71, 63, 84, 100, 83, 87, 102, 103,
        90, 101, 93)
install.packages("TeachingDemos")
library(TeachingDemos)
z.test(iq, mu = 100, sd = 15)
◦ Output:
  One Sample z-test
  data: iq
  z = -3, n = 25, Std. Dev. = 15, Std. Dev. of the sample mean = 3, p-value = 0.0027
t-test for a Single Mean

• The logic of the z-test for a single mean and the t-test for a single mean is the same, although the t-test is adopted when we do not have information about the population standard deviation
• One of the consequences of using s² as an estimate of σ² is that the sampling distribution is no longer standard normal (unless n is large)
Standard Normal and t Distributions

• More specifically, for small sample sizes s² will often underestimate σ², resulting in a test statistic (t) that is larger than what would have been found if we (could have) used the true value of σ²
• Therefore, the t-distribution is flatter (heavier-tailed) than the standard normal (especially for small n) and requires larger test statistic values for significance
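• The heavier tails can be seen directly by comparing two-tailed critical values in R (a quick sketch; the df values are arbitrary):

  # Two-tailed critical values at alpha = .05: t vs. standard normal
  qt(.975, df = c(5, 10, 30, 100))   # 2.571 2.228 2.042 1.984
  qnorm(.975)                        # 1.960; t approaches this as df grows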
Student’s t-distribution
t-test for a Single Mean (σ Unknown)

• t = (X̄ − μ) / sX̄ is used to test the relative position of a mean within a sampling distribution of means with mean μ, standard deviation sX̄ = s/√n, and df = n − 1
• The statistical decision rule we use is to reject Ho: μ = μ0 if |t| ≥ tα,df (for a two-tailed test)
t-test for a Single Mean
Example
• Dr. Brown is interested in whether a group of students who have followed strict vegetarian diets (n = 25) since early childhood will have significantly different IQs than the general population (α = .01)
• Ho: μv = 100
• Ha: μv ≠ 100
t-test for a Single Mean
Example
• X̄ = 94.36, s = 9.76
• t = (X̄ − μ) / sX̄ = (94.36 − 100) / (9.76 / √25) = −2.89
• The one-tailed probability of obtaining a test statistic more extreme than t = −2.89 with n = 25 is .004 (and the one-tailed critical value is t.01,24 = −2.492)
• With α = .01 we can conclude that the vegetarian diet group scored significantly lower than the general population in IQ
How can I do a One-Sample t test in R?

iq <- c(85, 105, 105, 95, 85, 107, 83, 80, 110, 92,
        107, 94, 90, 94, 94, 80, 84, 100, 83, 87, 102, 103, 90,
        111, 93)
t.test(iq, mu = 100, alternative = "less")
◦ Output:
  One Sample t-test
  data: iq
  t = -2.8896, df = 24, p-value = 0.004027
  alternative hypothesis: true mean is less than 100
◦ Note: alternative = "less" requests the one-tailed test that matches this output; the default is a two-tailed test
Effect Size

• The fact that an effect is statistically significant does not necessarily mean that the effect has any “practical significance”
• As N increases, the probability of finding even minute differences between means statistically significant approaches 1
• Effect size measures help to clarify the meaning of significant or nonsignificant effects
t-test for a Single Mean:
Effect Size
• To quantify how large the difference is between the sample mean and the hypothesized value, Cohen’s d can be used:

  d = (X̄ − μ) / s
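• As a sketch, d for the vegetarian-diet example can be computed directly in R (using the iq vector entered in the one-sample t slide above):

  # One-sample Cohen's d: d = (xbar - mu) / s
  (mean(iq) - 100) / sd(iq)    # about -0.58; a medium effect in absolute value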
Guidelines for Interpreting Cohen’s d
• Cohen provided the following guidelines for interpreting d:
  ◦ .20-.50 is a small effect
  ◦ .50-.80 is a medium effect
  ◦ .80+ is a large effect
• Therefore, for our experiment we can conclude that the effect is statistically significant and has a moderate effect size
  ◦ Note that even the raw difference between the sample mean and the hypothesized value could be used as an appropriate measure of effect size
Confidence Intervals for the One-Sample t-test
• Confidence interval: if samples of size n are drawn repeatedly from a population and a 95% CI is calculated from each sample, then 95% of these intervals should contain the population mean
• 100(1 − α)% CI = X̄ ± tα,df · sX̄
• For our previous example:
  ◦ 99% CI = 94.36 ± 2.797 (1.95) = {88.91, 99.81}
• The fact that the CI does not include 100 confirms our previous hypothesis-testing conclusion
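• The same interval is available from t.test (a sketch, again using the iq vector from the one-sample t example):

  # 99% confidence interval for the mean
  t.test(iq, mu = 100, conf.level = .99)$conf.int
  # roughly 88.9 to 99.8, which excludes the hypothesized value of 100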
Single Sample Inference with Nonnormal Distributions
• Valid use of the one-sample t-test requires that
  ◦ A) The observations are independent of one another (random sampling from the population)
  ◦ B) The population distribution is normal in form
• When the population distribution is not normal, the probability of spurious results (Type I/Type II errors) increases (and this tendency is greatest with small n)
  ◦ Thought: Why does the CLT not completely fix this problem?
• In this case we can use a nonparametric test
Wilcoxon’s Signed Rank Test
• When the distribution shape is nonnormal we can use the signed rank test to make more valid inferences
• Let’s say we want to determine if the average IQ of Psych grad students is greater than 110 (Ho: μ = 110; H1: μ > 110)
  ◦ Data: 115, 123, 128, 116, 106, 91, 113
  ◦ Differences (−110): 5, 13, 18, 6, −4, −19, 3
  ◦ Ordered (regardless of sign): 3, −4, 5, 6, 13, 18, −19
  ◦ Signed ranks: 1, −2, 3, 4, 5, 6, −7
Wilcoxon Signed Ranks Example
• We take the smaller of the absolute values of the sum of the positive ranks (19) and the sum of the negative ranks (9) as our test statistic (T)
  ◦ Thus, we compare T = 9 to our critical value in Appendix T (which is based on n, α, and a one-tailed/two-tailed test)
  ◦ At α = .05 (one-tailed) we would need a T of 3 or less to reject Ho (hence we do not reject the null hypothesis and conclude that grad student IQs do not differ from 110)
How can I do a Wilcoxon Signed Ranks test in R?

iq <- c(115, 123, 128, 116, 106, 91, 113)
wilcox.test(iq, mu = 110, alternative = "greater")
◦ Output:
  Wilcoxon signed rank test
  data: iq
  V = 19, p-value = 0.2344
  alternative hypothesis: true location is greater than 110
Two Independent-Samples t-test

• The two independent-samples t-test is much more common in empirical studies than the one-sample t-test
• The primary reason is that we rarely have information about population means (or even legitimate comparison values), so we compare two sample means (where often one group is a control group)
Sampling Distribution of the Difference between Means

• The mean of the sampling distribution of the difference between means is μ1 − μ2, and the variance is σ²1/n1 + σ²2/n2
• From this we can deduce that the formula for the two independent-samples t is:

  t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(s²1/n1 + s²2/n2)
Two Independent-Samples t
• However, if we assume that the variances are equal (more on that to come), we can take a weighted average of the variances (i.e., compute a pooled estimate of the variance, s²p)
• This statistic will be:

  t = (X̄1 − X̄2) / √(s²p (1/n1 + 1/n2))
Two Independent-Samples t

• This statistic is distributed as t with n1 + n2 − 2 degrees of freedom
• The assumptions of this statistic are:
  ◦ 1. Subjects are randomly and independently selected from their respective populations
  ◦ 2. Population variances are equal
  ◦ 3. Population distributions are normal in form
Two Independent-Samples t
Example
• Dr. Stein would like to know if there are motivational differences between students in an 8:30 a.m. class and students in an 11:30 a.m. class (α = .10)
• She posts a notice for students to sign up to fill out a questionnaire on “achievement motivation”
• 5 and 15 students from the 8:30 and 11:30 classes, respectively, show up to fill out the questionnaire
Two Independent-Samples t
Example
• Results (achievement motivation scores):
  ◦ 8:30: 12, 1, 14, 15, 2
  ◦ 11:30: 6, 4, 4, 6, 3, 6, 3, 7, 5, 6, 5, 4, 5, 4, 6
• X̄1 = 8.8, s1 = 6.76
• X̄2 = 4.93, s2 = 1.22
• Ho: μ1 = μ2, H1: μ1 ≠ μ2
• df = n1 + n2 − 2 = 5 + 15 − 2 = 18
Two Independent-Samples t
Example
• Decision rules:
  ◦ If |t| ≥ tα,df then reject Ho
  ◦ If |t| < tα,df then do not reject Ho
• Computing the test statistic:
  ◦ s²p = ((5 − 1)(45.7) + (15 − 1)(1.49)) / 18 = 11.31
  ◦ sX̄1−X̄2 = √(11.31 (1/5 + 1/15)) = 1.74
  ◦ t = (8.8 − 4.93) / 1.74 = 2.23
  ▪ Note that if we were using R, the two-tailed p-value would have been .039
Two Independent-Samples t
Example
• t.10,18 = 1.734 (two-tailed critical value)
• Therefore, since our obtained t (2.23) is greater than our critical t (1.734), we reject the null hypothesis and conclude that motivation scores are significantly higher in the 8:30 a.m. class than in the 11:30 a.m. class
• Using our two-tailed p-value (.039), we would reach the same conclusion because .039 < .10
How can I do a two independent-samples t test in R?

motiv <- c(12, 1, 14, 15, 2, 6, 4, 4, 6, 3, 6, 3, 7,
           5, 6, 5, 4, 5, 4, 6)
class <- rep(c("8:30", "11:30"), c(5, 15))
t.test(motiv ~ class, var.equal = TRUE, conf.level = .9)
◦ Output:
  Two Sample t-test
  data: motiv by class
  t = -2.2257, df = 18, p-value = 0.03906
  alternative hypothesis: true difference in means is not equal to 0
  90% confidence interval: -6.8792855 -0.8540479
◦ Note: t is negative because R orders the factor levels alphabetically ("11:30" before "8:30"), so R computes X̄11:30 − X̄8:30
Confidence Interval for the Difference between Means

• 90% CI = (X̄1 − X̄2) ± tα,df · sX̄1−X̄2
• For our previous example:
  ◦ 90% CI = (8.80 − 4.93) ± 1.734 (1.74) = {0.85, 6.89}
• The fact that the CI does not include 0 confirms our previous conclusion that the difference between the means differs from 0
Cohen’s d

• For population parameters, d = (μ1 − μ2) / σ
• For sample statistics, d = (X̄1 − X̄2) / sp,
  ◦ where sp represents the square root of the pooled variance
Cohen’s d Example
• From our previous example we can calculate d = (8.8 − 4.93) / √11.31 = 1.15, which would be considered a large effect
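• A sketch of this computation in R, using the motiv and class vectors defined in the two-sample t slide above:

  # Two-sample Cohen's d: d = (xbar1 - xbar2) / sp
  x1 <- motiv[class == "8:30"]
  x2 <- motiv[class == "11:30"]
  sp2 <- ((length(x1) - 1) * var(x1) + (length(x2) - 1) * var(x2)) /
         (length(x1) + length(x2) - 2)    # pooled variance
  (mean(x1) - mean(x2)) / sqrt(sp2)       # about 1.15, a large effect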
Measures of Association Strength
• Eta-squared (η²), or the squared point-biserial correlation (r²pb), provides a useful measure of the proportion of variability in the dependent variable that can be accounted for by variability in the independent variable
• η² = r²pb = t² / (t² + df)
• For our example, η² = 4.97 / (4.97 + 18) = .22
Interpreting Eta-Squared

• Cohen suggested the following guidelines for interpreting eta-squared:
  ◦ .01-.05 is a small association
  ◦ .06-.14 is a medium association
  ◦ .15+ is a large association
• Therefore, 22% of the variability in motivation can be attributed to the difference in the times of the classes, which we can interpret as a large effect
Omega-Squared

• Some authors have reported that the eta-squared statistic is biased and have instead recommended a modified version of the eta-squared statistic, called omega-squared (ω²)
• ω² = (t² − 1) / (t² + df + 1)
• For our example: ω² = 3.97 / 23.97 = .17 (which would still be considered a large effect)
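• Both measures follow directly from the t statistic; a sketch using the obtained t = 2.2257 and df = 18 from the R output above:

  # Eta-squared and omega-squared from the two-sample t result
  t_obt <- 2.2257; df <- 18
  t_obt^2 / (t_obt^2 + df)              # eta-squared, about .22
  (t_obt^2 - 1) / (t_obt^2 + df + 1)    # omega-squared, about .17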
The Variance Homogeneity Assumption
• In computing the two independent-samples t test in the previous example, one of the assumptions was that the variances of the two groups were equal
• However, recall that s²1 = 45.7 and s²2 = 1.49
• What effect does this have on our test statistic? Not much, unless the sample sizes are unequal
Sample Size & Variance Heterogeneity
• When both sample sizes and variances are unequal, the t-test can become severely biased (with respect to Type I and Type II error rates)
• Why is this? The pooled variance is weighted by sample size, so the variance of the larger group dominates the estimate:

  s²p = ((n1 − 1)s²1 + (n2 − 1)s²2) / (n1 + n2 − 2)
Positively and Negatively Paired Sample Sizes and Variances

• Liberal test: when the larger n is paired with the smaller s² (and the smaller n is paired with the larger s²), the empirical Type I error rate is inflated
• Conservative test: when the larger n is paired with the larger s² (and the smaller n is paired with the smaller s²), the empirical Type I error rate is deflated
How to Detect Variance Heterogeneity (Levene)
• Levene’s test: Levene developed a test of variance homogeneity that tests the null hypothesis that the group variances are equal
  ◦ Therefore, a significant Levene test (i.e., p ≤ α, where α is usually set at .10) indicates that the variances are not equal
• The ‘lawstat’ package in R has an excellent levene.test function that provides modified versions of the test based on the median or trimmed mean (a sketch follows below)
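• A minimal sketch with the motivation data (assuming the lawstat package is available; the median-based version is the Brown-Forsythe variant):

  # Levene's test of variance homogeneity, median-based version
  install.packages("lawstat")
  library(lawstat)
  levene.test(motiv, as.factor(class), location = "median")
  # p <= .10 would indicate that the group variances are unequal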
How to Detect Variance Heterogeneity (Variance Ratios)
• Another way to determine if the variances are unequal is simply to look at the ratio of the largest to smallest variance
• Ratios larger than 2:1 (for unequal ns) or 4:1 (for equal ns) indicate variance heterogeneity
• Note that both methods for detecting variance heterogeneity are affected by nonnormality (although the Levene test is less affected, especially when used with the median or trimmed mean)
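• For the motivation example the ratio is easy to check (a sketch, using the motiv and class vectors defined earlier):

  # Largest-to-smallest variance ratio by group
  vars <- tapply(motiv, class, var)
  vars                     # about 1.5 (11:30) vs. 45.7 (8:30)
  max(vars) / min(vars)    # about 31:1, far beyond the 2:1 guideline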
Welch (1938) Test Statistic
• A statistic developed by Welch can be used to test for mean equality when the variances of the two groups are not equal
• The Welch statistic is reported in SPSS as the two independent-samples t with “equal variances not assumed”

  t′ = (X̄1 − X̄2) / √(s²1/n1 + s²2/n2)
Sampling Distribution of t′
• The sampling distribution of t′ is very difficult to completely determine, although t′ is approximately distributed as t with df′ degrees of freedom, where:

  df′ = (s²1/n1 + s²2/n2)² / [ (s²1/n1)² / (n1 − 1) + (s²2/n2)² / (n2 − 1) ]
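• Plugging the motivation-example values into these formulas (a sketch; the variances and ns come from the earlier slides) reproduces the result in the next slide:

  # Welch t' and df' computed from the formulas above
  v1 <- 45.7; n1 <- 5      # 8:30 class
  v2 <- 1.49; n2 <- 15     # 11:30 class
  (8.8 - 4.93) / sqrt(v1/n1 + v2/n2)    # t' = about 1.27
  (v1/n1 + v2/n2)^2 /
    ((v1/n1)^2 / (n1 - 1) + (v2/n2)^2 / (n2 - 1))   # df' = about 4.09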
Welch t for the Motivation Example

• If we compare the motivation of the 8:30 and 11:30 a.m. classes using t′, we find that t′ = 1.27 with df′ = 4.10
• Rounding off df′ to 4, we have a critical t of 2.132
• Therefore, with the Welch t, there is no significant difference
Welch t for the Motivation Example
• Why might we find a significant difference between the 8:30 and 11:30 a.m. classes on motivation with the Student t, but not with the Welch t?
• Take a look at the pattern of the sample sizes and variances (negatively paired):
  ◦ 8:30 class: n = 5, s1 = 6.76
  ◦ 11:30 class: n = 15, s2 = 1.22
• The significant independent-samples t test result may have been a Type I error
How can I do a two independent-samples Welch t test in R?

motiv <- c(12, 1, 14, 15, 2, 6, 4, 4, 6, 3, 6, 3, 7,
           5, 6, 5, 4, 5, 4, 6)
class <- rep(c("8:30", "11:30"), c(5, 15))
t.test(motiv ~ class, conf.level = .9)
◦ Output:
  Welch Two Sample t-test
  data: motiv by class
  t = -1.2721, df = 4.088, p-value = 0.2709
  alternative hypothesis: true difference in means is not equal to 0
  90% confidence interval: -10.307118 2.573785
Welch t or Student t?
• When variances are equal, the Welch t (t′) is only slightly less powerful than the Student t
• Further, when variances are unequal, the Welch t maintains Type I error rates at the nominal level (i.e., α)
• So why do we not always use Welch?

				