POSTED ON: 3/8/2011
One-way Between Subjects Analysis of Variance

Multiple t-tests vs. ANOVA
▪ Example: Dr. Smith wants to compare the perfectionism scores of grad students in Psychology, Sociology, History and Math
▪ Why do we (usually) run a one-way ANOVA to test for mean differences instead of running C = k(k-1)/2 t-tests (k = number of groups, j = 1,...,k)?
▪ The answer you probably received was that it controls the overall Type I error rate ... but is this correct?

Multiple t-tests vs. ANOVA
▪ It is true that if you run k(k-1)/2 = 4(4-1)/2 = 6 tests to compare the programs (each at level α), you have approximately a 1-(1-α)^C (≃ Cα) chance of making at least one Type I error (about .26 when α = .05), but what if we impose some type of multiplicity control?
  – For example, if we conduct each of the 6 t-tests at αPC = α/C, then the overall Type I error rate won't exceed α
  ▸ In fact, many pairwise multiple comparison procedures (that locate pairwise mean differences) are intended to be used without an omnibus test (more on that later)

Multiple t-tests vs.
ANOVA
▪ Another important distinction between the omnibus ANOVA F test and multiple t-tests is that even when the ANOVA F is not significant, there can still be significant pairwise differences (even when controlling for inflated Type I errors)
  ▸ The ANOVA F pools all of the mean differences, and therefore it is not a precise test of whether there are any significant pairwise mean differences

One-way ANOVA
▪ Even though the one-way ANOVA may not be a necessary tool for analysis, it remains popular because:
  ▸ It provides an easy way to summarize null findings
  ▸ It illustrates the use of a pooled error term
  ▸ It provides the basis for more complicated omnibus tests, such as tests of higher-order interactions
  ▸ Some multiple comparison procedures do require that an omnibus test be used as a preliminary test of the existence of mean differences

One-way Between-Subjects ANOVA Model
▪ Yij = μ + τj + εij, where
  – Yij is the score of the ith subject (i = 1,...,nj) in the jth group (j = 1,...,k)
  – μ is the population grand mean
  – τj is the fixed treatment effect for the jth group (μj - μ)
  – εij is the random error component for the ith subject in the jth group (Yij - μj)
▪ Assumptions:
  – εij ∼ NID(0, σ²) (more on that to come)
▪ Hypotheses:
  – Ho: μ1 = μ2 = ... = μk
  – H1: The population means are not all equal
  – Recall why H1: μ1 ≠ μ2 ≠ ... ≠ μk is not correct (Ho is false as soon as any one mean differs from another; the means do not all have to differ)

Partitioning the Variability
▪ Between Group variability: differences between the mean scores in each group
  ▸ Why do mean differences exist?
    – Effect of the IV on the DV (or relationship between the IV and DV if we are using naturally occurring groups)
    – Error
▪ Within Group variability: variability of the scores within the groups
  ▸ Why do scores within the groups differ?
    – Error

Understanding the One-way ANOVA
▪ ANOVA F (after Fisher) = ratio of between group variability to within group variability

$$F = \frac{\text{Between Group Variability}}{\text{Within Group Variability}}$$

$$F = \frac{s_t^2 + s_E^2}{s_E^2} = \frac{MS_{treatment}}{MS_{error}}$$

Calculations

$$MS_{treat} = \frac{SS_{treat}}{df_{treat}} = \frac{\sum_j n_j (\bar{X}_j - \bar{X}_{..})^2}{k - 1}$$

$$MS_{error} = \frac{SS_{error}}{df_{error}} = \frac{\sum_j \sum_i (X_{ij} - \bar{X}_j)^2}{N - k}$$

$$MS_{total} = \frac{SS_{total}}{df_{total}} = \frac{\sum_j \sum_i (X_{ij} - \bar{X}_{..})^2}{N - 1}$$

ANOVA Summary Table

Source      SS        df        MS                            F
Treatment   SStreat   dftreat   MStreat = SStreat / dftreat   F = MStreat / MSerror
Error       SSerror   dferror   MSerror = SSerror / dferror
Total       SStotal   dftotal

▪ Ho: μ1 = μ2 = ... = μk is rejected if F ≥ Fα,dft,dfe
▪ Recall that the ANOVA F tests the global hypothesis that there are no differences between groups (although differences between groups may still exist!)

Example
▪ A researcher is interested in determining if the fatigue levels (0-15) of married women differ as a function of how they classify their husbands' involvement in housework (not involved, somewhat involved, involved)
▪ Data:
  ▸ Not involved: 9, 12, 4, 8, 7
  ▸ Somewhat involved: 4, 6, 8, 2, 10
  ▸ Involved: 1, 3, 4, 5, 2

Example, cont’d
▪ Ho: μNI = μSI = μI
  ▸ SStot = ΣX² - [(ΣX)²/N] = 629 - [(85)²/15] = 147.33
  ▸ SStreat = {Σj [(ΣXj)²/nj]} - [(ΣX)²/N] = [(40²/5)+(30²/5)+(15²/5)] - [(85)²/15] = 63.33
  ▸ SSerror = ΣX² - {Σj [(ΣXj)²/nj]} = 629 - [(40²/5)+(30²/5)+(15²/5)] = 84.00

Example, cont’d
  ▸ dftot = N - 1 = 15 - 1 = 14
  ▸ dftreat = k - 1 = 3 - 1 = 2
  ▸ dferror = N - k = 15 - 3 = 12
  ▸ MStreat = SStreat / dftreat = 63.33 / 2 = 31.67
  ▸ MSerror = SSerror / dferror = 84 / 12 = 7.00
▪ F = MStreat / MSerror = 31.67 / 7.00 = 4.52
  ▸ F.05,2,12 = 3.89
  ▸ R/SPSS p-value = .034
▪ Therefore, we reject the null hypothesis that the means are all equal

Strength of the Relationship
▪ Recall that a significant F test does not tell us how 'strong' the relationship is
▪ η² = proportion of variability in the DV that can be explained by the IV
▪ η² = SStreat / SStotal
▪ For our example, η² = 63.33 / 147.33 = .43
▪ 43% of the
variability in fatigue is explained by the husbands' involvement in housework (large effect!)

Strength of the Relationship - ω²
▪ η² provides a slightly biased (upwards) estimate of the strength of the relationship between an IV and a DV
▪ Therefore, several authors have recommended the use of ω² as an alternative to η²

$$\omega^2 = \frac{SS_{treat} - (k - 1) MS_{error}}{SS_{total} + MS_{error}} = \frac{63.33 - (2)(7)}{147.33 + 7} = .32$$

Effect Size – f/Φ'
▪ Another useful measure of effect size is Cohen's f (which Howell calls Φ')
▪ f is a standardized mean difference statistic that is very similar to the d-family based RMSSE effect size measure that Howell also presents
  ▸ We use f/Φ' because this measure will also be used in power calculations

Effect Size – f/Φ'

$$f = \phi' = \sqrt{\frac{\sum_j (\mu_j - \mu)^2 / k}{s_e^2}}$$

  ▸ In the absence of useful information for interpreting f, Cohen recommended:
    – Small = .1 - .25, med = .25 - .4, large = .4+

Power Calculations
▪ When planning any study, it is important that we investigate power a priori
▪ As in the two independent samples design, we need an estimated effect size in order to calculate power (note that j = 1, ..., k)

$$\phi' = \sqrt{\frac{\sum_j (\mu_j - \mu)^2 / k}{s_e^2}}$$

Power Calculations, cont’d
▪ When calculating the effect size, it is important to consider what differences among the groups would be meaningful (relative to the variability of the groups)
  ▸ If no information is available for calculating an effect size, Cohen suggests the following for ϕ':
    – .10 = small, .25 = medium, .40 = large
▪ Incorporating the sample size and effect size gives:

$$\phi = \phi' \sqrt{n}$$

Power Calculation Example
▪ Dr. Jones wants to compare three different cultural groups on levels of conservativeness
▪ How much power would Dr.
Jones have with 10 subjects per group, meaningful differences in the means of: European = 4, South Asian = 6, Middle East = 8, and average error variance of 3?

$$\phi' = \sqrt{\frac{\sum_j (\mu_j - \mu)^2 / k}{s_e^2}} = \sqrt{\frac{[(4 - 6)^2 + (6 - 6)^2 + (8 - 6)^2] / 3}{3}} = .942$$

Power Calculation Example, cont’d
▪ Therefore, ϕ = .942 × √10 = 2.98
▪ From Appendix “ncF”
  ▸ ϕ = 2.98, dft = 2, dfe = N - k = 30 - 3 = 27, and α = .05
  ▸ Power = 1 - .01 = .99 (or 99% power)
▪ Note that in order to calculate the n required for a given power we still calculate the meaningful effect size, but then we rearrange the formula for ϕ to solve for n

$$n = \frac{\phi^2}{\phi'^2}$$

Sample Size Calculation Example
▪ Dr. Jones wants to compare three different cultural groups on levels of conservativeness
  ▸ How many subjects would Dr. Jones need in order to have 90% power with meaningful differences in the means of: European = 4, South Asian = 6, Middle East = 8, and average error variance of 3?

$$\phi' = \sqrt{\frac{\sum_j (\mu_j - \mu)^2 / k}{s_e^2}} = \sqrt{\frac{[(4 - 6)^2 + (6 - 6)^2 + (8 - 6)^2] / 3}{3}} = .942$$

Sample Size Calculation Example

$$n = \frac{\phi^2}{\phi'^2} = \frac{2.2^2}{.942^2} = 5.45$$

▪ Note that ϕ (2.2) comes from Appendix ncF, and is the value of ϕ that most closely approximates 90% power for an appropriate error df
▪ Rounding up to n = 6 subjects per group, with ϕ' = .942, G*Power gives us a power of .91
  ▸ Note in G*Power that f = ϕ'

Assumptions
▪ The assumptions required for obtaining a valid F test are:
  ▸ Samples are randomly and independently selected from their respective populations
  ▸ Scores in each population are normally distributed
  ▸ Variances in each population are equal
▪ Note:
  ▸ The independence assumption is extremely important and should be considered in the design of the experiment

Assumptions, cont’d
▪ Consequences of violating assumptions:
  – If sample sizes and variances are unequal, Type I error rates can deviate considerably from α
  – Positively paired ns and σ²s produce a conservative F
  – Negatively paired ns and σ²s produce a liberal F
  – However, with more than two groups it is often more difficult to identify a pattern
  – If data are nonnormal, Type
I error rates may not deviate much from α; however, the power of other procedures may be much higher than that of the F test
  – If data are nonnormal and variances are unequal, the F test becomes severely biased with respect to both Type I and Type II error rates

Alternatives to the ANOVA F
▪ Unequal Variances
  ▸ When variances are determined to be unequal (Levene/variance ratio tests) the omnibus Welch test can be adopted
  ▸ Transformations may also be useful when the means and variances/standard deviations are proportional

Welch Test

$$F' = \frac{\sum_j w_j (\bar{X}_j - \bar{X}'_{.})^2 / (k - 1)}{1 + \frac{2(k - 2)}{k^2 - 1} \sum_j \frac{1}{n_j - 1}\left(1 - \frac{w_j}{\sum_j w_j}\right)^2}$$

$$w_j = \frac{n_j}{s_j^2}, \qquad \bar{X}'_{.} = \frac{\sum_j w_j \bar{X}_j}{\sum_j w_j}$$

$$df' = \frac{k^2 - 1}{3 \sum_j \frac{1}{n_j - 1}\left(1 - \frac{w_j}{\sum_j w_j}\right)^2}$$

Alternatives to the ANOVA F
▪ Nonnormality
  ▸ When the distributions are nonnormal (but similar in shape) and the variances are equal, a nonparametric test (e.g., Kruskal-Wallis) can provide much more power than the ANOVA F
  ▸ Transformations may also be useful to make the distribution shapes more normal
  ▸ Trimmed means also help to reduce the effects of extreme observations

Kruskal-Wallis Test
▪ The Kruskal-Wallis H test is a nonparametric procedure that can be used to compare k independent populations
▪ All N = n1 + n2 + ... + nk observations are jointly ranked (i.e., treated as one large sample when applying the ranks)
  ▸ As with the Mann-Whitney test, tied observations are assigned the average of the ranks they occupy
▪ Calculate T1, T2, ...
Tk, where T is the sum of the ranks for each group

Kruskal-Wallis Test
▪ The null hypothesis is rejected if H is greater than a critical χ² value (df = k - 1)
  ▸ Ho: There are no differences between the groups
  ▸ H1: There are differences between the groups
    – Recall: Kruskal-Wallis is a test of mean differences only if we can assume that the distributions are the same shape and that the variances of the groups are equal

$$H = \frac{12}{N(N + 1)} \sum_j \frac{T_j^2}{n_j} - 3(N + 1)$$

Kruskal-Wallis Example
▪ Four groups of students were randomly assigned to be taught with one of four different techniques, and their achievement scores were recorded. Are the distributions of test scores the same? (i.e., are all the groups the same?)
  ▸ Data (ranks are in parentheses)

Method 1   Method 2   Method 3   Method 4
65 (3)     75 (7)     59 (1)     94 (16)
87 (13)    69 (5)     78 (8)     89 (15)
73 (6)     83 (12)    67 (4)     80 (10)
79 (9)     81 (11)    62 (2)     88 (14)

Kruskal-Wallis Example
▪ Sum of the Ranks for Each Group
  ▸ T1 = 31, T2 = 35, T3 = 15, T4 = 55
  ▸ χ² critical (α = .05, df = k - 1 = 4 - 1 = 3) is 7.81 (Appendix B.8)

$$H = \frac{12}{N(N + 1)} \sum_j \frac{T_j^2}{n_j} - 3(N + 1) = \frac{12}{16(17)}\left(\frac{31^2}{4} + \frac{35^2}{4} + \frac{15^2}{4} + \frac{55^2}{4}\right) - 3(17) = 8.96$$

Kruskal-Wallis Example
▪ Therefore, since H (8.96) > χ² critical (7.81) we would reject the null hypothesis and conclude that the teaching methods differ
▪ As with the one-way ANOVA, we would most likely follow up this result in order to determine exactly where differences between the conditions exist
  ▸ These tests can be conducted with the nonparametric two independent samples Mann-Whitney test

Alternatives to the ANOVA F
▪ Nonnormality/Unequal Variances
  ▸ When distributions are nonnormal and variances are unequal, the Welch test on trimmed means or the Welch test on ranks can provide accurate Type I error rates for most conditions
    – As we have dealt in depth with these tests in the past, we won't deal with them in detail now.
The same procedures that were applied for the two independent-samples design are also applied here (also, as before, an R function for computing the one-way F test with trimmed means is available on my website)

Alternatives to the ANOVA F
▪ Nonnormality/Unequal Variances
  ▸ Transformations may also be effective at equating the variances and normalizing the data, especially when the means and vars/sds are proportional
    – Transformations become complicated with k > 2 groups because often the groups have different shapes and thus one transformation will not normalize all of the groups
  ▸ Resampling procedures may also be extremely effective, although their current use is limited by their availability
    – e.g., Howell only presents resampling procedures up to the two-group case

Notes Regarding Effect Sizes
▪ Currently, standardized effect size statistics for the Welch tests are not widely available, and therefore at this time simple mean differences will suffice
▪ For the Kruskal-Wallis, it is important to realize that it is just the ANOVA on ranks (though the K-W has adjustments for ties, etc.)
  ▸ Therefore, performing an ANOVA on ranks will allow you to produce eta-squared (or omega-squared)

Extension: Equivalence Tests
▪ Recall that if the goal of a study is to test whether all groups/conditions are EQUIVALENT on the outcome measure, then the one-way tests just discussed are not appropriate
  ▸ In other words, you would want to accept the null, which is not appropriate with standard null hypothesis testing procedures
▪ One-way tests of equivalence should be used when the goal is to demonstrate that multiple groups are equivalent on an outcome
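
Returning to the note on effect sizes for the Kruskal-Wallis test: the Python below is a minimal sketch (not part of the original slides) of what "ANOVA on ranks" looks like for the teaching-methods example. It jointly ranks the scores, computes H from the rank sums, and then computes eta-squared as SStreat / SStotal on the ranks. The function names (`midranks`, `rank_groups`, `kruskal_h`, `eta_squared_on_ranks`) are made up for illustration, and H is computed without a tie correction, which is harmless here only because the example data contain no ties.

```python
# Illustrative sketch: Kruskal-Wallis H and eta-squared from an ANOVA on ranks.
# Function names are hypothetical; data are from the teaching-methods example.

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def rank_groups(groups):
    """Jointly rank all observations, then split the ranks back into groups."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    out, start = [], 0
    for g in groups:
        out.append(ranks[start:start + len(g)])
        start += len(g)
    return out


def kruskal_h(ranked):
    """H = 12 / (N(N+1)) * sum(T_j^2 / n_j) - 3(N+1), with no tie correction."""
    n_total = sum(len(r) for r in ranked)
    term = sum(sum(r) ** 2 / len(r) for r in ranked)
    return 12 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)


def eta_squared_on_ranks(ranked):
    """SS_treat / SS_total, computed on the ranks instead of the raw scores."""
    pooled = [x for r in ranked for x in r]
    grand = sum(pooled) / len(pooled)
    ss_total = sum((x - grand) ** 2 for x in pooled)
    ss_treat = sum(len(r) * (sum(r) / len(r) - grand) ** 2 for r in ranked)
    return ss_treat / ss_total


methods = [
    [65, 87, 73, 79],  # Method 1
    [75, 69, 83, 81],  # Method 2
    [59, 78, 67, 62],  # Method 3
    [94, 89, 80, 88],  # Method 4
]
ranked = rank_groups(methods)
rank_sums = [sum(r) for r in ranked]
h_stat = kruskal_h(ranked)
eta_sq = eta_squared_on_ranks(ranked)
print(rank_sums, round(h_stat, 2), round(eta_sq, 2))
```

For these data the rank sums match T1 to T4 from the example, and, because there are no ties, H equals (N - 1) times the eta-squared on ranks, which is one way to see that the two analyses carry the same information.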