Posted on: 9/12/2012 (Public Domain)
Today's Lecture Topic
• The Analysis of Variance
  – Rationale behind the test
  – Assumptions of an ANOVA
  – Setting up the problem
  – Computing the statistic
  – Interpreting the results

Reference Material
• Burt and Barber, pages 479-480
• Disclaimer! Your book gives a very short treatment of this subject, but I feel that it is an important tool and deserves special attention

Recall the Two-Sample Layout
• Two weeks ago we looked at the two-sample layout and learned the t-test for assessing the difference between two sample means
• Last week we revisited the idea behind the t-test and discussed between-sample variability and within-sample variability
• Today we are going to explore an extension of the two-sample layout
• We will start with the following question: what if you had more than two samples?

Variables
• A variable is a characteristic that we expect to change
• When we test hypotheses, there are often underlying variables in our topic of interest
• For example, last week's homework had us looking at morphologic unit density in streams.
• This characteristic varied across space and was therefore a variable
• But the test we ran wasn't solely concerned with morphologic unit density; we were actually interested in whether or not the morphologic unit density varied above the tributary and below the tributary
• This brings up an important concept: relationships between variables

Dependence and Independence
• The morphologic units clearly have no effect on the location in the stream, but the reverse may not be the case
• Our statistical test suggested that the location relative to the tributary had a significant effect on the morphologic unit density
• So what we should recognize here is the potential relationship between two variables
  – One which is clearly independent of the other
  – And one that is potentially dependent upon the other
  – Yet we must remember that a statistical relationship does not guarantee causality

Back to Our Question
• What if we were interested in comparing the means of multiple samples?
• What would be the layout of such a situation?
• Chalkboard examples:
  – Death Penalty and Republicans and Democrats
  – Death Penalty and Regions

Multiple Categories Create Problems
• If we wanted to run a t-test on all potential regional pairings, we would end up having to run k(k-1)/2 t-tests (where k is the number of regions)
• There are problems with this approach
  – 1st: it is a lot of computational work
  – 2nd: it has an underlying weakness with respect to alpha, or Type I error
  – As we run multiple tests, the chance of committing at least one alpha error is greater than the alpha level for any single test
• We are willing to go down this road, but not until after we have determined that it is statistically necessary to do so

What Should We Do?
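The alpha-inflation point above can be checked numerically. A minimal sketch, assuming the pairwise tests are independent (a simplification) and alpha = 0.05:

```python
# Familywise Type I error when running m independent tests at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
k = 6                  # six categories, e.g. six regions
m = k * (k - 1) // 2   # number of pairwise comparisons: 15

fwer = 1 - (1 - alpha) ** m
print(f"{m} pairwise tests -> P(at least one Type I error) = {fwer:.3f}")
```

With six categories the chance of at least one false positive is already over 50%, far above the nominal 5% level, which is exactly why we want a single overall test first.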
• Clearly multiple t-tests are a dangerous and work-intensive option
• We need a different approach that resolves our question in a single test
• Fortunately such an approach exists, and it is a fairly straightforward adaptation of the t-test

Analysis of Variance
• The ANalysis Of VAriance (or ANOVA) operates with a null hypothesis that the populations from which our multiple samples are drawn are equal on the characteristic of interest (our dependent variable)
• This null takes the form: μ1 = μ2 = μ3 = … = μk-2 = μk-1 = μk
• As usual, this null is one of no difference

Assumptions and Limitations
• Independent random samples
• The level of measurement on the characteristic (dependent variable) is interval-ratio
• Populations are normally distributed
• Population variances are equal
• If the sample sizes for each category are the same, the test can handle some violation of the assumptions, but if your sample sizes are unequal or the assumptions are grossly violated, you will have to use a non-parametric test

Working with Data: An Example

Capital Punishment by Region, Survey Data (number of favorable responses)

  Region                 Mean   Std. Dev.
  North East             6.4    0.9
  Midwest                6.6    1.2
  Great Plains/Rockies   8.3    1.8
  Pacific Northwest      5.3    0.9
  Southwest              7.4    1.1
  South                  8.8    0.7

Notice the data above: each category (region) has a mean and a standard deviation. The means represent the central value of each category and can be used to compare between categories, while the standard deviation (and its square, the variance) represents within-category variation.

Although the layout suggests a comparison of means, the computations actually involve developing two separate estimates of the population variance (hence the name "analysis of variance").

So what jumps out at us from the data above?
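The per-category summaries above come straight from the raw responses. A quick sketch of that bookkeeping, using made-up numbers for two of the regions (the raw survey data are not given in the notes, so these values are purely illustrative):

```python
import statistics

# Hypothetical raw survey responses for two regions (made-up numbers,
# chosen only to illustrate within-category variation).
responses = {
    "Pacific Northwest": [5, 4, 6, 5, 6, 5, 4, 7],
    "South":             [9, 8, 9, 10, 8, 9, 8, 9],
}

for region, xs in responses.items():
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)   # sample standard deviation (n - 1 denominator)
    print(f"{region}: mean = {mean:.2f}, sd = {sd:.2f}")
```

The means capture between-category differences; the standard deviations capture the spread within each category, and both feed into the variance estimates below.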
Computations
• Before we start with the equations, let's look at what an analysis of variance does
• First off, it creates two estimates of population variance
  – The first is known as the sum of squares between
  – The second is known as the sum of squares within
• Together these sum to the total sum of squares
• Mathematically, the relationship between the three looks like this:
  SST = SSB + SSW

Calculating the Sum of Squares: Within
• The sum of squares within is very similar to what we calculate regularly when we compute a sample's variance:
  SSW = Σᵢ₌₁ⁿ (Xᵢ − X̄ₖ)²
• n is the size of the sample for the category that we are calculating the SSW for
• k indicates that we are taking the mean of the kth category
• You calculate a SSW for each category or sample and then sum them all for the total SSW

Calculating the Sum of Squares: Between
• The sum of squares between denotes the variability between samples or categories:
  SSB = Σ nₖ (X̄ₖ − X̄)²
• nₖ is the number of observations in a category (its size)
• k indicates that we are taking the mean of the kth category and comparing it to the global mean
• This computation is run on all categories and uses the "global" or total mean X̄, which is defined as the sum of all observations divided by N

Calculating the Sum of Squares: Total
• The sum of squares total is the sum of squares that we are used to seeing when we compute the variance:
  SST = Σᵢ₌₁ᴺ (Xᵢ − X̄)²
• In this case, it is a sum of squares on all the data
• All the sum of squares computations are relatively easy in a spreadsheet, but there are computational shortcuts available

Shortcuts to the Sum of Squares
• Since SST = SSW + SSB, if we can find two, we can compute the other
• SSB is pretty easy to calculate because you are working with the categories only
• SST has a shortcut that you can use for an easier computation:
  SST = ΣX² − N·X̄²
  (the sum of all X squared, minus N times the global mean squared)
• SSW = SST − SSB, so you can find it without actually calculating it directly

Degrees of Freedom
• The df for each type of sum of squares is fairly easy to calculate
• dfw is the df within, and it is defined as N − k
  – N is the number of cases
  – k is the number of categories or samples
• dfb is the df between, and it is defined as k − 1
  – k is the number of categories
  – 1 is the integer that comes before 2

Putting It All Together
• Once we have the sums of squares and the degrees of freedom, we can combine them to create estimates of variance that are known as mean square estimates
• The mean square within is simply SSW/dfw
• The mean square between is simply SSB/dfb
• These two can be combined in the following fashion to create a statistic that is called the F-ratio:
  F = MSB / MSW

What Is an F-Ratio?
• Since the F-ratio is the Mean Square Between divided by the Mean Square Within, it is a function of the amount of variation between categories relative to the amount of variation within categories:
  F = (SSB / dfb) / (SSW / dfw)
• As the SSB increases, the between-category variation increases, and thus the F-ratio increases
• As the SSW increases, the within-category variation increases, and thus the F-ratio decreases

Off to Excel

Finding the Result
• Our SSB is 66 and our SSW is 54; since our dfb is almost always much smaller than our dfw, the result looks significant
• When we divide by the degrees of freedom (dfb = 5 and dfw = 42), we find that our mean square values are 13.2 between and 1.29 within, giving us an F-ratio of 10.3
• Given the df, we can find the result of this test on an F-table at a given significance level and determine that it is significant at a p-value of 0.01
• F = 10.3, dfb = 5, dfw = 42, so the resulting p-value is < 0.01

What does this mean?
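The sum-of-squares bookkeeping above can be sketched in plain Python. The three categories and their values here are made up, purely to show the mechanics and to confirm that SST = SSB + SSW:

```python
# Hand computation of the one-way ANOVA sums of squares on small made-up data.
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [5.0, 6.0, 7.0],
}

all_x = [x for xs in groups.values() for x in xs]
n_total = len(all_x)
grand_mean = sum(all_x) / n_total

# SSW: squared deviations of each observation from its own category mean.
# SSB: category size times squared deviation of the category mean from the grand mean.
ssw = ssb = 0.0
for xs in groups.values():
    gmean = sum(xs) / len(xs)
    ssw += sum((x - gmean) ** 2 for x in xs)
    ssb += len(xs) * (gmean - grand_mean) ** 2

sst = sum((x - grand_mean) ** 2 for x in all_x)
assert abs(sst - (ssb + ssw)) < 1e-9   # SST = SSB + SSW

dfb = len(groups) - 1         # k - 1
dfw = n_total - len(groups)   # N - k
f_ratio = (ssb / dfb) / (ssw / dfw)
print(f"SSB = {ssb}, SSW = {ssw}, SST = {sst}, F = {f_ratio:.2f}")
```

The same arithmetic applied to the lecture's numbers (SSB = 66, SSW = 54, dfb = 5, dfw = 42) reproduces the F-ratio of about 10.3.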
• Since we now know that there is a statistically significant level of variation between categories, our next task would be to determine which categories are statistically separable

Capital Punishment by Region, Survey Data (number of favorable responses)

  Region                 Mean   Std. Dev.
  North East             6.4    0.9
  Midwest                6.6    1.2
  Great Plains/Rockies   8.3    1.8
  Pacific Northwest      5.3    0.9
  Southwest              7.4    1.1
  South                  8.8    0.7

I'd start with a t-test on the Pacific Northwest vs the South, then I'd run PNW vs GP/R, then PNW vs SW, NE, and MW, and eventually I would find that some regions can be combined on this issue.

Wrap Up and Homework
• Once again, we will be doing a single homework assignment for the week
• This week's assignment will use the one-way ANOVA and its non-parametric equivalent to resolve the same question
• Take note that the website has been enhanced with the addition of some statistical summaries and reference data
• The rest of class will be spent on the past two weeks' homework, so feel free to leave if you have no questions
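Picking up the post-hoc idea from the "What does this mean?" slide: once the ANOVA is significant, the pairwise follow-up t-tests can be sketched as below. The raw region data are made up for illustration, and SciPy is assumed to be available:

```python
from itertools import combinations
from scipy import stats  # assumed available; provides the two-sample t-test

# Hypothetical raw responses for three of the regions (made-up numbers).
regions = {
    "Pacific Northwest": [5, 4, 6, 5, 6, 5, 4, 7],
    "South":             [9, 8, 9, 10, 8, 9, 8, 9],
    "Midwest":           [6, 7, 6, 8, 6, 7, 5, 7],
}

# Pairwise t-tests, run only AFTER the overall ANOVA has shown a difference.
for (name_a, a), (name_b, b) in combinations(regions.items(), 2):
    t, p = stats.ttest_ind(a, b)   # two-sample t-test, equal variances assumed
    print(f"{name_a} vs {name_b}: t = {t:.2f}, p = {p:.4f}")
```

Pairs with large p-values are candidates for the "regions that can be combined on this issue" mentioned above; in practice you would also adjust alpha for the number of comparisons, per the alpha-inflation caveat earlier in the lecture.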