

Today’s Lecture Topic
• The Analysis of Variance
– Rationale behind the test
– Assumptions of an ANOVA
– Setting up the problem
– Computing the statistic
– Interpreting the results
Reference Material
• Burt and Barber, pages 479-480
• Disclaimer! Your book gives a very short
treatment of this subject but I feel that it is
an important tool and deserves special
attention
Recall the Two Sample Layout
• Two weeks ago we looked at the two-sample
layout and learned the T-Test for assessing the
difference between two sample means
• Last week we revisited the idea behind the
T-Test and discussed between-sample variability
and within-sample variability
• Today we are going to explore an extension of
the two-sample layout
•We will start with the following question:
What if you had more than two samples?
Variables
• A Variable is a characteristic that we expect to change
• When we test hypotheses, there are often underlying
variables in our topic of interest
• For example, last week's homework had us looking at
morphologic unit density in streams. This characteristic
varied across space and was therefore a variable
• But the test we ran wasn't only concerned with
morphologic unit density; we were actually interested in
whether or not the morphologic unit density differed above
the tributary and below the tributary
• This brings up an important concept: relationships
between variables
Dependence and Independence
• The morphologic units clearly have no effect on the
location in the stream, but the reverse may not be the
case
• Our statistical test suggested that the location relative
to the tributary had a significant effect on the
morphologic unit density
• So what we should recognize here is the potential
relationship between two variables
– One which is clearly independent of the other
– And one that is potentially dependent upon the other
– Yet we must remember that a statistical relationship does not
guarantee causality
Back to Our Question
• What if we were interested in comparing the
means of multiple samples?
• What would be the layout of such a
situation?
• Chalkboard examples:
– Death Penalty and Republicans and Democrats
– Death Penalty and Regions
Multiple Categories Create
Problems
• If we wanted to run a T-test on all potential
regional pairings, we would end up having to run
k(k-1)/2 T-tests (k is the number of regions)
• There are problems with this approach
– 1st – it is a lot of computational work
– 2nd – it has an underlying weakness with respect to
Alpha or Type I error
– As we run multiple tests the chance of us committing at
least one alpha error is greater than the alpha level for
just a single test
• We are willing to go down this road, but not until
after we have determined if it is statistically
necessary to do so
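The growth in Type I error can be sketched numerically. The snippet below is a minimal illustration, not part of the lecture: it assumes k = 6 regions (as in the survey example later in the lecture), an alpha of 0.05 per test, and that the pairwise tests are independent, which is only an approximation for tests that share samples.

```python
from math import comb

k = 6            # number of regions (from the survey example later in the lecture)
alpha = 0.05     # ASSUMPTION: per-test significance level

# Number of distinct pairings of k categories: k(k-1)/2
n_tests = comb(k, 2)

# Chance of at least one Type I error across all tests,
# ASSUMING the tests are independent (an approximation here).
familywise_error = 1 - (1 - alpha) ** n_tests

print(f"{n_tests} pairwise tests, familywise error = {familywise_error:.3f}")
```

Under these assumptions, six regions need 15 T-tests and the chance of at least one false positive is a little over one half, which is why a single overall test comes first.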
What Should We Do?
• Clearly multiple T-Tests are a dangerous
and work-intensive option
• We need a different approach to resolve our
question in a single test
• Fortunately such an approach exists and is a
fairly straightforward adaptation of the
T-Test
Analysis of Variance
• The ANalysis Of VAriance (or ANOVA)
operates with a null hypothesis that the
populations from which our multiple
samples are drawn have equal means on the
characteristic of interest (our dependent
variable)
• This null takes the form: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_{k-2} = \mu_{k-1} = \mu_k$
• As usual, this is a null hypothesis of no difference
Assumptions and Limitations
• Independent Random Samples
• Level of measure on the characteristic (dependent
variable) is interval-ratio
• Populations are normally distributed
• Population variances are equal
• If the sample sizes for each category are the same,
the test can handle some violation of the
assumptions, but if your sample sizes are unequal
or the assumptions are grossly violated, you will
have to use a non-parametric test
Working with Data: An Example
Capital Punishment By Region
Survey Data (number of favorable responses)
Region                 Mean  Standard Deviation
North East             6.4   0.9
Midwest                6.6   1.2
Great Plains/Rockies   8.3   1.8
Pacific Northwest      5.3   0.9
Southwest              7.4   1.1
South                  8.8   0.7
Notice that in the data above, each category (region) has a mean and standard deviation.
The means represent the central value of each category and can be used to compare
between categories, while the standard deviation (and its square, the variance)
represents within-category variation
Although the layout suggests a comparison of means, the computations actually
involve developing two separate estimates of the population variance (hence the
name analysis of variance)
So what jumps out at us from the data above?
Computations
• Before we start with the equations let's look at
what an Analysis of Variance does
• First off, it creates two estimates of population
variance, each built from a sum of squares
– The first is built from the sum of squares between
– The second is built from the sum of squares within
• Together the two sums of squares add up to the total sum of squares
• Mathematically the relationship between the three
looks like this:
• SST=SSB+SSW
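Why the three quantities add up can be shown in a few lines. The derivation below is a sketch in notation assumed for illustration: $X_{ik}$ is observation $i$ in category $k$, $n_k$ is that category's size, $\bar{X}_k$ is its mean, and $\bar{X}$ is the global mean of all observations.

```latex
\begin{align*}
SST &= \sum_{k}\sum_{i=1}^{n_k} (X_{ik} - \bar{X})^2 \\
    &= \sum_{k}\sum_{i=1}^{n_k} \bigl[(X_{ik} - \bar{X}_k) + (\bar{X}_k - \bar{X})\bigr]^2 \\
    &= \underbrace{\sum_{k}\sum_{i=1}^{n_k} (X_{ik} - \bar{X}_k)^2}_{SSW}
     + \underbrace{\sum_{k} n_k (\bar{X}_k - \bar{X})^2}_{SSB}
     + 2\sum_{k} (\bar{X}_k - \bar{X})
       \underbrace{\sum_{i=1}^{n_k} (X_{ik} - \bar{X}_k)}_{=\,0}
\end{align*}
```

The cross term drops out because deviations from a category's own mean always sum to zero, leaving SST = SSW + SSB.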
Calculating the Sum of Squares
• The sum of squares within
is very similar to what we
calculate regularly when
we compute a sample's
variance
• n is the size of the sample
for the category that we are
calculating the SSW for
• k indicates that we are
taking the mean of the kth
category
$SSW = \sum_{i=1}^{n} (X_i - \bar{X}_k)^2$
You calculate a SSW
for each category or
sample and then sum
them all for the total
SSW
Calculating the Sum of Squares
• The sum of squares
between denotes the
variability between
samples or categories
• nk is the number of
observations in a category
(its size)
• k indicates that we are
taking the mean of the kth
category and comparing it
to the global mean
$SSB = \sum n_k (\bar{X}_k - \bar{X})^2$
This computation is
run on all categories
and uses the “global”
or total mean which is
defined as the sum of
all observations
divided by N
Calculating the Sum of Squares
• The sum of squares total is
the sum of squares that we
are used to seeing when we
compute the variance
• In this case, it is a sum of
squares on all the data
• All the sum of squares
computations are relatively
easy in a spreadsheet but
there are computational
shortcuts available
$SST = \sum_{i=1}^{N} (X_i - \bar{X})^2$
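The three sums of squares can be sketched outside a spreadsheet as well. The example below is a minimal illustration using made-up scores for three hypothetical categories (A, B, C), not the lecture's survey data. It computes the within, between and total sums of squares directly from their definitions.

```python
# Hypothetical scores for three categories (NOT the lecture's survey data).
samples = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [5.0, 7.0, 9.0, 11.0],
}

def mean(values):
    return sum(values) / len(values)

# Pool every observation to get the "global" mean.
all_values = [x for group in samples.values() for x in group]
global_mean = mean(all_values)

# SSW: squared deviations from each category's own mean, summed over categories.
ssw = sum(sum((x - mean(g)) ** 2 for x in g) for g in samples.values())

# SSB: category size times squared deviation of the category mean from the global mean.
ssb = sum(len(g) * (mean(g) - global_mean) ** 2 for g in samples.values())

# SST: squared deviations of every observation from the global mean.
sst = sum((x - global_mean) ** 2 for x in all_values)

print(f"SSW={ssw:.1f}  SSB={ssb:.1f}  SST={sst:.1f}")
```

For these made-up numbers the within and between pieces add up to the total, as the relationship SST = SSB + SSW requires.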
Shortcuts to the Sum of Squares
• Since SST=SSW+SSB, if we
can find two, we can compute
the third
• SSB is pretty easy to calculate
because you are working with
the categories only
• SST has a shortcut that you can
use for an easier computation
• SSW=SST-SSB so you can
find it without actually
calculating it directly
$SST = \sum X^2 - N\bar{X}^2$
This is the sum of
all X squared minus
N times the global
mean squared
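The shortcut route can be checked against the direct definition. The sketch below again uses made-up scores for three hypothetical categories (an assumption for illustration only): it gets SST from the sum of squared observations, SSB from the category means, and SSW by subtraction, then compares that SSW with the directly computed one.

```python
# Hypothetical scores for three categories (NOT the lecture's survey data).
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [5.0, 7.0, 9.0, 11.0],
]

all_values = [x for g in groups for x in g]
N = len(all_values)
global_mean = sum(all_values) / N

# Shortcut: sum of all X squared minus N times the global mean squared.
sst_shortcut = sum(x ** 2 for x in all_values) - N * global_mean ** 2

# SSB needs only each category's size and mean.
ssb = sum(len(g) * (sum(g) / len(g) - global_mean) ** 2 for g in groups)

# SSW falls out by subtraction, with no per-observation deviations needed.
ssw_by_subtraction = sst_shortcut - ssb

# Direct SSW, computed only to confirm the shortcut agrees.
ssw_direct = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

print(f"SST={sst_shortcut:.1f}  SSB={ssb:.1f}  SSW={ssw_by_subtraction:.1f}")
```

Both routes should agree up to floating-point rounding, so the subtraction is a safe way to avoid the longest of the three computations.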
Degrees of Freedom
• The df for each type of sum of squares is
fairly easy to calculate
• dfw is the df within and it is defined as N-k
– N is the number of cases
– k is the number of categories or samples
• dfb is the df between and it is defined as k-1
– k is the number of categories
– 1 is the integer that comes before 2
Putting it all together
• Once we have the sum of squares and the
degrees of freedom, we can combine them
to create estimates of variance that are
known as mean square estimates
• The mean square within is simply the
SSW/dfw
• The mean square between is simply the
SSB/dfb
• These two can be combined in the following
fashion to create a statistic that is called the
F-Ratio: F = MSB/MSW
What is an F-Ratio?
• Since the F-Ratio is the result of the Mean Square
Between / Mean Square Within, it is the ratio of the
amount of variation between categories to the amount
of variation within categories
• As the SSB increases, the between category variation
increases and thus the F-Ratio increases
• As the SSW increases, the within category variation
increases and thus the F-Ratio decreases
$F\text{-Ratio} = \dfrac{SSB / df_b}{SSW / df_w}$
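How the F-Ratio responds to each sum of squares can be sketched with a small helper. The function name `f_ratio` and all the numbers passed to it are hypothetical, chosen only so the arithmetic is easy to follow. The degrees of freedom use the lecture's definitions, dfb = k-1 and dfw = N-k.

```python
def f_ratio(ssb, ssw, n_total, k):
    """Return MSB / MSW for k categories and n_total observations."""
    dfb = k - 1            # degrees of freedom between
    dfw = n_total - k      # degrees of freedom within
    msb = ssb / dfb        # mean square between
    msw = ssw / dfw        # mean square within
    return msb / msw

# Hypothetical baseline: 3 categories, 23 observations.
base = f_ratio(ssb=20.0, ssw=40.0, n_total=23, k=3)

# More variation between categories pushes F up.
more_between = f_ratio(ssb=30.0, ssw=40.0, n_total=23, k=3)

# More variation within categories pulls F down.
more_within = f_ratio(ssb=20.0, ssw=60.0, n_total=23, k=3)

print(base, more_between, more_within)
```

Holding the sample layout fixed, the comparison shows the two directions described above: raising SSB raises F, and raising SSW lowers it.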
Off to Excel
Finding the Result
• Our SSB is 66 and our SSW is 54. The two look
similar, but since our dfb is almost always much
smaller than our dfw, the result looks significant
• When we divide by the degrees of freedom
(dfb=5 and dfw=42) we find that our mean square
values are 13.27 between and 1.29 within, giving
us an F-Ratio of 10.3
• Given the df we can look up the critical value on
an F-Table at a given significance level and determine
that the result is significant at the 0.01 level
F=10.3,
dfb=5, dfw=42
So the resulting p-value
is <0.01
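The Excel result can be roughly cross-checked from the summary table alone. The sketch below rests on two assumptions not stated in the lecture: that each of the six regions has the same sample size (8 per region, which matches N = 48 implied by dfw = 42 with k = 6) and that the reported standard deviations are sample standard deviations. Because the table values are rounded, the output is only expected to land near the lecture's figures.

```python
# Region means and standard deviations copied from the survey table.
means = [6.4, 6.6, 8.3, 5.3, 7.4, 8.8]
sds = [0.9, 1.2, 1.8, 0.9, 1.1, 0.7]

k = len(means)   # 6 regions
n = 8            # ASSUMPTION: equal sample size per region
N = n * k        # 48, consistent with dfw = N - k = 42

dfb = k - 1
dfw = N - k

# With equal sample sizes, the global mean is the plain average of the region means.
global_mean = sum(means) / k

# SSB: category size times squared deviation from the global mean.
ssb = sum(n * (m - global_mean) ** 2 for m in means)

# SSW: ASSUMING sample SDs, (n - 1) * s^2 recovers each category's sum of squares.
ssw = sum((n - 1) * s ** 2 for s in sds)

msb = ssb / dfb
msw = ssw / dfw
f_stat = msb / msw

print(f"SSB={ssb:.1f}  SSW={ssw:.1f}  F={f_stat:.1f}")
```

Under these assumptions the sketch gives an SSB of about 67, an SSW of about 56 and an F-Ratio near 10, in line with the rounded values reported above.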
What does this mean?
• Since we now know that there is a statistically
significant level of variation between
categories, our next task would be to determine
which categories are statistically separable
Capital Punishment By Region
Survey Data (number of favorable responses)
Region                 Mean  Standard Deviation
North East             6.4   0.9
Midwest                6.6   1.2
Great Plains/Rockies   8.3   1.8
Pacific Northwest      5.3   0.9
Southwest              7.4   1.1
South                  8.8   0.7
I’d start with a T-Test on the PNW vs S,
then I’d run PNW vs GP/R, then I’d run
PNW vs SW, NE and MW and eventually I
would find that some regions can be
combined on this issue
Wrap Up and Homework
• Once again, we will be doing a single homework
assignment for the week
• This week's assignment will be using the one-way
ANOVA and its non-parametric equivalent to resolve
the same question
• Take note that the website has been enhanced with
the addition of some statistical summaries and
reference data
• The rest of class will be spent on the past two weeks'
homework, so feel free to leave if you have no
questions