

         Today’s Lecture Topic
• The Analysis of Variance
  –   Rationale behind the test
  –   Assumptions of an ANOVA
  –   Setting up the problem
  –   Computing the statistic
  –   Interpreting the results
           Reference Material
• Burt and Barber, pages 479-480
• Disclaimer! Your book gives a very short
  treatment of this subject, but I feel that it is
  an important tool and deserves special attention
     Recall the Two Sample Layout
• Two weeks ago we looked at the two sample
  layout and learned the T-Test for assessing the
  difference between two sample means
• Last week we revisited the idea behind the
  T-Test and discussed between-sample variability
  and within-sample variability
• Today we are going to explore an extension of
  the two sample layout
• We will start with the following question:
  What if you had more than two samples?
• A variable is a characteristic that we expect to change
• When we test hypotheses, there are often underlying
  variables in our topic of interest
• For example, last week's homework had us looking at
  morphologic unit density in streams. This characteristic
  varied across space and was therefore a variable
• But the test we ran wasn't solely concerned with
  morphologic unit density; we were actually interested in
  whether or not the morphologic units varied above the
  tributary and below the tributary
• This brings up an important concept: relationships
  between variables
     Dependence and Independence
• The morphologic units clearly have no effect on the
  location in the stream, but the reverse may not be the case
• Our statistical test suggested that the location relative
  to the tributary had a significant effect on the
  morphologic unit density
• So what we should recognize here is the potential
  relationship between two variables
   – One which is clearly independent of the other
   – And one that is potentially dependent upon the other
   – Yet we must remember that a statistical relationship does not
     guarantee causality
        Back to Our Question
• What if we were interested in comparing the
  means of multiple samples?

• What would be the layout of such a test?
• Chalkboard examples:
  – Death Penalty and Republicans and Democrats
  – Death Penalty and Regions
     Multiple Categories Create Problems
• If we wanted to run a T-Test on all potential
  regional pairings, we would end up having to run
  k(k-1)/2 T-Tests (where k is the number of regions)
• There are problems with this approach
   – 1st – it is a lot of computational work
   – 2nd – it has an underlying weakness with respect to
     Alpha or Type I error
   – As we run multiple tests the chance of us committing at
     least one alpha error is greater than the alpha level for
     just a single test
• We are willing to go down this road, but not until
  after we have determined if it is statistically
  necessary to do so
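The alpha-inflation problem above is easy to see numerically. A minimal sketch (not from the slides), assuming 6 regions and independent tests each run at alpha = 0.05:

```python
# Sketch: familywise Type I error when running many pairwise T-Tests,
# each at alpha = 0.05 (assumes the tests are independent).
alpha = 0.05
k = 6                     # number of regions (categories)
m = k * (k - 1) // 2      # number of distinct pairwise T-Tests = 15

# P(at least one Type I error) = 1 - P(no Type I error in any test)
fwer = 1 - (1 - alpha) ** m
print(m, round(fwer, 3))  # 15 tests -> familywise error rate about 0.54
```

Even at a nominal 5% alpha per test, the chance of at least one false rejection across all 15 pairings is over 50%, which is exactly why a single overall test is preferable.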
        What Should We Do?
• Clearly multiple T-Tests are a dangerous
  and work-intensive option
• We need a different approach to resolve our
  question in a single test

• Fortunately such an approach exists, and it is a
  fairly straightforward adaptation of the T-Test
         Analysis of Variance
• The ANalysis Of VAriance (or ANOVA)
  operates with a null hypothesis that the
  populations from which our multiple
  samples are drawn are equal on the
  characteristic of interest (our dependent variable)
• This null takes the form: μ1 = μ2 = μ3 = … = μk

• As usual, this null is one of no difference
   Assumptions and Limitations
• Independent Random Samples
• Level of measure on the characteristic (dependent
  variable) is interval-ratio
• Populations are normally distributed
• Population variances are equal

• If the sample sizes for each category are the same,
  the test can handle some violation of the
  assumptions, but if your sample sizes are unequal
  or the assumptions are grossly violated, you will
  have to use a non-parametric test
        Working with Data: An Example
Capital Punishment by Region
Survey Data (number of favorable responses)

Region                 Mean   Standard Deviation
North East              6.4                  0.9
Midwest                 6.6                  1.2
Great Plains/Rockies    8.3                  1.8
Pacific Northwest       5.3                  0.9
Southwest               7.4                  1.1
South                   8.8                  0.7

  Notice the data above: each category (region) has a mean and a standard
  deviation. The means represent the central value of each category and can
  be used to compare between categories, while the standard deviation (and
  its square, the variance) represents within-category variation.

  Although the layout suggests a comparison of means, the computations
  actually involve developing two separate estimates of the population
  variance (hence the name analysis of variance).

  So what jumps out at us from the data above?
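One thing worth checking against the assumptions listed earlier is the equal-variance requirement. A quick screen (a common textbook heuristic, not from the slides) compares the largest and smallest sample standard deviations:

```python
# Rough screen for the equal-variance assumption: if the largest sample
# SD is less than about twice the smallest, the assumption is usually
# considered tenable. SDs taken from the survey table above.
sds = {"NE": 0.9, "MW": 1.2, "GP/R": 1.8, "PNW": 0.9, "SW": 1.1, "S": 0.7}

ratio = max(sds.values()) / min(sds.values())
print(round(ratio, 2))  # 2.57 -> borderline; a formal check may be warranted
```

A ratio above 2 here (Great Plains/Rockies vs South) is one reason the homework also asks for the non-parametric equivalent.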
• Before we start with the equations, let's look at
  what an Analysis of Variance does
• First off, it creates two estimates of population variance
   – The first is known as the sum of squares between (SSB)
   – The second is known as the sum of squares within (SSW)
• Together these sum to the total sum of squares (SST)
• Mathematically the relationship between the three
  looks like this: SST = SSB + SSW
   Calculating the Sum of Squares
• The sum of squares within is very similar to what we
  calculate regularly when we compute a sample's variance:

      SSW = Σ (X_i − X̄_k)²   (i = 1 … n)

• n is the size of the sample for the category that we are
  calculating the SSW for
• k indicates that we are taking the mean of the kth category
• You calculate a SSW for each category or sample and then
  sum them all for the total SSW
   Calculating the Sum of Squares
• The sum of squares between denotes the variability
  between samples or categories:

      SSB = Σ n_k (X̄_k − X̄)²

• n_k is the number of observations in a category (its size)
• k indicates that we are taking the mean of the kth category
  and comparing it to the global mean
• This computation is run on all categories and uses the
  "global" or total mean X̄, which is defined as the sum of
  all observations divided by N
   Calculating the Sum of Squares
• The sum of squares total is the sum of squares that we
  are used to seeing when we compute the variance:

      SST = Σ (X_i − X̄)²   (i = 1 … N)

• In this case, it is a sum of squares on all the data
• All the sum of squares computations are relatively easy
  in a spreadsheet, but there are computational shortcuts
  available
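The three sums of squares defined above can be sketched in a few lines of Python. The sample values below are made up for illustration (three categories of four observations each), not the survey data:

```python
# Minimal sketch of the three sum-of-squares computations, verifying
# the decomposition SST = SSW + SSB on made-up data.
samples = [[6, 7, 5, 6], [8, 9, 8, 7], [5, 4, 6, 5]]

all_x = [x for s in samples for x in s]
N = len(all_x)
grand_mean = sum(all_x) / N
cat_means = [sum(s) / len(s) for s in samples]

# SSW: squared deviations from each category's own mean, summed over all
ssw = sum((x - m) ** 2 for s, m in zip(samples, cat_means) for x in s)
# SSB: category size times squared deviation of category mean from grand mean
ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, cat_means))
# SST: squared deviations of every observation from the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_x)

assert abs(sst - (ssw + ssb)) < 1e-9  # the decomposition holds
```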
  Shortcuts to the Sum of Squares
• Since SST = SSW + SSB, if we can find two, we can
  compute the other

      SST = Σ X² − N X̄²

  This is the sum of all X squared minus N times the
  global mean squared
• SSB is pretty easy to calculate because you are
  working with the categories only
• SST has a shortcut that you can use for an easier
  computation
• SSW = SST − SSB, so you can find it without actually
  calculating it directly
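The shortcut formula is easy to sanity-check numerically. A small sketch on made-up data (not the survey data):

```python
# Checking the computational shortcut SST = sum(X^2) - N * (grand mean)^2
# against the direct deviations-from-the-mean computation.
data = [6, 7, 5, 8, 9, 4]
N = len(data)
grand_mean = sum(data) / N

sst_direct = sum((x - grand_mean) ** 2 for x in data)
sst_shortcut = sum(x * x for x in data) - N * grand_mean ** 2

assert abs(sst_direct - sst_shortcut) < 1e-9  # both give 17.5 here
```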
         Degrees of Freedom
• The df for each type of sum of squares is
  fairly easy to calculate
• dfw is the df within and it is defined as N-k
  – N is the number of cases
  – k is the number of categories or samples
• dfb is the df between and it is defined as k-1
  – k is the number of categories
  – 1 is the integer that comes before 2
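Plugging in the numbers from the survey example (6 regions; N = 48 assumes 8 responses per region, which the slides do not state explicitly):

```python
# Degrees of freedom for the worked example.
# N = 48 is an assumption (8 survey responses in each of the 6 regions).
N, k = 48, 6

dfw = N - k   # within:  48 - 6 = 42
dfb = k - 1   # between:  6 - 1 = 5
print(dfw, dfb)  # 42 5
```

These match the df values used in the Excel example later in the lecture.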
         Putting it all together
• Once we have the sum of squares and the
  degrees of freedom, we can combine them
  to create estimates of variance that are
  known as mean square estimates
• The mean square within is simply MSW = SSW/dfw
• The mean square between is simply MSB = SSB/dfb
• These two can be combined in the following
  fashion to create a statistic that is called the
  F-Ratio - F=MSB/MSW
            What is an F-Ratio?
• Since the F-Ratio is the result of the Mean Square
  Between / Mean Square Within, it is a function of the
  amount of variation between categories to the amount
  of variation within categories
• As the SSB increases, the between category variation
  increases and thus the F-Ratio increases
• As the SSW increases, the within category variation
  increases and thus the F-Ratio decreases

   F-Ratio = (SSB / dfb) / (SSW / dfw)
Off to Excel
             Finding the Result
• Our SSB is 66 and our SSW is 54, similar in size,
  but our dfb is almost always much smaller than our
  dfw, so the mean squares tell a different story
• When we divide by the degrees of freedom
  (dfb=5 and dfw=42) we find that our mean square
  values are 13.27 between and 1.29 within, giving
  us an F-Ratio of 10.3
• Given the df we can find the result of this test on
  an F-Table at a given significance level and determine
  that it is significant at a p-value of 0.01
dfw=42, dfb=5

 So the resulting p-value
 is <0.01
         What does this mean?
• Since we now know that there is a statistically
  significant level of variation between
  categories, our next task would be to determine
  which categories are statistically separable
Capital Punishment by Region
Survey Data (number of favorable responses)

Region                 Mean   Standard Deviation
North East              6.4                  0.9
Midwest                 6.6                  1.2
Great Plains/Rockies    8.3                  1.8
Pacific Northwest       5.3                  0.9
Southwest               7.4                  1.1
South                   8.8                  0.7

       I’d start with a T-Test on the PNW vs S,
       then I’d run PNW vs GP/R, then I’d run
       PNW vs SW, NE and MW and eventually I
       would find that some regions can be
       combined on this issue
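The first of those follow-up comparisons can be sketched from the summary statistics alone. This assumes 8 responses per region and equal variances (neither stated on the slides), using the pooled-variance t statistic for two equal-sized samples:

```python
# Hedged sketch: two-sample t statistic from summary statistics.
# n = 8 per region is an assumption; the slides give only means and SDs.
import math

def t_from_stats(m1, s1, m2, s2, n=8):
    """Pooled-variance t statistic for two equal-sized samples."""
    sp2 = (s1 ** 2 + s2 ** 2) / 2            # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * 2 / n)

# South (mean 8.8, SD 0.7) vs Pacific Northwest (mean 5.3, SD 0.9)
t = t_from_stats(8.8, 0.7, 5.3, 0.9)
print(round(t, 2))  # 8.68: far beyond any critical value, so PNW and S differ
```

A large t here is expected; the more interesting follow-ups are the closer pairings (e.g. North East vs Midwest), which is where some regions turn out to be combinable.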
        Wrap Up and Homework
• Once again, we will be doing a single homework
  assignment for the week
• This week's assignment will be using the one-way
  ANOVA and its non-parametric equivalent to resolve
  the same question
• Take note that the website has been enhanced with
  the addition of some statistical summaries and
  reference data
• The rest of class will be spent on the past two weeks'
  homework, so feel free to leave if you have no questions