Lecture 22
                  Today’s Lecture
• Interpreting data and problems to help
  select the correct statistical test
• Introduction to the analysis of 3 or more
  variables
 The First Things You Should Do When Given Data and a Problem
 • First Question:
   – What type of data do I have?
   – What is the level of measurement?
   – How many sets of data are there?
 • Second Question:
   – What is being asked of me in the question?
   – Does the question mention any key words like
     compare or associate?
            Narrowing the Range of Possibilities
• In our class, I have taught you what I viewed to be the most
  applicable tests to the various types of data that you will
  encounter.
• There are entire groups of methods that deal with data forms that
  we didn't cover.
• What this means is that your options on the exam (although
  seemingly large) are actually quite limited.
• We spent the bulk of our time in hypothesis testing working on
  two types of statistical cases.
• The first was comparisons of samples via their means, medians,
  distributions, variances, etc.
• The second was the association of two variables at different
  levels of measurement.
            Samples and Variables
• Samples are the portion of a population that is observed.
• At their simplest, they are a representation of a larger
  group.
• Variables are measurable phenomena whose values
  change from observation to observation.
• In statistics, samples of variables can exist for data at any
  level of measurement.
• Variables are often associated with one another; such
  associations can be spurious or a potential source of
  causality.
                       Example
• If we were interested in comparing the AFC to the NFC, what
  would be the correct method?
• If we look at the data (point differential for each team in the
  AFC and NFC), we can see that it is definitely a variable.
• But when we look at the column head, it would be easy to
  consider AFC and NFC to be a categorical variable as well.
• But is this a two-variable case? No, it isn't really.
• There is only one variable here (point differential). The
  categories are immaterial because our comparison is between the
  NFC and AFC.
• The nature of the test that we would use assumes in its null
  hypothesis that there is no difference between the AFC and NFC.
• It assumes that they are two samples from the same population
  (the NFL).
• So AFC and NFC are not a nominal variable in the statistical
  sense.

      AFC (PF-PA)   NFC (PF-PA)
              208            90
              -26            41
              -69            -8
              -78             1
               93            52
               -6            15
                5            12
              -46           -37
              118            20
               41           -24
                4           -13
              -23           -54
              -22            26
               25           -17
             -101           -82
              -25          -120
                  Another Example
• Here we have Point Differential plotted vs. Number of Wins.
• There are clearly at least two variables here.
• Any question or hypothesis would deal with the association
  between the two variables.

  [Scatter plot: Wins (0–10) vs. Point Differential (-200 to 300),
   with AFC and NFC teams plotted as separate series]
        Decisions for One Variable
   If we have data with one continuous variable, then we have a
   number of options in terms of analysis (all of which are
   essentially comparisons of samples to samples or samples to
   populations)
First Question: How many samples?
  • One Sample
  • Two Samples
  • Three or More Samples
                       One Sample

Estimation of parameters or test of distribution?
  • Estimation: Are the population parameters known?
      – Yes: use the Normal Distribution
      – No: use the T-Distribution
  • Distribution: Goodness of Fit (Chi-Square / K-S)

With one sample, the only options are estimation of population
parameters (like the mean, variance, or proportion), or comparison
of the sample distribution to a hypothesized theoretical
distribution via a goodness-of-fit test (most commonly done via a
Chi-square test).
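To make the goodness-of-fit branch concrete, here is a minimal Python/SciPy sketch (not from the lecture); the category counts and the uniform expected distribution are invented for illustration.

```python
# Chi-square goodness-of-fit sketch; observed/expected counts are made up.
from scipy import stats

observed = [18, 22, 25, 15, 20]   # hypothetical counts in 5 categories
expected = [20, 20, 20, 20, 20]   # hypothesized (uniform) distribution

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
# A small p-value -> reject the hypothesis that the sample follows
# the theoretical distribution.
```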
                      Two Samples

Question: Are the samples dependent (paired) or independent of one
another?

Two Samples - Dependent or Paired
  • Question: Sample size?
      – Large (>30) samples: check normality with the K-S Test
      – Small samples: check normality with the S-W Test
  • If normal: Paired T-Test
  • If not normal: Wilcoxon Signed-Rank Test

Two Samples - Independent
  • Question: Sample size?
      – Large (>30) samples: check normality with the K-S Test
      – Small samples: check normality with the S-W Test
  • If normal: Are the variances equal? (check with the ratio of
    variances)
      – Yes: T-Test (pooled variance)
      – No: T-Test (non-pooled variance)
  • If not normal: Wilcoxon Rank Sum

With two samples, we have to ask a minimum of three questions.
          Two Samples - Continued
• Are the samples independent of one another? (Remember
  that paired cases require a slightly different approach.)
• How large are our samples?
   – The larger the sample, the more likely that you will approach a
     normal distribution; larger samples are more robust with
     respect to assumptions.
   – Different tests of normality work best on different sample
     sizes (Shapiro-Wilk for smaller samples, Kolmogorov-
     Smirnov for larger samples).
   – Non-parametric tests tend to require large-sample
     approximations for large samples (the tables for large samples
     aren’t published).
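As a rough illustration of the two normality checks named above, here is a small Python/SciPy sketch; the simulated data and the 0.05 cutoff are assumptions, not part of the lecture.

```python
# Normality checks: Shapiro-Wilk for small n, Kolmogorov-Smirnov for large n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
small_sample = rng.normal(loc=0, scale=1, size=15)    # small n -> Shapiro-Wilk
large_sample = rng.normal(loc=0, scale=1, size=200)   # large n -> K-S

sw_stat, sw_p = stats.shapiro(small_sample)
# K-S against a normal with the sample's own mean and sd
# (a common, if approximate, practice).
ks_stat, ks_p = stats.kstest(large_sample, 'norm',
                             args=(large_sample.mean(), large_sample.std(ddof=1)))

print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3f}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3f}")
# p > 0.05 in either test -> no evidence against normality for that sample.
```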
         Two Samples - Continued
• Is each sample normal in its distribution?
   – If one of your samples fails the test for normality,
     then it is almost always better to use a non-parametric
     test
• If your samples are normal, then you will use a t-
  test, but the standard t-test pools the variance
  from each sample
• Are your variances roughly equal? If yes, then the pooled-
  variance t-test is the correct statistic; if they aren’t, then
  you will want to use a non-pooled variance T-test to compare
  the means of your samples.
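The decision between the pooled t-test, the non-pooled (Welch) t-test, and the Wilcoxon rank sum might look like the following Python/SciPy sketch; the simulated samples and the variance-ratio rule of thumb are illustrative assumptions.

```python
# Two-sample decision sketch: normality check, then variance check, then test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 2, size=40)
b = rng.normal(11, 4, size=40)

# Step 1: are both samples plausibly normal?
normal = stats.shapiro(a).pvalue > 0.05 and stats.shapiro(b).pvalue > 0.05

if not normal:
    # Non-parametric fallback: Wilcoxon rank-sum test
    stat, p = stats.ranksums(a, b)
    label = "Wilcoxon rank-sum"
else:
    # Step 2: ratio of variances as a rough check for equal spread
    ratio = max(a.var(ddof=1), b.var(ddof=1)) / min(a.var(ddof=1), b.var(ddof=1))
    equal_var = ratio < 2           # crude rule of thumb for this example
    stat, p = stats.ttest_ind(a, b, equal_var=equal_var)
    label = "pooled t-test" if equal_var else "non-pooled (Welch) t-test"

print(f"{label}: statistic = {stat:.3f}, p = {p:.3f}")
```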
                Three or More Samples

Question: Sample size?
  • Large (>30) samples: Is each sample normally distributed?
    Check with the K-S Test.
  • Small samples: Is each sample normally distributed?
    Check with the S-W Test.

  • If normal: Analysis of Variance, then T-tests
  • If not normal: Kruskal-Wallis, then Wilcoxon Rank Sum

Our course only covered 2 options for three or more samples.

You should note that I left out the ANOVA pretest for equality of
variances (Levene’s Test).
           Three or More Samples
• We only need to ask two questions:
   – What is our sample size?
   – Are all our samples normally distributed?
• Once we determine the sample size and run the correct
  test for normality, we can select the appropriate test to
  compare samples.
• If even one sample is not normal, then we should use
  the Kruskal-Wallis test in lieu of the ANOVA
• If all samples are normal, then you have to run the
  Levene’s Test for equality of variance before the data
  can meet the assumptions for an ANOVA
Three or More Samples - Continued
• Remember that when you have completed your
  comparison of samples, a rejection of the
  null hypothesis (that they are all the same) is
  only the first step
• When you determine that there is a difference,
  you then have to find which samples differ via a
  series of T-tests (if normal) or Wilcoxon Rank
  Sums (if not normal)
• Your work isn’t done until you have determined
  which samples differ significantly
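A possible end-to-end sketch of this three-or-more-samples workflow in Python/SciPy is shown below; the simulated groups and the 0.05 thresholds are assumptions for illustration only.

```python
# ANOVA vs. Kruskal-Wallis, with Levene's pretest and pairwise follow-up tests.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = {"A": rng.normal(10, 2, 35),
          "B": rng.normal(12, 2, 35),
          "C": rng.normal(10, 2, 35)}

all_normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups.values())
equal_var = stats.levene(*groups.values()).pvalue > 0.05   # ANOVA pretest

if all_normal and equal_var:
    stat, p = stats.f_oneway(*groups.values())              # ANOVA
    pairwise = lambda x, y: stats.ttest_ind(x, y)           # follow-up t-tests
    label = "ANOVA"
else:
    stat, p = stats.kruskal(*groups.values())                # Kruskal-Wallis
    pairwise = lambda x, y: stats.ranksums(x, y)             # follow-up rank sums
    label = "Kruskal-Wallis"

print(f"{label}: statistic = {stat:.2f}, p = {p:.4f}")
if p < 0.05:                                                 # which samples differ?
    for (n1, g1), (n2, g2) in combinations(groups.items(), 2):
        print(f"  {n1} vs {n2}: p = {pairwise(g1, g2).pvalue:.4f}")
```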
       Two Variable Associations
• We started looking at association with simple
  tests for independence.
• Given two variables, we used a Chi-Square
  Goodness of Fit comparison of the observed data
  vs an expected distribution where the variables
  were completely independent.
• From there we moved into measures of
  association or correlation to assess the strength
  and potentially the direction of the association
           Key Questions for any
           Association Problem
• First: What is the level of measurement for your
  data?
• The following question depends on your first
  answer
  – If nominal, then what is the size of your table?
  – If ordinal and in categories, then what is the geometry
    of your table (square or rectangular)?
  – If ordinal and in ranks, then no further questions.
  – If interval/ratio data, then is it normally distributed?
          Nominal Associations
• If you have nominal data, then your best recourse
  is to test for independence between the nominal
  variables using a Chi-square goodness of fit test
• Once you have determined if there is an
  association, you should use Phi to assess its
  strength if you have a 2x2 table and Cramer’s V
  if you have a larger than 2x2 table
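As an illustration, the test of independence and the follow-up strength measures could be computed in Python/SciPy roughly as follows; the contingency table is invented.

```python
# Chi-square test of independence plus Phi / Cramer's V for strength.
import numpy as np
from scipy import stats

table = np.array([[20, 15, 30],
                  [25, 10, 18]])                    # hypothetical 2x3 counts

chi2, p, dof, expected = stats.chi2_contingency(table)
n = table.sum()
r, c = table.shape

phi = np.sqrt(chi2 / n)                             # strength for a 2x2 table
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))   # strength for larger tables

print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
print(f"phi = {phi:.3f} (2x2 tables), Cramer's V = {cramers_v:.3f} (larger tables)")
```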
     Ordinal Category Associations
• If your data is in Ordinal Categories (with a clear
  hierarchy), then your biggest question is whether
  or not the table is symmetrical (2x2, 3x3, etc.) or
  asymmetrical (2x3, 3x4, etc.)
  – If it is symmetrical, then you use Kendall’s Tau-b, so
    you can include ties in your analysis
  – If it is asymmetrical, then you use the less sensitive
    but more versatile Kendall’s Tau-c
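For example, both variants can be computed with scipy.stats.kendalltau (the variant argument assumes a reasonably recent SciPy, 1.7 or later); the ordinal scores below are invented.

```python
# Kendall's Tau-b (square tables, handles ties) vs. Tau-c (rectangular tables).
from scipy import stats

x = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]     # ordinal categories, variable 1
y = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]     # ordinal categories, variable 2

tau_b, p_b = stats.kendalltau(x, y, variant='b')
tau_c, p_c = stats.kendalltau(x, y, variant='c')

print(f"tau-b = {tau_b:.3f} (p = {p_b:.3f}), tau-c = {tau_c:.3f} (p = {p_c:.3f})")
```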
    Ordinal Rank Associations
• This type of data is continuous and can
  therefore be treated much like interval/ratio
  data.
• The only difference is that instead of
  running your correlation on raw numbers,
  you run it on ranks via a Spearman’s Rank
  Correlation
     Interval Ratio Associations
• The definitive parametric correlation is the
  Pearson’s Product Moment Correlation
• However, this test requires both bivariate
  normality and a linear relationship, so if it
  fails a test for normality or the scatter plot is
  clearly non-linear, then you should rank
  your data and use the non-parametric
  Spearman’s Rank Correlation
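A small Python/SciPy sketch of that decision might look like this; the simulated data and the 0.05 cutoff are assumptions.

```python
# Pearson's r if both variables look normal, otherwise Spearman's rho on ranks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(50, 10, 30)
y = 0.8 * x + rng.normal(0, 5, 30)

both_normal = stats.shapiro(x).pvalue > 0.05 and stats.shapiro(y).pvalue > 0.05

if both_normal:
    r, p = stats.pearsonr(x, y)          # parametric: Pearson's product-moment r
    print(f"Pearson's r = {r:.3f}, p = {p:.4f}")
else:
    rho, p = stats.spearmanr(x, y)       # fall back to ranks: Spearman's rho
    print(f"Spearman's rho = {rho:.3f}, p = {p:.4f}")
```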
         Summary Table for Associations

                             Tests of         Measures of Association
 Level of Measurement        Independence     Strength      Strength and Direction
 Nominal Category Data
   2x2 Tables                Chi-Square       Phi
   2x3 Tables or Larger      Chi-Square       Cramer's V
 Ordinal Category Data
   Symmetric Tables          Kendall's Tau-b                Kendall's Tau-b
   Asymmetric Tables         Kendall's Tau-c                Kendall's Tau-c
 Ordinal Rank Data           Spearman's Rho                 Spearman's Rho
 Interval Ratio Data
   Normally Distributed      Pearson's r                    Pearson's r
   Not Normally Distributed  Spearman's Rho                 Spearman's Rho
 Note that there is no measure that will determine the direction of the
 association in purely nominal data. But if your data is pseudo-nominal
 (ordinal) then you can make the determination by looking at the major
 diagonal and off diagonal of the table.
 If your data is potentially Ordinal, then you should consider a
 Kendall’s test in lieu of the Chi-square
     I promised on the first day that we would
                 cover all of this:

Begin Data Analysis → Describe Variables?
  • Yes: Describe Distribution → Measures of Centrality →
    Measures of Dispersion → Estimate Population Values
  • No: Test Hypothesis?
      – No: End Data Analysis
      – Yes: One Variable?
          • Yes: One Sample / Two Samples / Analysis of Variance
          • No: Two Variables?
              – Yes: Organized in Tables → Measures of Association
              – No: Three or More Variables
      → End Data Analysis
  Association Between Three or
        More Variables
• Given the tools that you now have, dealing with
  multiple dependent variables is only an extension
  of the simpler two-variable analysis
• Typically what we do is create a matrix of
  correlations between each of the variables and
  then observe their relationships to one another
• The statistics are exactly the same, but we run
  them multiple times (once for each pair of
  variables)
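For instance, a correlation matrix can be produced in a few lines with pandas; the column names and simulated data below are hypothetical, not the SPSS variables shown on the next slide.

```python
# Pairwise correlation matrices for three variables (Pearson and Spearman).
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "attendance": rng.normal(50, 25, 20),
    "wins":       rng.integers(0, 11, 20),
    "rank":       rng.integers(1, 16, 20),
})

print(df.corr(method="pearson"))     # parametric pairwise correlations
print(df.corr(method="spearman"))    # rank-based pairwise correlations
```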
            Example Output from SPSS

Descriptive Statistics
                  Mean     Std. Deviation    N
  VAR00001       1.5350         1.19661     20
  VAR00002      54.6500        25.39224     20
  VAR00003      13.0000         4.80132     20

Correlations (Pearson’s r)
                                   VAR00001   VAR00002   VAR00003
  VAR00001  Pearson Correlation        1        -.567**     .263
            Sig. (2-tailed)                      .009       .263
            N                          20          20        20
  VAR00002  Pearson Correlation     -.567**        1       -.526*
            Sig. (2-tailed)           .009                   .017
            N                          20          20        20
  VAR00003  Pearson Correlation       .263       -.526*       1
            Sig. (2-tailed)           .263        .017
            N                          20          20        20
  **. Correlation is significant at the 0.01 level (2-tailed).
  *. Correlation is significant at the 0.05 level (2-tailed).

Correlations (Spearman’s rho)
                                        VAR00001   VAR00002   VAR00003
  VAR00001  Correlation Coefficient      1.000      -.533*      .162
            Sig. (2-tailed)                .          .015      .495
            N                              20          20        20
  VAR00002  Correlation Coefficient      -.533*      1.000     -.419
            Sig. (2-tailed)               .015          .        .066
            N                              20          20        20
  VAR00003  Correlation Coefficient       .162       -.419     1.000
            Sig. (2-tailed)               .495        .066        .
            N                              20          20        20
  *. Correlation is significant at the 0.05 level (2-tailed).
  The End

-for now at least

				