# Lecture_22

```
Today’s Lecture
• Interpreting data and problems to help
select the correct statistical test
• Introduction to the analysis of 3 or more
variables
The First Things that You Should
Do When Given Data and a Problem
• First Question:
– What type of data do I have?
– What is the level of measurement?
– How many sets of data are there?
• Second Question:
– What is being asked of me in the question?
– Does the question mention any key words like
compare or associate?
Narrowing the Range of
Possibilities
• In our class, I have taught you what I viewed to be the most
applicable tests to the various types of data that you will
encounter.
• There are entire groups of methods that deal with data forms that
we didn't cover.
• What this means is that your options on the exam (although
seemingly large) are actually quite limited.
• We spent the bulk of our time in hypothesis testing working on
two types of statistical cases.
• The first was comparisons of samples via their means, medians,
distributions, variances, etc.
• The second was the association of two variables at different
levels of measurement.
Samples and Variables
• Samples are the portion of a population that is observed.
• At their simplest, they are a representation of a larger
group.
• Variables are measurable phenomena whose values
change from observation to observation.
• In statistics, samples of variables can exist for data at any
level of measurement.
• Variables are often associated with one another. Such
associations can be spurious, or they can point to a
potential causal relationship.
Example
• If we were interested in comparing the AFC to
the NFC, what would be the correct method?
• If we look at the data (point differential for each
team in the AFC and NFC) we can see that it is
definitely a variable.
• But when we look at the column heads, it would
be easy to consider AFC and NFC to be a
categorical variable as well.
• But is this a two variable case? No, it isn't really.
• There is only one variable here (point
differential). The categories are immaterial
because our comparison is between the NFC and
AFC.
• The nature of the test that we would use assumes
in its null hypothesis that there is no difference
between the AFC and NFC.
• It assumes that they are two samples from the
same population (the NFL).
• So AFC and NFC are not a nominal variable in
the statistical sense.

AFC (PF-PA)   NFC (PF-PA)
        208            90
        -26            41
        -69            -8
        -78             1
         93            52
         -6            15
          5            12
        -46           -37
        118            20
         41           -24
          4           -13
        -23           -54
        -22            26
         25           -17
       -101           -82
        -25          -120
Another Example
• Here we have Point Differential
plotted vs Number of Wins
• There are clearly at least two
variables here
• Any question or hypothesis
would deal with the association
between two variables
[Figure: scatter plot of Wins (y-axis, 0 to 10) vs Point Differential (x-axis, -200 to 300), with AFC and NFC teams as separate series]
Decisions for One Variable
If we have data with one continuous variable, then we have a
number of options in terms of analysis (all of which are
essentially comparisons of samples to samples or samples to
populations)
First Question: How many Samples?

One Sample       Two Samples                Three or More Samples
One Sample

Question: Estimation of Parameters or Test of Distribution?

Estimation
  Question: Population Parameters Known?
    Yes -> Normal Distribution
    No  -> T-Distribution
Distribution
  Goodness of Fit -> Chi-Square/K-S

With one sample, the only options are estimation of
population parameters (like the mean, variance or
proportion), or comparisons of the sample distribution
to a hypothesized theoretical distribution via a
goodness of fit (most commonly done via a Chi-square Test)
Two Samples

Question: Are samples dependent or independent of one another?

Two Samples - Dependent or Paired
  Question: Sample Size
    large (>30) samples -> check normality with K-S Test
    small samples       -> check normality with S-W Test
  Question: Is the sample normally distributed?
    normal     -> Paired T-Test
    not normal -> Wilcoxon Signed-Rank Test

Two Samples - Independent
  Question: Sample Size
    large (>30) samples -> check normality with K-S Test
    small samples       -> check normality with S-W Test
  Question: Is the sample normally distributed?
    not normal -> Wilcoxon Rank Sum
    normal     -> Question: Are the variances equal?
                  Check with Ratio of Variances
                    Yes -> T-Test (pooled variance)
                    No  -> T-Test (non-pooled variance)

With two samples, we have to ask a minimum of three questions to select the test
Two Samples - Continued
• Are the samples independent of one another? (Remember
that paired cases require a slightly different approach.)
• How large are our samples?
– The larger the sample, the more likely it is to approach a
normal distribution. Larger samples are also more robust with
respect to assumptions
– Different tests of normality work best on different sample
sizes (Shapiro-Wilk for smaller samples, Kolmogorov-Smirnov
for larger samples)
– Non-parametric tests tend to require large sample
approximations for large samples (the tables for large samples
aren’t published)
Two Samples - Continued
• Is each sample normal in its distribution?
– If one of your samples fails the test for normality,
then it is almost always better to use a non-parametric
test
• If your samples are normal, then you will use a t-test,
but the standard t-test pools the variance
from each sample
• Are your variances roughly equal? If yes, then the
pooled t-test is the correct statistic. If they aren’t, then
you will want to use a non-pooled variance t-test
to compare the means of your samples
Three or More Samples

Our course only covered 2 options for three or more samples

Question: Sample Size
  large (>30) samples -> check normality with K-S Test
  small samples       -> check normality with S-W Test
Question: Is the sample normally distributed?
  normal     -> Analysis of Variance, then T-tests
  not normal -> Kruskal-Wallis, then Wilcoxon Rank Sum

You should note that I left out the ANOVA pretest for
equality of variances (Levene’s Test)
Three or More Samples
• We only need to ask two questions:
– What is our sample size?
– Are all our samples normally distributed?
• Once we determine the sample size and run the correct
test for normality, we can select the appropriate test to
compare samples.
• If even one sample is not normal, then we should use
the Kruskal-Wallis test in lieu of the ANOVA
• If all samples are normal, then you have to run
Levene’s Test for equality of variance before the data
can meet the assumptions for an ANOVA
Three or More Samples - Continued
• Remember that when you have completed your
comparison of samples, a rejection of the
null hypothesis (that they are all the same) is
only the first step
• When you determine that there is a difference,
you then have to find which samples differ via a
series of T-tests (if normal) or Wilcoxon Rank
Sum tests (if not normal)
• Your work isn’t done until you have determined
which samples differ significantly
Two Variable Associations
• We started looking at association with simple
tests for independence.
• Given two variables, we used a Chi-Square
Goodness of Fit comparison of the observed data
vs an expected distribution where the variables
were completely independent.
• From there we moved into measures of
association or correlation to assess the strength
and potentially the direction of the association
Key Questions for any
Association Problem
• First: What is the level of measurement for your
data?
• The next question depends on the answer to the first
– If nominal, then what is the size of your table?
– If ordinal and in categories, then what is the geometry
of your table (square or rectangular)?
– If ordinal and in ranks, then no further questions
– If interval/ratio data, then is it normal?
Nominal Associations
• If you have nominal data, then your best recourse
is to test for independence between the nominal
variables using a Chi-square goodness of fit test
• Once you have determined if there is an
association, you should use Phi to assess its
strength if you have a 2x2 table and Cramer’s V
if you have a larger than 2x2 table
Ordinal Category Associations
• If your data is in Ordinal Categories (with a clear
hierarchy), then your biggest question is whether
the table is symmetrical (2x2, 3x3, etc.) or
asymmetrical (2x3, 3x4, etc.)
– If it is symmetrical, then you use Kendall’s Tau-b, so
you can include ties in your analysis
– If it is asymmetrical, then you use the less sensitive
but more versatile Kendall’s Tau-c
Ordinal Rank Associations
• This type of data is continuous and can
therefore be treated much like interval/ratio
data.
• The only difference is that instead of
running your correlation on raw numbers,
you run it on ranks via a Spearman’s Rank
Correlation
Interval Ratio Associations
• The definitive parametric correlation is
Pearson’s Product Moment Correlation
• However, this test requires both bivariate
normality and a linear relationship, so if the data
fail a test for normality or the scatter plot is
clearly non-linear, then you should rank
your data and use the non-parametric
Spearman’s Rank Correlation
Summary Table for Associations
                                                      Measures of Association
Level of Measurement        Tests of Independence   Strength      Strength and Direction
Nominal Category Data
  2x2 Tables                Chi-Square              Phi
  2x3 Tables or Larger      Chi-Square              Cramer's V
Ordinal Category Data
  Symmetric Tables          Kendall's Tau-b                       Kendall's Tau-b
  Asymmetric Tables         Kendall's Tau-c                       Kendall's Tau-c
Ordinal Rank Data           Spearman's Rho                        Spearman's Rho
Interval Ratio Data
  Normally Distributed      Pearson's r                           Pearson's r
  Not Normally Distributed  Spearman's Rho                        Spearman's Rho
Note that there is no measure that will determine the direction of the
association in purely nominal data. But if your data is pseudo-nominal
(ordinal) then you can make the determination by looking at the major
diagonal and off diagonal of the table.
If your data is potentially Ordinal, then you should consider a
Kendall’s test in lieu of the Chi-square
I promised on the first day that we would
cover all of this:
[Flowchart of the course, from Begin Data Analysis to End Data Analysis]
Begin Data Analysis
Describe Variables? (No: go to Test Hypothesis?)
  Describe Distribution, Measures of Centrality,
  Measures of Dispersion, Estimate Population Values
Test Hypothesis? (Yes: go to One Variable?)
One Variable? (No: go to Two Variables?)
  One Sample, Two Samples, Analysis of Variance
Two Variables?
  Organized in Tables, Measures of Association,
  Three or More Variables
End Data Analysis
Association Between Three or
More Variables
• Given the tools that you now have, dealing with
multiple dependent variables is only an extension
of the simpler two variable analysis
• Typically what we do is create a matrix of
correlations between each pair of variables and
then observe their relationships to one another
• The statistics are exactly the same, but we run
them multiple times (once for each pair of
variables)
Example Output from SPSS

Descriptive Statistics
            Mean      Std. Deviation   N
VAR00001     1.5350    1.19661         20
VAR00002    54.6500   25.39224         20
VAR00003    13.0000    4.80132         20

Correlations (Pearson’s r)
                                  VAR00001   VAR00002   VAR00003
VAR00001   Pearson Correlation    1          -.567**    .263
           Sig. (2-tailed)                   .009       .263
           N                      20         20         20
VAR00002   Pearson Correlation    -.567**    1          -.526*
           Sig. (2-tailed)        .009                  .017
           N                      20         20         20
VAR00003   Pearson Correlation    .263       -.526*     1
           Sig. (2-tailed)        .263       .017
           N                      20         20         20
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Correlations (Spearman’s Rho)
                                      VAR00001   VAR00002   VAR00003
VAR00001   Correlation Coefficient    1.000      -.533*     .162
           Sig. (2-tailed)            .          .015       .495
           N                          20         20         20
VAR00002   Correlation Coefficient    -.533*     1.000      -.419
           Sig. (2-tailed)            .015       .          .066
           N                          20         20         20
VAR00003   Correlation Coefficient    .162       -.419      1.000
           Sig. (2-tailed)            .495       .066       .
           N                          20         20         20
*. Correlation is significant at the 0.05 level (2-tailed).
The End

-for now at least

```
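The decision rules from the slides can be written down as a small lookup, which may help when checking your reasoning on a practice problem. This is only a sketch in Python: the function names, argument names, and level labels (`"nominal"`, `"ordinal_category"`, `"ordinal_rank"`, `"interval_ratio"`) are made up for illustration and are not part of the lecture or of any statistics package. The functions return the name of a test; they do not run it.

```python
def normality_check(sample_size):
    """Normality test suggested by the slides for a given sample size."""
    # Large (>30) samples: Kolmogorov-Smirnov; small samples: Shapiro-Wilk.
    return "Kolmogorov-Smirnov" if sample_size > 30 else "Shapiro-Wilk"


def choose_comparison_test(n_samples, paired=False, all_normal=True,
                           equal_variances=True):
    """Pick a test for comparing two or more samples of one variable."""
    if n_samples < 2:
        raise ValueError("this sketch only covers two or more samples")
    if n_samples == 2:
        if paired:
            return "paired t-test" if all_normal else "Wilcoxon signed-rank"
        if not all_normal:
            return "Wilcoxon rank-sum"
        # Normal, independent samples: the variance check picks the t-test.
        if equal_variances:
            return "t-test (pooled variance)"
        return "t-test (non-pooled variance)"
    # Three or more samples. Levene's test is run before the ANOVA,
    # but it is not modelled in this sketch.
    if all_normal:
        return "ANOVA, then pairwise t-tests"
    return "Kruskal-Wallis, then pairwise Wilcoxon rank-sum"


def choose_association_measure(level, two_by_two=False, square_table=True,
                               normal=True):
    """Pick a measure of association from the level of measurement."""
    if level == "nominal":
        strength = "Phi" if two_by_two else "Cramer's V"
        return "Chi-square, then " + strength
    if level == "ordinal_category":
        return "Kendall's Tau-b" if square_table else "Kendall's Tau-c"
    if level == "ordinal_rank":
        return "Spearman's Rho"
    if level == "interval_ratio":
        return "Pearson's r" if normal else "Spearman's Rho"
    raise ValueError("unknown level of measurement: " + repr(level))


if __name__ == "__main__":
    # AFC vs NFC point differential: two independent samples of 16 teams.
    print(normality_check(16))
    print(choose_comparison_test(2, paired=False, all_normal=True,
                                 equal_variances=False))
    print(choose_association_measure("interval_ratio", normal=False))
```

For the AFC/NFC example, the lookup would point to a Shapiro-Wilk check first (16 teams per conference), and only then to a t-test or a Wilcoxon rank-sum test depending on what that check and the ratio of variances show.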