# Today's Lecture Topics

- Tests of Distributions
  - Why do it?
  - What are our options?
- Goodness of Fit
  - Chi-Square: $\chi^2 = \sum_{i=1}^{k} \frac{(f_{obs} - f_{exp})^2}{f_{exp}}$
  - Kolmogorov-Smirnov: $D = \max |S(x) - F(x)|$
## Reference Material

- Burt and Barber, pages 353-364
## Tests of Distributions

- Many of our hypothesis tests thus far have assumed a particular distribution for the characteristic of interest
- Often we assume normality in a random variable because of the central limit theorem
- Often we assume that the variances of the distributions of two samples are the same
- Thus far we have been testing all of our hypotheses first with a parametric test and then again with a non-parametric test
## What Can These Tests Do?

- If we want to test a hypothesis without risking a violation of assumptions, there are many "pre-tests" available to:
  - Test for normality
  - Compare variances
  - Compare entire distributions
  - Test for independence between variables
- Today we will focus on comparing distributions in the general sense; Thursday we will directly test for normality and equality of variance
## Why Does This Matter?

- 1st, it provides a means of comparing two variables and their distributions
- 2nd, it reinforces your results by showing that your model of hypothesis testing is not violating key assumptions
- 3rd, it tips you off when a variable requires a non-parametric treatment, removing the need to perform multiple tests to verify the result of a hypothesis test
## Recall

- The reason we assume normality in so many situations is the Central Limit Theorem
- We are actually assuming that a hypothetical distribution of sample means will approximate a normal distribution as the number of samples grows very large
- Even if the underlying distribution that the samples are drawn from is not normal, the sampling distribution will approach normality if the distribution is reasonably symmetric
- Under the Central Limit Theorem, anything can be assumed to be normal given a sufficient sample size and a "reasonably well behaved" population distribution
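The effect described above is easy to watch in a short simulation. This sketch (not from the lecture; the exponential population and sample sizes are invented for illustration) draws many samples from a strongly skewed population and checks that the distribution of sample means is far more symmetric:

```python
import random

def skewness(xs):
    """Sample skewness: mean of the cubed standardized values."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

random.seed(0)
# A strongly right-skewed population (exponential; its skewness is about 2).
population = [random.expovariate(1 / 10) for _ in range(100_000)]

# Means of many samples of size n drawn (with replacement) from it.
n = 50
sample_means = [sum(random.choices(population, k=n)) / n for _ in range(2_000)]

print(f"population skewness:  {skewness(population):.2f}")     # far from 0
print(f"sample-mean skewness: {skewness(sample_means):.2f}")   # close to 0
```

The sampling distribution of the mean loses most of the population's skew even at n = 50, which is the CLT claim made above.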
## Theory is Nice, but Reality is Often Cruel

- Despite the theoretical underpinnings of parametric statistical analysis, if you assume normality and then perform a parametric test on data that is far from normal, you run a serious risk of a Type I (alpha) error
- That said, you can't run non-parametric tests in every situation, because they are at best 95% as efficient as their parametric equivalents
## An Example of Pitfalls

*(Figure: a population distribution that is not "reasonably well behaved")*

Source: http://www.indiana.edu/~jkkteach/ExcelSampler/
## What to Do?

- If we want to be more certain about our sample and its relationship to other samples or a specific probability distribution, we can run tests to see how well it fits
- A Goodness of Fit test is a non-parametric comparison that can be used to assess normality and a host of other distribution characteristics
## So How Does It Work?

- If we assumed that a population had a specific theoretical distribution (read: normal here) and knew the parameters of that distribution, we could define its exact shape
- Even without its parameters, if we had a sample that was supposed to be from that population, we could use the sample parameters to estimate its shape
- Given that shape, we would expect that if the sample actually came from the assumed population, the distribution of the sample would be close to the estimated distribution of the population
- If they are not similar, then it is likely that the sample is not from the hypothetical distribution and our assumption does not hold
## Setting Up a Chi-Square Test

- If we are interested in whether or not a sample is from a distribution that is normal, we would set up a Chi-Square Goodness of Fit test with the following hypotheses:
  - H0: The distribution is normal
  - HA: The distribution is not normal
- The Chi-Square requires degrees of freedom, and in a goodness of fit test, our degrees of freedom are determined by how much we know about the population
  - We start with the number of classes that we will use in our fit (the more classes, the tighter the fit); the number of classes is denoted by the letter k
  - If we know both μ and σ², then our df is k − 1
  - If we are going to estimate both from the sample, then our df is k − 1 − 2 = k − 3
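The degrees-of-freedom bookkeeping generalizes to k − 1 − m, where m is the number of parameters estimated from the sample. A one-line sketch (the function name is ours, not from the text):

```python
def chi_square_df(k, params_estimated):
    """df for a chi-square goodness-of-fit test with k classes,
    subtracting one per population parameter estimated from the sample."""
    return k - 1 - params_estimated

print(chi_square_df(8, 2))   # k = 8 classes, mu and sigma estimated -> 5
print(chi_square_df(8, 0))   # both parameters known in advance -> 7
```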
## Considerations

- The degrees of freedom say a great deal about the specificity of the test
- If we are only interested in normality, then we use k − 3 and run what is essentially a comparison of shape (via the k classes)
- But if we are using specific population parameters, then we are trying to show that the sample is not only normal but also assumes the specified parameters of the population
## What Do We Actually Compare?

- This test works by comparing the observed number of events in a class to the expected number of events in that class
- This can be done with virtually any distribution and even works well for comparing two variables
- In this week's efforts, we are going to limit ourselves to comparing a sample to the normal distribution
## Choosing the Number of Classes

- Since k determines the number of degrees of freedom, it behooves us to maximize the number of classes
- That said, there is a rule of thumb that states that no class should have fewer than five observations
- If more than 20% (one fifth) of the classes have expected frequencies of less than five, then the test is invalid
- It is ok to pool adjacent classes at the tails so that they have a higher expected frequency, but when you pool, you lose degrees of freedom
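The rule of thumb can be turned into a quick check: with equal-probability classes, the expected frequency per class is n/k, so k cannot exceed n divided by five. A small sketch (the helper name is ours):

```python
def max_classes(n, min_expected=5):
    """Largest k for which equal-probability classes still give an
    expected frequency f_exp = n/k of at least min_expected."""
    return n // min_expected

print(max_classes(40))   # n = 40 -> at most 8 classes
print(max_classes(36))   # n = 36 -> at most 7, so k = 6 is comfortably valid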
## The Statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_{obs} - f_{exp})^2}{f_{exp}}$$

- There are only three elements to this statistic:
  - k = the number of classes
  - f(obs) = the number of observations in each class
  - f(exp) = n × (1/k) for a sample-to-distribution comparison
- Despite the fact that this is a non-parametric test, there are two assumptions:
  1. The sample is random
  2. The sample is reasonably large (30 would give 3 df)
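The computation can be sketched in a few lines of stdlib Python. This is not the lecture's Excel workflow, and the data below are simulated stand-ins (not the homework values); it uses equal-probability classes, so the class edges sit at the 1/k, 2/k, ... quantiles of a normal fitted from the sample:

```python
import random
from statistics import NormalDist, mean, stdev

def chi_square_gof(sample, k):
    """Chi-square sum over k classes of (f_obs - f_exp)^2 / f_exp, with
    class edges at the i/k quantiles of a normal fitted from the sample
    (equal-probability classes, so f_exp = n/k and df = k - 3)."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    edges = [fitted.inv_cdf(i / k) for i in range(1, k)]
    f_obs = [0] * k
    for x in sample:
        f_obs[sum(x > e for e in edges)] += 1   # class index = edges below x
    f_exp = n / k
    return sum((fo - f_exp) ** 2 / f_exp for fo in f_obs)

random.seed(1)
scores = [random.gauss(76, 12) for _ in range(40)]   # stand-in exam scores
print(f"chi-square sum with k = 8: {chi_square_gof(scores, 8):.2f}")
```

Compare the printed sum to the chi-square critical value at df = k − 3; a small sum means the observed class counts are close to the expected n/k.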
## An Example Problem

- Recall the exam scores from Homework 2
- It was a single sample of exam scores with a mean of 76.125 and a standard deviation of 12.225
- What if we were interested in comparing these scores to another sample of exam scores via a t-test?
- What if this sample is from a distribution that isn't "reasonably well behaved"?
- If we could show that it was normal, it would make our analysis considerably less stressful with respect to alpha errors
## Histogram

*(Figure: histogram of the exam scores, with frequency on the left axis and cumulative percentage on the right, over score classes from 30 to 100.)*
## Off to Excel
## Results

- Our Goodness of Fit test with k = 8 classes had a Chi-Square sum of 1.2 with k − 3 degrees of freedom; this provides a p-value of 0.94
- By the same token, if we look up a 0.05 alpha on the χ² table in your book (page 620), we find that a critical value of 11.07 would be required to reject
- This distribution may not be normal, but statistically it is close enough
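Both table lookups above can be reproduced numerically. This stdlib-only sketch implements the chi-square survival function with the standard lower-incomplete-gamma power series (the helper is ours, not something from the text) and inverts it by bisection:

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable: one minus the regularized lower
    incomplete gamma P(df/2, x/2), computed from its power series."""
    s, hx = df / 2.0, x / 2.0
    term = total = 1.0 / s
    denom = s
    for _ in range(500):
        denom += 1.0
        term *= hx / denom
        total += term
        if term < 1e-14 * total:
            break
    return 1.0 - total * math.exp(-hx + s * math.log(hx)) / math.gamma(s)

# p-value for the chi-square sum of 1.2 with df = k - 3 = 5
print(f"p-value: {chi2_sf(1.2, 5):.2f}")     # ~0.94, as on the slide

# invert by bisection to recover the alpha = 0.05 critical value
lo, hi = 0.0, 100.0
while hi - lo > 1e-9:
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if chi2_sf(mid, 5) > 0.05 else (lo, mid)
print(f"critical value: {hi:.2f}")           # ~11.07, matching the table
```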
## Another Goodness of Fit Test

- The Chi-Square is a well-known test that has been around for nearly a century (it dates back to Pearson and was made commonplace by the 1920s)
- Since that time, there have been numerous improvements in the methods to determine goodness of fit or distribution normality
- One such improvement is the Kolmogorov-Smirnov Test
## Kolmogorov-Smirnov

- Dr. Fik always referred to this test as the Vodka Test, but its origins have nothing to do with Smirnov Vodka
- Kolmogorov was a very talented Soviet mathematician who created this test in the late 1930s
- It is a very popular test because it is sensitive to differences in both location (central tendency) and shape (dispersion)
- Today there are a couple of more powerful tests that are strictly for normality, but the K-S test is still one of the most versatile
## D-Statistic

- Two assumptions:
  - The cumulative distribution function S(x) is generated from a random sample of n observations
  - This is compared to the cumulative probability distribution function of some hypothesized distribution, F(x)
- Hypotheses:
  - H0: S(x) = F(x), HA: S(x) ≠ F(x)
- Test statistic:

$$D = \max |S(x) - F(x)|$$
## Same Example

- Once again, a test for normality
- Your book gives the critical value in table A-9 of the appendix (page 621)
- Note that the test is expressly designed for smaller samples
- n = 40, alpha = 0.05
- D(critical) = 0.210
## Back to Excel
## Results

- The results of our test yield only a slight difference: D(obs) = 0.065
- D(critical) = 0.210 > D(obs) = 0.065, so we fail to reject H0; the Kolmogorov-Smirnov Test suggests that this data is normal
## Homework 15

- Given a subset of the Dissolved Oxygen data from your textbook (n = 36 samples), use a Chi-Square test with k = 6 classes to test the sample for normality
- Given the same subset, use a Kolmogorov-Smirnov test to test for normality
- Make all tests at alpha = 0.10
