									                                                                                           Cornwell 1

Paul Cornwell
March 25, 2011
MAT 5900- Monte Carlo Simulations
Professors Frey and Volpert

                Necessary Sample Size for Good Central Limit Theorem Approximation

           In conjunction with the Law of Large Numbers, the Central Limit Theorem is the

foundation for the majority of statistical practices today. The term Central Limit Theorem (CLT)

actually refers to a series of theorems, but in practice it is usually condensed as follows: averages

of independent, identically distributed variables (with finite, positive variance σ²) tend towards the

normal (Gaussian) distribution as sample size n gets large. Specifically, the normal distribution

to which the sampling distribution tends has mean μ and variance σ²/n. The utility of

this theory is grounded in the idea of averages. Even if the underlying distribution of some

variable is finicky, it can still be analyzed through the use of averages. For this reason, the CLT

is invoked in many situations involving predictions or probabilities. Because of the “limit” part

of the theorem, however, the normal distribution will only be an approximation for the sampling

distribution if n is finite. Exactly how large n has to be for the CLT to provide a good

approximation is an elusive question, and it turns out that Monte Carlo simulations are as good a

tool as any for making such determinations.

           The generalization of the CLT commonly taught in statistics classes today is a theorem of

Laplace from 1810. In discussing this result, Hans Fischer says the normal approximation is

appropriate “under conditions that, in practice, were always fulfilled.”1 As Fischer indicates,

there is not a lot of concern surrounding the conditions (i.e. the necessary sample size) needed to

appeal to the CLT. However, it is still important to know at what point it becomes acceptable to

do so. As sample size increases, the normal approximation becomes better and better, so at the

    1. A History of the Central Limit Theorem, 353

very least having a target n in mind would give an indication of the quality of an approximation,

even if it is impossible to alter the design of an experiment or study by increasing the sample

size. One problem with this (which is probably responsible for a lack of guidelines on the topic)

is the tremendous dependence of the sampling distribution on the underlying distribution. Thus,

instead of getting too specific, the word “large” is employed to describe the requisite sample

size. My goal is to come up with a more specific procedure for determining how large of a

sample size is needed to get a good approximation from the Central Limit Theorem.

       The problem of diverse underlying distributions is not the only obstacle in this endeavor.

Prior to giving any kind of numerical answer(s) to the question at hand, it is necessary to define

exactly what it means to be a “good” approximation. Practically speaking, the sampling

distribution of a random variable is never normal. For this reason, using conventional “tests for

normality” is a task doomed to failure. Given enough data, they will always be able to reject the

hypothesis that the sampling distribution is normal—because it’s not. However, there are a

few techniques to be salvaged from this field that will be of use. Instead of doing conventional

tests, the best way to approach this problem is to consider the qualities of the Gaussian

distribution. This way, there is at least some basis of comparison for the sampling distribution to

the normal. Once these criteria are in place, the next task is to determine how closely a

distribution must resemble the normal for each, which will then determine whether or not the

sample size is large enough for a good approximation. Finally, I will have to devise a way of

communicating these results. It is not necessarily practical to have a separate number for every

possible distribution for two reasons: first, there are far too many for this information to be

helpful; second, in practice it is common not to know the underlying distribution for a random

variable, only certain features. Instead, it would be helpful if I could identify some common

features of these distributions that affect the speed of their convergence.

          The normal distribution can be observed to have the following properties: it is unimodal;

it is bell-shaped; it is symmetric; and it is continuous. Thus, in order for a sampling distribution

to be approximated by the normal, it should certainly exhibit these traits. Each of these can be

measured qualitatively using different metrics. Skewness, for example, is defined as the third

standardized central moment. For any continuous or discrete probability density function f (x),

this is given by the equation γ₁ = E[(X − μ)³]/σ³. As the name suggests, this moment measures the

skewness of a distribution, or more specifically its asymmetry. While skewness of zero does not

necessarily imply perfect symmetry, perfect symmetry implies skewness equal to zero. Thus, the

normal distribution has skewness equal to zero. The “peakedness” of a distribution is measured

by the fourth standardized central moment: β₂ = E[(X − μ)⁴]/σ⁴.2 This statistic is called kurtosis, and it

measures, depending on who you ask, how peaked or flat a distribution is, or how heavy the tails

are. The kurtosis of the normal distribution is equal to three, so it is common practice to subtract

three from the calculated fourth moment to quantify peakedness vis-à-vis the normal. This

modified statistic is known as “excess kurtosis.”3 This is the statistic that I observe in my

simulations. In theory, there is nothing special about the first four moments (these two, along

with μ1 = mean and μ2 = variance). It turns out that these numbers are insufficient for describing

a distribution in full. However, they have the advantage of being fairly easy to compute, and

    2. Testing for Normality, 41
    3. Wolfram MathWorld, “Skewness”

behaving well for averages.4 Thus, if used in combination with other measures, they could be

useful tools in determining the quality of an approximation.
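Although the simulations in this paper were done in R, the two moments above are easy to estimate from any sample. The following Python sketch (the function names are my own, not from the paper) computes both and checks them on a large normal sample, where each should be near zero:

```python
import random

def skewness(xs):
    """Third standardized central moment: E[(X - mu)^3] / sigma^3."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / var ** 1.5

def excess_kurtosis(xs):
    """Fourth standardized central moment minus 3 (zero for the normal)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

random.seed(1)
gauss = [random.gauss(0, 1) for _ in range(200_000)]
print(skewness(gauss), excess_kurtosis(gauss))  # both near 0 for normal data
```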

           The various moments of a distribution can all be calculated analytically, without the aid

of simulation. Other measures, however, necessitate some kind of simulation in order to work. R

is very useful in this case, because it can easily generate variables from a large number of built-in

distributions or homemade PDFs. There are two “size” parameters at play in my code. The first

(N) is the number of observations in the sampling distribution. This number should be large so

that one has a thorough idea of what the sampling distribution looks like. Unless otherwise

noted, I generated 500,000 sample means in every simulation. The second (n) is the number of

observations that are averaged to make each of those sample means. It is the latter of these that

should affect the shape of the sampling distribution and is therefore the subject of this

investigation. A sampling distribution for any distribution can be generated fairly easily and

quickly with a “for” loop, and that is all that is needed to begin a comparison to the normal. One

of the features of a sampling distribution that cannot be measured purely analytically is the

proportion of observations in the tails. We know that 95% of the data falls within approximately

1.96 standard deviations from the mean for any normal distribution. Thus, if you standardize the

sample based on its own mean and standard deviation, you can compare the size of the tails by

calculating the proportion of numbers whose absolute value is greater than 1.96. This calculation

takes practically no time, even for a large number of values N. This formula can be tweaked to

give the percentage of the data that falls within any number of standard deviations of the mean.

The so-called empirical rule states that, for the normal distribution, 68.27% of data is within one

standard deviation of the mean, 95.45% within two standard deviations and 99.73% of

    4. (from Dr. Frey) Skewness decreases by a factor of √n and excess kurtosis by a factor of n

observations within three standard deviations.5 These proportions can be calculated through

simulation just like the tail probabilities, and they can be easily compared to the normal.
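The procedure just described can be sketched in a few lines. The paper's code was written in R; the Python helpers and names below are my own illustration of the same idea, using a small N for speed:

```python
import random
import statistics

def sampling_distribution(draw, n, N, seed=0):
    """N sample means, each the average of n i.i.d. draws (the 'for' loop above)."""
    rng = random.Random(seed)
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(N)]

def proportion_beyond(means, z):
    """Standardize by the sample's own mean and sd; fraction beyond +/- z sds."""
    mu = statistics.fmean(means)
    sd = statistics.pstdev(means)
    return sum(1 for m in means if abs(m - mu) > z * sd) / len(means)

# Uniform[0,1] averages of size 5; compare the nominal 5% tail to the normal.
means = sampling_distribution(lambda rng: rng.random(), n=5, N=50_000)
print(proportion_beyond(means, 1.96))       # should land near .05
print(1 - proportion_beyond(means, 1.0))    # empirical rule: near .68
```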

           Another valuable tool for comparison in R is the empirical cumulative distribution

function (EDF). A cumulative distribution (CDF) is a function plotting the probability that a

random variable will be less than or equal to x. Thus, the limit of the CDF at negative infinity is zero, and at

positive infinity is one. Given any PDF (even a sample), one can make a CDF empirically by

simply plotting the sample observations as the independent variable versus the proportion of the

sample elements less than or equal to each observation as the dependent variable. This can be

done easily in R by simply sorting the vector containing the sample and plotting it against the

cumulative probabilities. For an easy contrast with the normal, one can standardize the sampling

distribution based on its mean and standard deviation and compare it to the standard normal

CDF. The better the approximation by the normal, the closer these two graphs should be. In

practice, it turns out that the greatest deviation between two CDFs (usually) occurs in the middle,

which makes sense because the extreme values (limits) must be the same for all CDFs. Although

even an “eye test” is valuable with these graphs, there is a way to quantify closeness between

CDFs. Kolmogorov-Smirnov distance is defined as the maximum distance between two CDFs.

Although a test for normality based on this statistic is unpopular because of its poor power and

inapplicability for discrete distributions, it is still a good measure for quality of approximation

that should be relatively consistent from simulation to simulation.
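A minimal Python sketch of this comparison (the paper's code was in R; the function names here are my own) standardizes the sampling distribution, then scans the sorted values for the largest gap between the EDF and the standard normal CDF:

```python
import math
import random
import statistics

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_distance_to_normal(sample):
    """Max vertical gap between the standardized sample's EDF and the N(0,1) CDF."""
    mu = statistics.fmean(sample)
    sd = statistics.pstdev(sample)
    zs = sorted((x - mu) / sd for x in sample)
    n = len(zs)
    d = 0.0
    for i, z in enumerate(zs):
        c = normal_cdf(z)
        # The EDF jumps from i/n to (i+1)/n at z; check the gap on both sides.
        d = max(d, abs(c - i / n), abs((i + 1) / n - c))
    return d

rng = random.Random(2)
means = [sum(rng.random() for _ in range(5)) / 5 for _ in range(50_000)]
print(ks_distance_to_normal(means))  # small: uniform averages converge quickly
```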

           For the reasons discussed above, there are few guidelines for a minimum sample size

required to invoke the CLT in practice. The general rule of thumb that one sees in introductory

statistics courses calls for n ≥ 30—not a very helpful guide. In reality, depending on the

underlying distribution, this recommendation can either be overkill, or entirely inadequate. A
    5. A First Course in Statistical Methods, 81

slight improvement on this exists for the binomial distribution, where the recommendation is a

function of the parameter p, which gives the probability of a “success” for a binary variable.

Generally, one will see recommendations that n be greater than the larger of 5/p and 5/q,

where q is equal to (1-p).6 The reasoning for this is that the skewness of the binomial distribution

increases as p moves away from .5. In general, the closer a distribution is to normal, the quicker

its sampling distribution will approach normal. Others suggest that the values np and nq be

greater than ten instead of five, but the idea is the same. Besides the binomial distribution,

however, there is not much guidance beyond n having to be “large.”
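The rule of thumb above is simple enough to express directly. A small sketch (the helper name is my own) under the np ≥ 5 and nq ≥ 5 convention, which reduces to n ≥ 5/min(p, q):

```python
import math

def min_n_binomial(p, k=5):
    """Smallest n with both n*p >= k and n*(1-p) >= k, i.e. n >= k/min(p, 1-p)."""
    return math.ceil(k / min(p, 1.0 - p))

print(min_n_binomial(0.5))        # 10
print(min_n_binomial(0.1))        # 50
print(min_n_binomial(0.1, k=10))  # 100, the stricter np, nq >= 10 version
```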

           To come up with a set of criteria for these statistics is seemingly arbitrary. However, it

must be done in order to have some kind of objective scale for identifying a good approximation

to the normal distribution. One way to make this process easier is to have two different thresholds:

one for a superior approximation and the other for an adequate one. By looking at both

histograms of the samples I generated and comparisons of the empirical distribution functions to

the normal cumulative distribution function, I decided that the following standards are requisite

for an “adequate” approximation to normal: excess kurtosis should be less than .5 in magnitude;

skewness should be less than .25 in magnitude; the tail probabilities for nominal 5% should be

between .04 and .06; and Kolmogorov-Smirnov distance should be less than .05. For a superior

approximation, the following numbers are required: excess kurtosis less than .3 in magnitude;

skewness less than .15 in magnitude; tail probabilities for nominal 5% should be between .04 and

.06; and K-S distance should be less than .02. One could argue that these requirements are fairly

conservative, but sometimes there is very little differentiating two distributions, especially

considering the traits that all probability density functions must share by virtue of being

    6. A First Course in Statistical Methods, 167

probability distributions. What follows is the application of these measures to various underlying distributions.


       It turns out that one of the fastest distributions to converge to normal in averages is the

continuous uniform. The only parameters of the uniform distribution are the endpoints of the

interval. The third and fourth moments are independent of the length of the interval, and

experiments with changing the length suggest that doing so does not affect the rate of

convergence. Thus, for all of my simulations, I used the interval [0,1]. A distribution of 500,000

samples of size three already seems to be approximately normal by looking at a histogram of the

data. Because of its symmetry, the uniform distribution has no skewness, so one would expect

samples to behave the same way. The excess kurtosis, however, is equal to -1.2 for the

continuous uniform distribution. Thus, for a sample of size five it is equal to -.24 (kurtosis

decreases by a factor of n). Generating uniform samples of size five gave a diminutive

Kolmogorov-Smirnov (KS) distance of .005, which suggests a very tight fit by the normal

distribution. The percentage of data in the tails (compared to the normal value of 5%) was .048.

The continuous uniform is a member of the beta family of distributions (α = β =1). The beta has

two parameters, α and β, and is defined on the interval [0,1]. If one lets α = β =1/3, then the

result is a bimodal, symmetric distribution, with a valley at its mean. It turns out that the

convergence of this distribution is fast as well. The exact sample sizes needed for each statistic

are summarized in the tables at the end. Other than the normal distribution itself, it is unlikely

that a distribution will converge this quickly, but it could still be helpful to have some of the

statistics of this sampling distribution in order to see what is required for a good fit.
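The 1/n scaling of excess kurtosis noted above can be checked directly by simulation. A Python sketch (the paper used R; the helper and names are my own), averaging five Uniform[0,1] draws, where theory predicts −1.2/5 = −.24:

```python
import random

def excess_kurtosis(xs):
    """Fourth standardized central moment minus 3 (zero for the normal)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

rng = random.Random(4)
# Averages of five Uniform[0,1] draws; theory predicts excess kurtosis -1.2 / 5.
means = [sum(rng.random() for _ in range(5)) / 5 for _ in range(200_000)]
print(excess_kurtosis(means))  # near -.24
```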

       Returning to the binomial distribution, it seems as if the rule of thumb that the product of

n and p should be greater than 5 to invoke the Central Limit Theorem is insufficient. Even when

that product is ten, it turns out that there is still a large deviation between the two CDFs around

the middle of the distribution. The following graph shows how Kolmogorov-Smirnov distance

decreases as np increases.

[Graph: K-S Distance for Binomial versus n·p, plotted for p = .5 and p = .1, with K-S distance on the vertical axis (roughly .04 to .06) and n·p on the horizontal axis from 0 to 35.]

As expected, the quality of approximation (at least insofar as it is measured by K-S distance)

seems to vary with np, so it is appropriate to speak about that product instead of a particular

proportion p. It appears that the K-S distance doesn’t really begin to level off until about np = 15.

And it is not until np = 30 that it finally gets down to the .05 level. This number indicates

problems in the middle of the distribution, but even the tails are not as close as is desired for np =

20. A sample of 20 sampling distributions when n = 40 and p = .5 yielded an upper bound of

.0414 for the proportion of data outside 1.96 standard deviations. This is at the very bottom of

the target area (.04, .06) compared to the normal distribution’s .05. Once np is greater than 30, the

tail percentage is a more desirable .0519. The skewness of the binomial is zero when p = .5, but

for p = .1 it requires a sample size of 317 to get the skewness below .15. Even this number is

small compared to the value of 1850 needed to get the K-S distance under .02. Part of the

problem is that the binomial distribution is discrete, which hurts the K-S statistic, but it still

seems that the requirement that np be bigger than 10 (n =100 for p = .1) is not adequate.

        Another common continuous distribution is the exponential. The exponential has one

parameter, λ, which is the inverse of its mean. It turns out that the rate of convergence for this

distribution in averages is independent of that parameter; thus, I elected to use λ =1. When n is as

small as eight, the K-S distance drops below .05. However, at this point, the EDF still indicates a

difference in the tails of the data because the skewness and kurtosis are a little high (.75 for

kurtosis and 1/√2 ≈ .707 for skewness). I think that when the sample size is ten the approximation

becomes better. The pictures below show a histogram of the data from the sample, and a graph of

the EDF versus the CDF for the normal distribution. When the sample size is 10, the excess

kurtosis and skewness values are .6 and .63 respectively, and the distribution appears to follow

the normal fairly closely.

The one problem—and this is evident in the histogram more than the EDF—is the positive skew

of this distribution. This distribution shows that it takes a long time to get rid of asymmetry from

the underlying distribution. In this case, it is not until n reaches 64 that the skewness becomes

low enough for an adequate approximation. The next highest n value needed to get in range for

the other statistics is 12 for the kurtosis. Likewise, for a superior approximation a sample size of

178 is needed to get a low enough skewness (n = 45 needed for the K-S distance is next highest).

At this point, the normal approximation for exponential samples is very good.
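The persistence of skewness can be seen numerically in its 2/√n scaling for exponential averages. A Python sketch (the paper's simulations were in R; the helper and names are my own), at n = 64 where theory predicts 2/√64 = .25:

```python
import math
import random

def skewness(xs):
    """Third standardized central moment: E[(X - mu)^3] / sigma^3."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / var ** 1.5

rng = random.Random(5)
# Exponential(1) has skewness 2; averages of n = 64 should show 2/sqrt(64) = .25.
means = [sum(rng.expovariate(1.0) for _ in range(64)) / 64 for _ in range(100_000)]
print(skewness(means), 2 / math.sqrt(64))
```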

        With all of this discussion of the Central Limit Theorem, it is worth mentioning the times

that it is inapplicable. The theorem states that the random variables being averaged must have

positive variance, which means it must have at least two moments. Thus some distributions with

no moments (heavy tails) should not be approximated well by the normal. For example, the

Cauchy distribution has no mean, and therefore no variance or any other moments. The PDF of

the Cauchy distribution is f(x) = (1/π) · b/((x − m)² + b²), where b is the half-width of the distribution at half its

maximum and m is the median. Integrating this function confirms that it is a PDF regardless of b,

but one runs into problems trying to find the mean. Letting m = 0 (center the distribution at the y-

axis), the anti-derivative (b/(2π)) ln(x² + b²) of the integrand x·f(x) represents the first moment of the

Cauchy distribution. Since the log function grows unbounded, this distribution has no mean for

any value of b. Thus, for distributions with no mean or variance, one would not expect any kind

of convergence to the normal. In fact, it turns out that the Cauchy distribution looks the same in

averages of any sample size (including n = 1). One can see here how the EDFs for 50,000

Cauchy variables with location zero and scale one, and for averages of 50 Cauchy variables with the

same parameters look the same:
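A quick way to see this numerically is to compare the spread of single Cauchy draws with that of averages of 50 draws. This Python sketch (the inverse-CDF sampler and names are my own, not the paper's R code) uses the interquartile range, since the Cauchy has no variance to compare:

```python
import math
import random

def cauchy(rng, m=0.0, b=1.0):
    """Inverse-CDF draw from a Cauchy with median m and scale b."""
    return m + b * math.tan(math.pi * (rng.random() - 0.5))

def iqr(xs):
    """Interquartile range; for a Cauchy with scale b it is 2b."""
    s = sorted(xs)
    return s[3 * len(s) // 4] - s[len(s) // 4]

rng = random.Random(3)
singles = [cauchy(rng) for _ in range(50_000)]
averages = [sum(cauchy(rng) for _ in range(50)) / 50 for _ in range(2_000)]

# Averaging 50 draws does not tighten the spread: both IQRs stay near 2b = 2.
print(iqr(singles), iqr(averages))
```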

        Another interesting example of this phenomenon is the Student’s t distribution. This

distribution has heavier tails than the normal, and the number of moments it has depends on the

number of degrees of freedom. This distribution could be helpful in determining the minimum

sample size needed for a symmetric, unimodal distribution because its tails get increasingly lighter

as the degrees of freedom increase. Considering that the variance of the Student’s t only exists for degrees

of freedom (df = n - 1) greater than 2 and the kurtosis for df > 4, the convergence of the Student’s

t is remarkably fast. The underlying distribution itself—not even the sampling distribution—

becomes quite close to normal with as few as 5 degrees of freedom. In this case, the K-S distance

is already down to .03, and the CDF is very close to the normal. The value outside 1.96 standard

deviations is 5.3%, which is more than the normal (as expected), but still very close. One thing to

note here is the value of the kurtosis. It is 6/(df − 4) for the Student’s t, which shows that even

distributions with a high kurtosis (6 in this case) can be good approximations to the normal. I

looked at averages for two cases: first, with degrees of freedom equal to 2.5; and second, with

degrees of freedom equal to 4.1. To get the K-S distance sufficiently small for df = 2.5, it

required a sample size of 320. This shows how the sample size will blow up and that eventually

the Central Limit Theorem won’t apply as degrees of freedom get arbitrarily close to 2. For

degrees of freedom = 4.1, the K-S distance gets very small with samples as small as 5. The only

statistic that is difficult to lose is the kurtosis (n = 200 for superior approximation). The example

of the Student’s t casts some doubt on the importance of the fourth moment as a metric for

comparing distributions.

       Below are two tables containing the results of my investigations. The tables are organized

so that each row reports the minimum sample size needed to satisfy each criterion. The minimum sample

size needed for an adequate or superior approximation would therefore be the maximum in each

row, which is also reported.

Adequate Approximation

  Distribution            ∣Kurtosis∣<.5   ∣Skewness∣<.25   Tail prob. .04 < x < .06   K-S Distance <.05   Minimum n
  Uniform                        3               1                   2                        2                3
  Beta (α=β=1/3)                 4               1                   3                        3                4
  Exponential                   12              64                   5                        8               64
  Binomial (p=.1)               11             114                  14                      332              332
  Binomial (p=.5)                4               1                  12                       68               68
  Student’s t (2.5 df)         N/A             N/A                 13                       20               20
  Student’s t (4.1 df)         120               1                   1                        2              120

Superior Approximation

  Distribution            ∣Kurtosis∣<.3   ∣Skewness∣<.15   Tail prob. .04 < x < .06   K-S Distance <.02   Minimum n
  Uniform                        4               1                   2                        2                4
  Beta (α=β=1/3)                 6               1                   3                        4                6
  Exponential                   20             178                   5                       45              178
  Binomial (p=.1)               18             317                  14                     1850             1850
  Binomial (p=.5)                7               1                  12                      390              390
  Student’s t (2.5 df)         N/A             N/A                 13                      320              320
  Student’s t (4.1 df)         200               1                   1                        5              200

       The first thing I notice in looking at these tables is the magnitude of the numbers. With

the exception of the beta family of distributions, none of these fairly common distributions have

sampling distributions well approximated by the normal for n around 30. Because of the

widespread use of and reliance on hypothesis tests, it is interesting to note that it can take a

very large sample size to make such tests reliable. Also, it is clear that skewness in

the underlying distribution is the hardest thing to correct for in the sampling distribution. In the

distributions where kurtosis was high, the alignment of the EDF and the normal CDF was much

faster than those with a strong skew. Another alarming result is the high sample size

requirements for the binomial distribution. Part of the problem in minimizing the K-S distance

here was the fact that the distribution only takes on discrete values. If one were to employ

continuity correction methods to the binomial, I suspect that it would be well approximated by

the normal much faster than in my study. Finally, these studies have made me very suspicious of

kurtosis as a valuable statistic for measuring normality. Even when it is not defined, such as with

the Student’s t with four or fewer degrees of freedom, the other statistics suggest that the

convergence is quite fast. A graph of the PDF for the Student’s t with only five degrees of

freedom shows that it is already close to the normal. Thus, taking averages of those variables will

make the approximation even better with only a small value of n.

       So, what are the implications of all this? First, it reinforces the fact that statisticians

should strive to collect as much data as possible when doing inferential tests. Second, it indicates

the importance of methods for doing inference when traditional assumptions of normality fail. In

practice, it seems that the conditions for invoking the Central Limit Theorem are not so trivial

after all. That being said, its pervasiveness in the social and natural sciences makes it one of the

most important theorems in mathematics.
