                                     Degrees of Freedom

For most practical purposes (including managerial applications), it suffices to let a
computer count "degrees of freedom," and to use the "approximately 2" multiplier the
computer provides in order to determine the margin of error in an estimate (at the
95%-confidence level).
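
As a concrete illustration of that recipe, here is a minimal Python sketch (an aside, not part of the original worksheet; the sample values are placeholders, and scipy's t.ppf supplies the multiplier):

    import numpy as np
    from scipy.stats import t

    x = np.array([9.2, 10.1, 8.7, 11.3, 9.9])   # placeholder sample data
    n = len(x)

    multiplier = t.ppf(0.975, df=n - 1)          # the "approximately 2" at 95% confidence
    margin = multiplier * x.std(ddof=1) / np.sqrt(n)
    print(x.mean(), "+/-", margin)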

The notes below, and on the next tabs, explain what's going on behind the scenes.

The variance of a population characteristic is computed using the formula:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

On the other hand, the variance of sample data, used as an estimate of the population
variance, is computed as:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Why, you might well wonder, is the denominator in the second formula n-1 instead of n ?
Why don't we instead use:
$$(?)\qquad \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$


Imagine that you knew the value of μ, the population mean. Then it would be natural -
and correct - to estimate the population variance using the formula:
$$(!)\qquad \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$

Since we don't know μ, we cheat just a little and use the sample mean
instead. This means that the same data is being used to make two
distinct estimates, first the mean and then the variance. When we do this,
typically the numbers all fit together a bit too well.
We'll use Excel's Solver tool to see this. Listed below are 20 random observations sampled from a population where the characteristic being studied is uniformly distributed between 5 and 15. The true population mean is 10.

    sample: 9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70,
            11.69, 12.38, 12.44, 5.33, 7.09, 10.04, 9.36, 8.01, 9.42, 14.10

Consider the function:

$$\sum_{i=1}^{n} (x_i - w)^2$$

If w = 10.000, this function takes the value 149.634. To find the value of w which minimizes this function, we select Tools, Solver, Solve (I've already set up the problem), and OK. Go ahead, try it.

The resulting value of w is called the "least-squares fit" to the sample data. [In fact, regression analysis uses the same least-squares-fit approach to estimate the coefficients of the regression model, and we've just done the equivalent of a regression analysis with no independent variables.]

The striking fact is that this value of w is exactly equal to the sample mean, i.e., the sample mean is the single number which "fits" the entire sample better than any other number.

    sample mean = 9.621
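
To replicate the Solver experiment outside Excel, here is a minimal Python sketch; scipy's minimize_scalar stands in for Solver, and the data are the 20 displayed (two-decimal) values:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # The 20 observations listed above, as displayed.
    x = np.array([9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70,
                  11.69, 12.38, 12.44, 5.33, 7.09, 10.04, 9.36, 8.01, 9.42, 14.10])

    # Sum of squared deviations around a candidate value w.
    def sse(w):
        return np.sum((x - w) ** 2)

    result = minimize_scalar(sse)
    print(result.x)     # the least-squares fit ...
    print(x.mean())     # ... coincides with the sample mean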

Except in the unlikely case that the sample mean is precisely equal to the
true population mean, the sum of the squared deviations of the sample
observations around the sample mean will be strictly smaller than the sum of
the squared deviations around the true mean. Because of this, (?) typically
is an underestimate of (!), and some adjustment must be made. It turns out
that dividing by (n-1) instead of n scales the result up just enough to offset
the downward bias created by using the same data to estimate both the
population mean and the population variance. (This is demonstrated via
simulation on the next tab.)
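
One step, not spelled out above, pins down the size of the bias exactly. A standard identity decomposes the sum of squares around the true mean:

$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$

Taking expectations, $E\left[n(\bar{X}-\mu)^2\right] = \sigma^2$, so $E\left[\sum_{i=1}^{n}(X_i-\bar{X})^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$; dividing by n − 1 therefore yields an unbiased estimate of σ².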

One useful way to think of all this is to picture the original n sample
observations as each being "free" to take any value. Once we compute an
estimate (such as the sample mean, our estimate of the population mean)
using the data, however, one "degree of freedom" is lost: for any sample to
yield this particular estimate, once n − 1 of the observations were freely
determined, the last observation would have a forced value. (For example, if
n = 3 and the sample mean is 10, then first two observations of 8 and 9 force
the third to be 13.) Therefore, any subsequent estimates (such as our estimate
of the population variance) made using this first estimate will be based on
data with only n − 1 remaining degrees of freedom.

Many statistical techniques (in particular, regression analysis) entail the
making of several estimates from the same sample data. Consequently,
throughout much of what we'll be doing, the "counting" of degrees of
freedom will be necessary, in order to correct for the bias in our estimates
that would otherwise result from everything fitting together a bit too well.


Note: The downward bias in the example could also be eliminated by drawing
two separate samples, using one to estimate the population mean, and the
other to estimate the population variance (and standard deviation). But this
is less efficient than using a single sample and then adjusting for the lost
degree of freedom in the second estimate.




                                        Caution

The following tabs contain my best effort to demystify the notion of "degrees of freedom"
and the t-distribution. Read them for personal interest. As we'll see when the course
moves forward into regression analysis, dealing with all of this will ultimately be quite
simple and mechanical. Reading the following tabs is not mandatory.

                   Adjusting for 1 Lost Degree of Freedom

Here, we're drawing a random sample of size 10 from a normally-distributed
population. That the observations are normally distributed is not important;
they could be drawn from a population with any distribution.

    50    μ, population mean
     5    σ, population standard deviation
    25    σ², population variance
    10    n, sample size

The sum of the squared deviations from the sample mean, and from the true
mean, are both computed. As discussed on the previous tab, the first sum is
always smaller than the second. (Press 'F9' to resample.)

    sample        (xi − x̄)²    (xi − μ)²
    45.561          20.829       19.700
    59.277          83.751       86.061
    49.309           0.667        0.478
    47.007           9.725        8.959
    45.447          21.885       20.728
    47.587           6.445        5.824
    58.932          77.550       79.774
    51.646           2.312        2.709
    49.950           0.031        0.003
    46.539          12.865       11.981
    x̄ = 50.125   sum 236.060  sum 236.217

50,000 simulations are used to estimate the average values of both sums
(monitored cells $I$20 and $J$20):

                                  $I$20      $J$20
    mean                        224.968    250.105
    sample standard deviation   106.498    112.269
    minimum                      11.513     17.701
    maximum                     969.745   1096.328
    number of simulation runs    50,000     50,000

In the latter case, the sum of the squared deviations is, on average, just
about 10 times the true variance. But the sum of the squared deviations
around the sample mean is, on average, only about 9 = 10 − 1 times the
true variance.

This illustrates why we divide by 9, and more generally by n − 1, to estimate
the true variance (and standard deviation) from our sample data.
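The same experiment can be reproduced outside Excel. A minimal Python sketch under the parameters above (μ = 50, σ = 5, n = 10, 50,000 runs):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, runs = 50.0, 5.0, 10, 50_000

    samples = rng.normal(mu, sigma, size=(runs, n))   # one simulated sample per row
    xbar = samples.mean(axis=1, keepdims=True)

    ss_sample_mean = ((samples - xbar) ** 2).sum(axis=1)
    ss_true_mean   = ((samples - mu) ** 2).sum(axis=1)

    print(ss_sample_mean.mean())   # on average about (n-1)*sigma**2 = 225
    print(ss_true_mean.mean())     # on average about  n * sigma**2  = 250
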
                               What Is This "t-Distribution"?

A central idea in estimation is that

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is approximately normally distributed, and therefore

$$\Pr\left( \left|\bar{X} - \mu\right| \le 1.96\,\sigma/\sqrt{n} \right) = 95\% .$$

The problem, of course, is that we don't know σ. We deal with this by
cheating, and using s instead. But, if we let S be the random variable that
eventually takes the value s,

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

isn't quite normally distributed. In particular, if our sample yields an
underestimate for σ, T can take values quite a bit more extreme than does Z.

The simulation uses the parameters and current sample below:

    50    μ, population mean
     5    σ, population standard deviation
    10    n, sample size

    sample: 42.008, 56.371, 44.285, 49.851, 57.841,
            49.610, 53.643, 44.278, 53.765, 54.489

    x̄ = 50.614     s = 5.531
    σ/√n = 1.581   s/√n = 1.749
    Z = 0.388      T = 0.351

In the simulation results, note how T takes, at times, much more extreme
values than does Z (monitored cells $I$23 and $J$23):

                                   $I$23       $J$23
    mean                           0.000      -0.001
    sample standard deviation      1.000       1.135
    minimum                       -4.701     -12.928
    maximum                        4.891      10.813
    number of simulation runs  1,000,000   1,000,000
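Again, a minimal Python sketch of this simulation, under the same parameters, for readers who want to see T's extremes directly:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, runs = 50.0, 5.0, 10, 1_000_000

    samples = rng.normal(mu, sigma, size=(runs, n))
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)        # sample standard deviation (n-1 divisor)

    z = (xbar - mu) / (sigma / np.sqrt(n))
    t = (xbar - mu) / (s / np.sqrt(n))

    # Z behaves like a standard normal; T is more variable, with far wider extremes.
    for name, v in (("Z", z), ("T", t)):
        print(name, v.mean(), v.std(ddof=1), v.min(), v.max())
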
                               What Is This "t-Distribution"?

Here, we'll again draw a sample of size 10 from a normally-distributed
population. This time, the normality of the population distribution is
important if the sample size is no larger than a couple of dozen
observations; for larger samples, any distribution would yield similar
conclusions. (What we need is for X̄ to be roughly normally distributed,
which will be the case if either the individual observations come from a
normal distribution, or the sample size is large enough for the Central
Limit Theorem to kick in.)

    50    μ, population mean
     5    σ, population standard deviation
    10    n, sample size

    sample: 43.778, 48.365, 51.328, 45.187, 50.778,
            46.927, 42.218, 49.060, 45.141, 56.159
    x̄ = 47.894

    σ/√n = 1.581    (x̄ − μ)/(σ/√n) = -1.332    [left tail via NORMDIST(~,0,1,TRUE)]
    s/√n = 1.312    (x̄ − μ)/(s/√n) = -1.604    [left tail via 1-TDIST(~,9,TRUE)]

The simulation estimates the left-tail probabilities for the distributions
of Z and T.

Notice that these probabilities are all smaller for the distribution of T.
This shows us that the upper tail of the distribution of T is "fatter" than
the upper tail of the normal distribution, which is what we'd expect from our
previous simulation. (Symmetrically, the lower tail of the distribution of T
is fatter as well.)
    standard deviations above 0:        0.5     1.0     1.5     2.0     2.5     3.0
    is Z below this? (current draw)       1       1       1       1       1       1
    is T below this? (current draw)       1       1       1       1       1       1

    left-tail probabilities for Z (monitored cells $M$5 through $R$5)
    mean                        69.15%  84.14%  93.30%  97.72%  99.38%  99.87%
    sample standard deviation   46.19%  36.53%  25.00%  14.93%   7.87%   3.62%
    minimum                          0       0       0       0       0       0
    maximum                          1       1       1       1       1       1
    number of simulation runs   1,000,000 in each column
    exact (NORMDIST)            69.15%  84.13%  93.32%  97.72%  99.38%  99.87%

    left-tail probabilities for T (monitored cells $M$6 through $R$6)
    mean                        68.58%  82.84%  91.62%  96.14%  98.30%  99.24%
    sample standard deviation   46.42%  37.70%  27.71%  19.26%  12.94%   8.67%
    minimum                          0       0       0       0       0       0
    maximum                          1       1       1       1       1       1
    number of simulation runs   1,000,000 in each column
    exact (1-TDIST)             68.55%  82.83%  91.61%  96.17%  98.31%  99.25%
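
The exact rows above can be checked directly in Python; scipy's norm.cdf and t.cdf play the roles of NORMDIST and 1-TDIST:

    import numpy as np
    from scipy.stats import norm, t

    cutoffs = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

    print(norm.cdf(cutoffs))      # left tails for Z: 0.6915, 0.8413, ..., 0.9987
    print(t.cdf(cutoffs, df=9))   # left tails for T are smaller at every cutoff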
                                What Is This "t-Distribution"?

Excel's TDIST(t,df,TRUE) gives the probability distribution of T.
(Somewhat perversely, this Excel function is designed to give right-tail
probabilities.)

There is one t-distribution for every different number of degrees of
freedom. The number of degrees of freedom in our example is 10 − 1 = 9.

    50    μ, population mean
     5    σ, population standard deviation
    10    n, sample size

    sample: 51.988, 49.632, 46.811, 53.133, 44.439,
            49.168, 43.860, 46.448, 49.121, 60.382
    x̄ = 49.498

    σ/√n = 1.581    (x̄ − μ)/(σ/√n) = -0.317    [left tail via NORMDIST(~,0,1,TRUE)]
    s/√n = 1.535    (x̄ − μ)/(s/√n) = -0.327    [left tail via 1-TDIST(~,9,TRUE)]

Notice that the upper tail probability for the normal distribution, 2 or
more standard deviations above the mean, is 2.28%, and therefore going up
and down 2 standard deviations captures a bit more than 95% of the
distribution. But for the t-distribution with 9 degrees of freedom, the
upper tail probability 2 or more standard deviations above the mean is
greater than 3%, so going up and down two standard deviations captures less
than 95% of the distribution.

To capture 95%, we need to go out more than 2 standard deviations. Indeed,
the TINV function shows us that we need to go up and down 2.262 standard
deviations.

As the number of degrees of freedom becomes very large, the t-distribution
approaches the normal distribution, and our "approximately 2" multiplier
approaches 1.96.
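
These multipliers are easy to reproduce in Python; scipy's t.ppf plays the role of Excel's TINV:

    from scipy.stats import norm, t

    print(t.ppf(0.975, df=9))        # 2.262: the multiplier for 9 degrees of freedom
    for df in (30, 100, 1000):
        print(df, t.ppf(0.975, df))  # shrinks toward the normal multiplier
    print(norm.ppf(0.975))           # 1.960: the limit as df grows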
