Degrees of Freedom

For most practical purposes (including managerial applications), it suffices to let a computer count "degrees of freedom," and to use the "approximately 2" multiplier the computer provides in order to determine the margin of error in an estimate (at the 95%-confidence level). The notes below, and on the tabs that follow, explain what's going on behind the scenes.

The variance of a population characteristic is computed using the formula:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

On the other hand, the variance of sample data, used as an estimate of the population variance, is computed as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Why, you might well wonder, is the denominator in the second formula n-1 instead of n? Why don't we instead use:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (?)$$

Imagine that you knew the value of μ, the population mean. Then it would be natural, and correct, to estimate the population variance using the formula:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \qquad (!)$$

Since we don't know μ, we cheat just a little and use the sample mean instead. This means that the same data is being used to make two distinct estimates: first the mean, and then the variance. When we do this, the numbers typically all fit together a bit too well.

We'll use Excel's Solver tool to see this. Listed below are 20 random observations sampled from a population in which the characteristic being studied is uniformly distributed between 5 and 15. The true population mean is 10.

    sample:  9.34   9.16   5.27   9.86   8.90  13.75  13.52   6.44   5.64  10.70
            11.69  12.38  12.44   5.33   7.09  10.04   9.36   8.01   9.42  14.10

Consider the function:

$$f(w) = \sum_{i=1}^{n}(x_i - w)^2$$

If w = 10.000, this function takes the value 149.634. To find the value of w which minimizes this function, we select Tools, Solver, Solve (I've already set up the problem), and OK. Go ahead, try it.

The resulting value of w is called the "least-squares fit" to the sample data. [In fact, regression analysis uses the same least-squares-fit approach to estimate the coefficients of the regression model, and we've just done the equivalent of a regression analysis with no independent variables.]

The striking fact is that this value of w is exactly equal to the sample mean (here, 9.621), i.e., the sample mean is the single number which "fits" the entire sample better than any other number.

Except in the unlikely case that the sample mean is precisely equal to the true population mean, the sum of the squared deviations of the sample observations around the sample mean will be strictly smaller than the sum of the squared deviations around the true mean. Because of this, (?) is typically an underestimate of (!), and some adjustment must be made. It turns out that dividing by (n-1) instead of n scales the result up just enough to offset the downward bias created by using the same data to estimate both the population mean and the population variance. (This is demonstrated via simulation on the next tab.)

One useful way to think of all this is to picture the original n sample observations as each being "free" to take any value. Once we compute an estimate (such as the sample mean, our estimate of the population mean) using the data, however, one "degree of freedom" is lost: for any sample to yield this particular estimate, once n-1 of the observations were freely determined, the last observation would have a forced value.
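If you'd rather check this outside Excel, here is a minimal Python sketch of the same experiment (assuming numpy and scipy are available; the function name f and the variable names are mine, not the spreadsheet's). It minimizes the sum of squared deviations numerically, just as Solver does, and confirms that the minimizer is the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # The 20 observations from the example (uniform on [5, 15], true mean 10).
    x = np.array([9.34, 9.16, 5.27, 9.86, 8.90, 13.75, 13.52, 6.44, 5.64, 10.70,
                  11.69, 12.38, 12.44, 5.33, 7.09, 10.04, 9.36, 8.01, 9.42, 14.10])

    def f(w):
        # Sum of squared deviations of the sample around the candidate value w.
        return np.sum((x - w) ** 2)

    result = minimize_scalar(f)  # numerical minimization, standing in for Solver

    print(f(10.0))      # ~149.64 (the sheet shows 149.634; the listed values are rounded)
    print(result.x)     # the least-squares fit ...
    print(x.mean())     # ... equals the sample mean, ~9.622 (sheet: 9.621)
    print(f(result.x))  # strictly smaller than f(10.0): the source of the downward bias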
Therefore, any subsequent estimates (such as our estimate of the population variance) which are made using this first estimate will be based on data with only n-1 remaining degrees of freedom.

Many statistical techniques (in particular, regression analysis) entail making several estimates from the same sample data. Consequently, throughout much of what we'll be doing, the "counting" of degrees of freedom will be necessary in order to correct for the bias in our estimates that would otherwise result from everything fitting together a bit too well.

Note: The downward bias in the example could also be eliminated by drawing two separate samples, using one to estimate the population mean, and the other to estimate the population variance (and standard deviation). But this is less efficient than using a single sample and then adjusting for the lost degree of freedom in the second estimate.

Caution

The following tabs contain my best effort to demystify the notion of "degrees of freedom" and the t-distribution. Read them for personal interest. As we'll see when the course moves forward into regression analysis, dealing with all of this will ultimately be quite simple and mechanical. Reading the following tabs is not mandatory.

Adjusting for 1 Lost Degree of Freedom

Here, we draw a random sample of size 10 from a normally-distributed population. (That the observations are normally distributed is not important; they could be drawn from a population with any distribution.) The parameters are:

    μ, population mean               50
    σ, population standard deviation  5
    σ², population variance          25
    n, sample size                   10

The sum of the squared deviations from the sample mean, and from the true mean, are both computed. As discussed on the previous tab, the first sum is always smaller than the second. (In the spreadsheet, press F9 to resample.) One such sample:

    sample    (xi - x̄)²   (xi - μ)²
    45.561     20.829      19.700
    59.277     83.751      86.061
    49.309      0.667       0.478
    47.007      9.725       8.959
    45.447     21.885      20.728
    47.587      6.445       5.824
    58.932     77.550      79.774
    51.646      2.312       2.709
    49.950      0.031       0.003
    46.539     12.865      11.981
    x̄ = 50.125  sum 236.060  sum 236.217

50,000 simulation runs are used to estimate the average values of both sums (monitored cells $I$20 and $J$20):

                               Σ(xi - x̄)²   Σ(xi - μ)²
    mean                         224.968      250.105
    sample standard deviation    106.498      112.269
    minimum                       11.513       17.701
    maximum                      969.745     1096.328
    number of simulation runs     50,000       50,000

In the latter case, the sum of the squared deviations is, on average, just about 10 times the true variance. But the sum of the squared deviations around the sample mean is only, on average, about 9 = 10-1 times the true variance. This illustrates why we divide by 9, and more generally by n-1, to estimate the true variance (and standard deviation) from our sample data.
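The 50,000-run experiment is easy to replicate outside the spreadsheet. Below is a minimal numpy sketch under the same assumptions (μ = 50, σ = 5, n = 10; the seed is arbitrary and mine). The two printed averages should land near 9 × 25 = 225 and 10 × 25 = 250, matching the simulation results above.

    import numpy as np

    rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
    mu, sigma, n, runs = 50, 5, 10, 50_000  # parameters from the tab above

    samples = rng.normal(mu, sigma, size=(runs, n))
    xbar = samples.mean(axis=1, keepdims=True)

    ss_sample_mean = ((samples - xbar) ** 2).sum(axis=1)  # always the smaller sum
    ss_true_mean = ((samples - mu) ** 2).sum(axis=1)

    print(ss_sample_mean.mean())  # ≈ (n-1) * sigma**2 = 225
    print(ss_true_mean.mean())    # ≈  n   * sigma**2 = 250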
What Is This "t-Distribution"?

A central idea in estimation is that

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is approximately normally distributed, and therefore

$$\Pr\bigl(\,|\bar{X} - \mu| \le 1.96\,\sigma/\sqrt{n}\,\bigr) \approx 95\%.$$

The problem, of course, is that we don't know σ. We deal with this by cheating, and using s instead. But if we let S be the random variable that eventually takes the value s, then

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

isn't quite normally distributed. In particular, if our sample yields an underestimate for σ, T can take values quite a bit more extreme than does Z.

Again μ = 50, σ = 5, and n = 10, so σ/√n = 1.581. One sample: 42.008, 56.371, 44.285, 49.851, 57.841, 49.610, 53.643, 44.278, 53.765, 54.489, with x̄ = 50.614 and s = 5.531, so s/√n = 1.749. For this sample, Z = 0.388 and T = 0.351.

In the simulation below (1,000,000 runs; monitored cells $I$23 and $J$23), note how T at times takes much more extreme values than does Z:

                                   Z           T
    mean                         0.000      -0.001
    sample standard deviation    1.000       1.135
    minimum                     -4.701     -12.928
    maximum                      4.891      10.813
    number of simulation runs  1,000,000   1,000,000

What Is This "t-Distribution"?

Here, we'll again draw a sample of size 10 from a normally-distributed population (μ = 50, σ = 5, n = 10). This time, the normality of the population distribution is important if the sample size is no larger than a couple of dozen observations; for larger samples, any distribution would yield similar conclusions. (What we need is for X̄ to be roughly normally distributed, which will be the case if either the individual observations come from a normal distribution, or the sample size is large enough for the Central Limit Theorem to kick in.)

One sample: 43.778, 48.365, 51.328, 45.187, 50.778, 46.927, 42.218, 49.060, 45.141, 56.159, with x̄ = 47.894, so (x̄ - μ)/(σ/√n) = -1.332 and, with s/√n = 1.312, (x̄ - μ)/(s/√n) = -1.604.

The simulation estimates the left-tail probabilities for the distributions of Z and T, i.e., the probability that each statistic falls below 0.5, 1, 1.5, 2, 2.5, and 3 standard deviations above 0. The exact values come from NORMDIST(~,0,1,TRUE) and 1-TDIST(~,9,TRUE). (1,000,000 runs; monitored cells $M$5:$R$5 and $M$6:$R$6.)

    standard deviations above 0       0.5      1       1.5      2       2.5      3
    simulated left-tail prob., Z    69.15%  84.14%  93.30%  97.72%  99.38%  99.87%
    exact, NORMDIST                 69.15%  84.13%  93.32%  97.72%  99.38%  99.87%
    simulated left-tail prob., T    68.58%  82.84%  91.62%  96.14%  98.30%  99.24%
    exact, 1-TDIST                  68.55%  82.83%  91.61%  96.17%  98.31%  99.25%

Notice that these probabilities are all smaller for the distribution of T. This shows us that the upper tail of the distribution of T is "fatter" than the upper tail of the normal distribution, which is what we'd expect from our previous simulation. (Symmetrically, the lower tail of the distribution of T is fatter as well.)

What Is This "t-Distribution"?

Excel's TDIST(t,df,TRUE) gives the probability distribution of T. (Somewhat perversely, this Excel function is designed to give right-tail probabilities.) There is one t-distribution for every different number of degrees of freedom. The number of degrees of freedom in our example is 10-1 = 9.

One more sample, for reference: 51.988, 49.632, 46.811, 53.133, 44.439, 49.168, 43.860, 46.448, 49.121, 60.382, with x̄ = 49.498, so (x̄ - μ)/(σ/√n) = -0.317 and, with s/√n = 1.535, (x̄ - μ)/(s/√n) = -0.327.

Notice that the upper-tail probability for the normal distribution, 2 or more standard deviations above the mean, is 2.28%, and therefore going up and down 2 standard deviations captures a bit more than 95% of the distribution. But for the t-distribution with 9 degrees of freedom, the upper-tail probability 2 or more standard deviations above the mean is greater than 3%, so going up and down two standard deviations captures less than 95% of the distribution.

To capture 95%, we need to go out more than 2 standard deviations. Indeed, the TINV function shows us that we need to go up and down 2.262 standard deviations.
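For readers without Excel, the exact left-tail probabilities in the table, and the 2.262 multiplier, can be reproduced with scipy.stats (a sketch; scipy is an assumption, not part of the spreadsheet):

    from scipy.stats import norm, t

    df = 9  # degrees of freedom: n - 1 = 10 - 1

    # Left-tail probabilities; Excel: NORMDIST(k,0,1,TRUE) and 1-TDIST(k,9,TRUE).
    for k in (0.5, 1, 1.5, 2, 2.5, 3):
        print(f"{k:3}: Z {norm.cdf(k):.4f}   T {t.cdf(k, df):.4f}")

    # Upper tails at 2 standard deviations: about 2.28% for Z but about 3.8% for T,
    # which is why going up and down 2 captures less than 95% of the t-distribution.
    print(1 - norm.cdf(2), 1 - t.cdf(2, df))

    print(t.ppf(0.975, df))  # 2.262: the multiplier Excel's TINV(0.05, 9) reports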
As the number of degrees of freedom becomes very large, the t-distribution approaches the normal distribution, and our "approximately 2" multiplier approaches 1.96.
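A quick sketch of this convergence, again assuming scipy.stats: the two-sided 95% multiplier falls from 2.262 at 9 degrees of freedom toward the normal value of 1.960.

    from scipy.stats import norm, t

    # Two-sided 95% multiplier for various degrees of freedom.
    for df in (9, 30, 100, 1000):
        print(df, round(t.ppf(0.975, df), 3))
    # 9 -> 2.262, 30 -> 2.042, 100 -> 1.984, 1000 -> 1.962

    print("normal:", round(norm.ppf(0.975), 3))  # 1.96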