Posted on 1/14/2011. Public Domain.
Do data belong to the Normal Distribution?

As you have probably figured out by now, the normal distribution plays a major role in many types of probabilistic and statistical analyses. Some statistical procedures depend heavily on the assumption of normality, and when this assumption is found to be questionable, these procedures should be avoided. It is therefore useful to have techniques available for checking the validity of the normality assumption. This is the objective of this short note.

The Normal Probability Plot

The following procedure helps conclude qualitatively whether a sample was drawn from a normal distribution. Here is a summary of the procedure:

1. Place the values in the data set (X) into an ordered array. Call the smallest value in the ordered set X1 and the largest value Xn. The set then becomes X1, X2, …, Xn.
2. Calculate Fx/(n+1), the cumulative relative frequency for each value Xi, where Fx is the cumulative frequency of Xi. From the table of the standard normal distribution or from Excel (using "=NORMSINV(Fx/(n+1))"), find the corresponding standard normal value Zi for each point in the ordered data set. In doing so we hypothesize that the data set was drawn from a normal distribution with some mean and standard deviation.
3. Plot the pairs of points (Zi, Xi), with the observed data values (Xi) on the vertical axis and the associated Zi values on the horizontal axis.
4. Inspect the plotted points for evidence of linearity (i.e. a straight line).

Explanation: The Z-score for any value of X is Z = (X – μ)/σ, so there is a linear relationship between X and Z, namely X = σZ + μ. The empirical probability of observing a value no larger than X in the sample is Fx/(n+1) (where Fx is the cumulative frequency of X). If X is indeed normally distributed, the Z value obtained from the standard normal distribution for the probability Fx/(n+1) should be close to the Z-score of that X, and should therefore be linearly related to X.
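Steps 1 to 3 of the procedure can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet used in the note: the function name normal_plot_points and the 19-value sample are hypothetical, and NormalDist().inv_cdf from the standard library stands in for Excel's NORMSINV.

```python
from bisect import bisect_right
from statistics import NormalDist


def normal_plot_points(sample):
    """Return (z, x) pairs for a normal probability plot.

    Steps 1-3 of the procedure: sort the data, take Fx/(n+1) as the
    cumulative relative frequency, and convert it to a standard normal Z.
    """
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for x in xs:
        fx = bisect_right(xs, x)               # cumulative frequency of x
        z = std_normal.inv_cdf(fx / (n + 1))   # plays the role of NORMSINV
        points.append((z, x))
    return points


# Hypothetical sample of 19 distinct values, used only to show the call.
sample = [61, 48, 75, 52, 83, 57, 66, 70, 55, 79,
          63, 68, 59, 72, 65, 77, 50, 81, 74]
for z, x in normal_plot_points(sample)[:2]:
    print(f"x = {x}, z = {z:.3f}")
# x = 48, z = -1.645
# x = 50, z = -1.282
```

With 19 distinct values the two smallest observations receive the probabilities 1/20 and 2/20, whatever the data are. Repeated values share the cumulative frequency of their last occurrence, so they map to the same Z.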
So if there is a linear relationship between X and the Z values taken from the table, it is reasonable to treat X as normally distributed.

Example 1:
Suppose we wish to obtain the first and second standard normal ordered values (Z1 and Z2) for a sample of 19 observations (each observation is different in value).
Obtaining Z1: Since Fx = 1, P(Z < Z1) = 1/(19+1) = 1/20 = .05. Under the standard normal distribution Z1 = -1.645 (note that P(Z < -1.645) = .05).
Obtaining Z2: Since Fx = 2, P(Z < Z2) = 2/(19+1) = 2/20 = .10. Under the standard normal distribution Z2 = -1.282 (note that P(Z < -1.282) = .10).
In a similar manner we complete the rest of the Zi values. Now the pairs (Zi, Xi) are plotted, and if they are found to lie (approximately) along a straight line we can reasonably say that the data belong to a normal distribution.

To determine whether or not there is a linear relationship between X and Z we can test the correlation between them as follows:
H0: The data come from a normal distribution.
H1: The data do not come from a normal distribution.
Calculate the test statistic (R) as the sample correlation coefficient between X and Z. Compare R to a critical value Rcr from a table of critical values (provided below in Appendix 2; the table was constructed from simulation results). Rcr depends on the sample size and the significance level selected for the test. If R < Rcr there is sufficient evidence to reject H0 and conclude that the data are not normal at the alpha level of significance.

Important Comment: If the Xi and Zi appear to form a linear relationship, then the line intercept estimates the population mean (μ), and the line slope estimates the standard deviation (σ).

Example 2
Test scores of 19 students in each of two classes were drawn. Some of the sorted scores are shown below, along with the cumulative proportion calculated from the sample (Fx/(n+1)) and the resulting Z values. Details can be found in the file Assess Normal.
Partial set:

Order   Class I   Class II   Prob   Z value
1       48        47         0.05   -1.64485
2       52        54         0.10   -1.28155
3       55        58         0.15   -1.03643
4       57        61         0.20   -0.84162

After the Z values were derived, the following two graphs were plotted.

[Figure: Class I normal probability plot; X on the vertical axis (40 to 90), Z on the horizontal axis (-2 to 2).]

Conclusion: In Class I the scores appear to have been produced from a normal distribution. From the graph it seems that μ = 65 and σ = (83 – 47)/(1.645 – (-1.645)) = 10.94.

Now observe the probability plot for Class II.

[Figure: Class II normal probability plot; X on the vertical axis (40 to 90), Z on the horizontal axis (-2 to 2).]

The result is unclear. Although there seems to be some curvature in the line, the "non-normality" does not appear to be too severe. Since the sample size is only 19, one should not judge the distribution to be non-normal from the plot alone. Let us proceed by testing the correlation as explained above (we'll run the correlation test for both classes):
H0: The data come from a normal distribution.
H1: The data do not come from a normal distribution.
The test statistic calculated with Excel for Class I: R = .999
The test statistic calculated with Excel for Class II: R = .959
The critical value for n = 19 and alpha = .05 is .9479.
There is insufficient evidence to reject the normal distribution at the 5% significance level for either class (since .999 > .9479 and .959 > .9479).

To estimate μ and σ we run a linear regression to construct the best-fit line, which results in the equation X = 10.89Z + 70.684. So μ ≅ 70.7 and σ ≅ 10.89 (see the Excel file).

The following example demonstrates how to construct a probability plot when the sample contains repeated values (which did not occur in the previous example).

Example 3
To help make a decision about an expansion plan, the president of a music company needs to know how many CDs teenagers buy annually. Accordingly, he commissions a survey of 250 teens, in which they are asked to report how many CDs they purchased in the previous 12 months. Can we assume that the number of CDs bought annually by a teenager is normally distributed?
Solution
The following table summarizes the data (see the file AssessNormal1 – the Probability Plot sheet):

X    f    Fx    Fx/(n+1)    Z
6    1    1     0.003984    -2.65342
8    1    2     0.007968    -2.41037
9    7    9     0.035857    -1.80093
10   10   19    0.075697    -1.43462
11   16   35    0.139442    -1.08283
12   26   61    0.243028    -0.6966
13   23   84    0.334661    -0.42708
14   25   109   0.434263    -0.16553
15   29   138   0.549801    0.125158
16   28   166   0.661355    0.416163
17   26   192   0.76494     0.722285
18   29   221   0.880478    1.17738
19   11   232   0.924303    1.434623
20   11   243   0.968127    1.853959
21   4    247   0.984064    2.146006
22   1    248   0.988048    2.258663
23   1    249   0.992032    2.410372
26   1    250   0.996016    2.653417

Explanations:
The column 'X' represents the number of CDs purchased by a teenager annually.
The column 'f' is the frequency of X (it counts how many times each number appears in the sample). For example, the value 11 appears 16 times (16 teenagers purchased 11 CDs).
The column 'Fx' is the cumulative frequency. For example, 10 or fewer CDs per person appear 19 times (1+1+7+10 = 19).
The column 'Fx/(n+1)' is the empirical cumulative relative frequency. For example, F10/(250+1) = 19/251 = .075697.
'Z' is found with "NORMSINV" as before.

Now we can draw the graph of Z against X.

[Figure: normal probability plot of the CD data.]

Interpretation: The graph raises some suspicion with regard to the normality of the CD distribution, because the two ends are curved. Yet the amount of deviation from a straight line needs to be checked more formally. The correlation test used above yields the following results: R = .990375; Rcr = .9943 (for n = 250, alpha = .05). Since R < Rcr, the rule stated above gives sufficient evidence to reject normality at the 5% level of significance.

In what follows we present a few hypothesis testing procedures designed to test the normality of a data set analytically.

The Goodness of Fit Chi Squared Test

Example 4
Re-solve Example 3 using the goodness of fit chi-square test at the 5% significance level.
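Since Example 4 reuses the Example 3 data, it may help to rebuild that table in code first. The sketch below recomputes the Fx, Fx/(n+1) and Z columns from the X and f columns, then computes a correlation statistic. The names CD_COUNTS, probability_plot_rows and correlation are illustrative. One assumption deserves emphasis: R is computed here over the 18 table rows (one point per distinct X), which appears to reproduce the reported R = .990375. Correlating all 250 individual observations would weight each row by f and give a different value.

```python
from math import sqrt
from statistics import NormalDist

# X (CDs bought) -> f (number of teenagers), copied from the Example 3 table.
CD_COUNTS = {6: 1, 8: 1, 9: 7, 10: 10, 11: 16, 12: 26, 13: 23, 14: 25, 15: 29,
             16: 28, 17: 26, 18: 29, 19: 11, 20: 11, 21: 4, 22: 1, 23: 1, 26: 1}
R_CRITICAL = 0.9943  # Appendix 2, n = 250, alpha = .05


def probability_plot_rows(counts):
    """Return one (x, f, Fx, Fx/(n+1), z) tuple per distinct value of X."""
    n = sum(counts.values())
    std_normal = NormalDist()
    rows, cumulative = [], 0
    for x in sorted(counts):
        cumulative += counts[x]
        p = cumulative / (n + 1)
        rows.append((x, counts[x], cumulative, p, std_normal.inv_cdf(p)))
    return rows


def correlation(xs, zs):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_z = sum(xs) / len(xs), sum(zs) / len(zs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    szz = sum((z - mean_z) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)


rows = probability_plot_rows(CD_COUNTS)
# Assumption: R is taken over the table rows, not over the 250 observations.
r = correlation([row[0] for row in rows], [row[4] for row in rows])
print(f"R = {r:.4f}; below Rcr: {r < R_CRITICAL}")
```

The comparison with Rcr follows the rule given earlier: a value of R below the critical value counts as evidence against normality.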
Solution: First, determine Z values that comply with the rule of 5 (the expected number of observations falling in each interval should be at least 5). The following table shows such a selection of Z values, along with additional information:

i   Interval        Probability   Expected (Ei)   Actual (Fi)
1   z ≤ -2          0.02275       5.6875          2
2   -2 < z ≤ -1     0.135905      33.97625        33
3   -1 < z ≤ 0      0.341345      85.33625        74
4   0 < z ≤ 1       0.341345      85.33625        112
5   1 < z ≤ 2       0.135905      33.97625        26
6   z > 2           0.02275       5.6875          3

Explanations:
Determine the probabilities for the ranges selected: P(Z ≤ -2) = .02275; P(-2 < Z ≤ -1) = .1359; and so on.
Comment: The Z values (-2, -1, 0, 1, 2) were selected such that, when the interval probabilities are calculated, the expected number of observations in each interval (Ei) is at least 5. See details below. A symmetrical selection of Z values is preferable.

The expected values (Ei) are calculated as follows:
First interval: E1 = P(Z ≤ -2)(250) = 5.6875
Second interval: E2 = P(-2 < Z ≤ -1)(250) = 33.97625
…and so on…

The actual frequency (Fi) counts the number of sample observations in each interval. Of course you first need to transform the observed values Xi to their corresponding Z-scores using the sample mean and sample standard deviation,

$$Z_i = \frac{X_i - 14.98}{3.14},$$

and then count how many Z values belong to each interval. For example, in the interval Z ≤ -2 there are two Z-scores, so F1 = 2.

Test the following hypotheses:
H0: The distribution is normal with μ = 14.98 and σ = 3.14.
H1: The distribution is not normal with these parameters.

The test is performed using a chi-square distribution. Use Ei and Fi to calculate the chi-square statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(E_i - F_i)^2}{E_i} = \frac{(5.6875 - 2)^2}{5.6875} + \frac{(33.97625 - 33)^2}{33.97625} + \frac{(85.33625 - 74)^2}{85.33625} + \dots = 15.39$$

The test is performed as follows: if $\chi^2 > \chi^2_{\alpha,\,k-1-L}$, reject H0 (where k is the number of intervals and L is the number of parameters estimated; since we estimate both μ and σ, L = 2).
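The expected counts, the actual counts and the statistic above can be recomputed with a short sketch. It assumes the Example 3 frequency table and the quoted estimates 14.98 and 3.14. The function name chi_square_normal_fit is hypothetical, and the critical value 7.8147 (chi-square, alpha = .05, 3 degrees of freedom) is entered as a constant because the Python standard library has no chi-square inverse.

```python
from statistics import NormalDist

# X (CDs bought) -> f (number of teenagers), copied from the Example 3 table.
CD_COUNTS = {6: 1, 8: 1, 9: 7, 10: 10, 11: 16, 12: 26, 13: 23, 14: 25, 15: 29,
             16: 28, 17: 26, 18: 29, 19: 11, 20: 11, 21: 4, 22: 1, 23: 1, 26: 1}
CHI2_CRITICAL = 7.8147  # chi-square, alpha = .05, 6 - 1 - 2 = 3 degrees of freedom


def chi_square_normal_fit(counts, mean, sd, cuts):
    """Return (expected, observed, chi2) for intervals split at the Z cut points."""
    n = sum(counts.values())
    std_normal = NormalDist()
    # Cumulative probabilities at the interval edges, from -infinity to +infinity.
    edges = [0.0] + [std_normal.cdf(c) for c in cuts] + [1.0]
    expected = [(hi - lo) * n for lo, hi in zip(edges, edges[1:])]
    observed = [0] * (len(cuts) + 1)
    for x, f in counts.items():
        z = (x - mean) / sd
        # Intervals have the form (lower, upper], so count the cuts below z.
        observed[sum(1 for c in cuts if z > c)] += f
    chi2 = sum((e - o) ** 2 / e for e, o in zip(expected, observed))
    return expected, observed, chi2


expected, observed, chi2 = chi_square_normal_fit(
    CD_COUNTS, 14.98, 3.14, [-2, -1, 0, 1, 2])
print(observed)
print(f"chi-square = {chi2:.2f}; reject H0: {chi2 > CHI2_CRITICAL}")
```

Because the code uses unrounded normal probabilities, the last digit of the statistic may differ slightly from the hand calculation.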
Let the significance level be .05. This rule translates to a critical value of $\chi^2_{.05,\,6-1-2} = 7.8147$ (a value found in the chi-square table or by using the Excel function "=CHIINV(.05,3)").
Conclusion: Since 15.39 > 7.8147, there is sufficient evidence to reject H0 at the 5% significance level. The distribution is not normal with μ = 14.98 and σ = 3.14.

Anderson Darling Test

This is a very strong test that works well on small samples (even n ≤ 25). The test is performed on the ordered data set (X1 ≤ X2 ≤ … ≤ Xn). It applies to any distribution. Specifically for the normal case define:

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}\Bigl[(2i-1)\ln\Phi(z_i) + \bigl(2(n-i)+1\bigr)\ln\bigl(1-\Phi(z_i)\bigr)\Bigr]$$

Here z_i is calculated by

$$z_i = \frac{x_i - \bar{x}}{s},$$

where $\bar{x}$ and s are the sample mean and standard deviation respectively. Also, Φ(z_i) = Pr(Z < z_i) under the standard normal distribution. Now calculate the statistic (A*)², the adjustment of A² to the sample size (especially important for small samples), by

$$(A^*)^2 = A^2\left(1 + \frac{0.75}{n} + \frac{2.25}{n^2}\right).$$

If (A*)² > A²crit the hypothesis of normality is rejected. Below are a few critical values A²crit.

alpha     0.1     0.05    0.025   0.01
A²crit    0.631   0.752   0.873   1.035

Example 6
For the data used in Example 3 here is a summary of the calculations:
A² = -250 – (1/250)[(2(1)-1)ln Φ(z1) + (2(250-1)+1)ln(1-Φ(z1)) + (2(2)-1)ln Φ(z2) + (2(250-2)+1)ln(1-Φ(z2)) + …] = 1.42
(A*)² = 1.42(1 + .75/250 + 2.25/250²) = 1.43
Find details in the file AssessNormal1 - Anderson Darling CD example.
A²crit for the 5% significance level = .752.
Since 1.43 > .752 there is sufficient evidence at the 5% significance level to reject the null hypothesis. The sample does not belong to a normal distribution.

The Lilliefors Test

This hypothesis test method is known to give very strong results for samples of size n2000. As in the normal plot approach, here too we calculate cumulative probabilities. Yet here we compare probabilities for a known normal distribution with their sample-based empirical counterparts. Here is a summary of the procedure:

1. Determine the mean and standard deviation of the normal distribution under investigation. Set up the hypotheses:
   H0: The distribution is normal with mean μ and standard deviation σ.
   H1: The distribution is not normal.
2. Place the values in the data set (X) into an ordered array.
3. Find the corresponding standard normal Zi values for each point in the ordered data set using the hypothesized mean and standard deviation. That is, Zi = (Xi – μ)/σ.
4. Determine the cumulative normal probabilities F(Zi) = P(Z < Zi) for each Zi value found in step 3.
5. Determine the cumulative sample distribution S(Xi) = Fx/n for each point in the sample.
6. Calculate the largest absolute difference (D) between F(*) and S(*):
   D = max{|F(Z1)-S(X1)|, |F(Z2)-S(X2)|, …, |F(Zn)-S(Xn)|}
7. Perform the test as follows: If D > Dcr, reject the null hypothesis. Otherwise, do not reject the null hypothesis. Dcr is a critical value determined by alpha and the sample size, and is provided by the Lilliefors table (see below).

The Lilliefors method was applied to a data set of n = 2000, which can be found in AssessNormal1 – Lilliefors; all the calculations were performed in Excel.
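Both the Anderson Darling statistic and the Lilliefors distance D can be sketched in Python. The illustration below uses the Example 3 CD data with the estimates 14.98 and 3.14 quoted earlier, not the n = 2000 data set mentioned above. The function names are hypothetical, and D follows steps 3 to 6 exactly as listed, with S(Xi) = Fx/n. No critical value Dcr is computed, since it has to be read from the Lilliefors table.

```python
from bisect import bisect_right
from math import log
from statistics import NormalDist

# X (CDs bought) -> f (number of teenagers), copied from the Example 3 table.
CD_COUNTS = {6: 1, 8: 1, 9: 7, 10: 10, 11: 16, 12: 26, 13: 23, 14: 25, 15: 29,
             16: 28, 17: 26, 18: 29, 19: 11, 20: 11, 21: 4, 22: 1, 23: 1, 26: 1}
MEAN, SD = 14.98, 3.14  # sample mean and standard deviation quoted in Example 4


def expand(counts):
    """Turn an {x: f} table into the ordered list of individual observations."""
    return [x for x in sorted(counts) for _ in range(counts[x])]


def anderson_darling(xs, mean, sd):
    """Return (A2, adjusted A2) for a normal distribution with this mean and sd."""
    xs = sorted(xs)
    n = len(xs)
    std_normal = NormalDist()
    total = 0.0
    for i, x in enumerate(xs, start=1):
        phi = std_normal.cdf((x - mean) / sd)
        total += (2 * i - 1) * log(phi) + (2 * (n - i) + 1) * log(1 - phi)
    a2 = -n - total / n
    return a2, a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def lilliefors_d(xs, mean, sd):
    """Return D = max |F(Zi) - S(Xi)| with S(Xi) = Fx / n (steps 3 to 6)."""
    xs = sorted(xs)
    n = len(xs)
    std_normal = NormalDist()
    return max(abs(std_normal.cdf((x - mean) / sd) - bisect_right(xs, x) / n)
               for x in set(xs))


observations = expand(CD_COUNTS)
a2, a2_adjusted = anderson_darling(observations, MEAN, SD)
d = lilliefors_d(observations, MEAN, SD)
print(f"A2 = {a2:.2f}, adjusted A2 = {a2_adjusted:.2f}, D = {d:.4f}")
```

The adjusted A² is then compared with A²crit from the table of critical values, and D with the Dcr entry for the chosen alpha and sample size.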
Appendix 1: The Lilliefors Table

Appendix 2: The critical value of correlation for the probability plot normality test

N       alpha = .01   alpha = .05
3       0.8687        0.8790
4       0.8234        0.8666
5       0.8240        0.8786
6       0.8351        0.8880
7       0.8474        0.8970
8       0.8590        0.9043
9       0.8689        0.9115
10      0.8765        0.9173
11      0.8838        0.9223
12      0.8918        0.9267
13      0.8974        0.9310
14      0.9029        0.9343
15      0.9080        0.9376
16      0.9121        0.9405
17      0.9160        0.9433
18      0.9196        0.9452
19      0.9230        0.9479
20      0.9256        0.9498
21      0.9285        0.9515
22      0.9308        0.9535
23      0.9334        0.9548
24      0.9356        0.9564
25      0.9370        0.9575
26      0.9393        0.9590
27      0.9413        0.9600
28      0.9428        0.9615
29      0.9441        0.9622
30      0.9462        0.9634
31      0.9476        0.9644
32      0.9490        0.9652
33      0.9505        0.9661
34      0.9521        0.9671
35      0.9530        0.9678
36      0.9540        0.9686
37      0.9551        0.9693
38      0.9555        0.9700
39      0.9568        0.9704
40      0.9576        0.9712
41      0.9589        0.9719
42      0.9593        0.9723
43      0.9609        0.9730
44      0.9611        0.9734
45      0.9620        0.9739
46      0.9629        0.9744
47      0.9637        0.9748
48      0.9640        0.9753
49      0.9643        0.9758
50      0.9654        0.9761
55      0.9683        0.9781
60      0.9706        0.9797
65      0.9723        0.9809
70      0.9742        0.9822
75      0.9758        0.9831
80      0.9771        0.9841
85      0.9784        0.9850
90      0.9797        0.9857
95      0.9804        0.9864
100     0.9814        0.9869
110     0.9830        0.9881
120     0.9841        0.9889
130     0.9854        0.9897
140     0.9865        0.9904
150     0.9871        0.9909
160     0.9879        0.9915
170     0.9887        0.9919
180     0.9891        0.9923
190     0.9897        0.9927
200     0.9903        0.9930
210     0.9907        0.9933
220     0.9910        0.9936
230     0.9914        0.9939
240     0.9917        0.9941
250     0.9921        0.9943
260     0.9924        0.9945
270     0.9926        0.9947
280     0.9929        0.9949
290     0.9931        0.9951
300     0.9933        0.9952
310     0.9936        0.9954
320     0.9937        0.9955
330     0.9939        0.9956
340     0.9941        0.9957
350     0.9942        0.9958
360     0.9944        0.9959
370     0.9945        0.9960
380     0.9947        0.9961
390     0.9948        0.9962
400     0.9949        0.9963
410     0.9950        0.9964
420     0.9951        0.9965
430     0.9953        0.9966
440     0.9954        0.9966
450     0.9954        0.9967
460     0.9955        0.9968
470     0.9956        0.9968
480     0.9957        0.9969
490     0.9958        0.9969
500     0.9959        0.9970
525     0.9961        0.9972
550     0.9963        0.9973
575     0.9964        0.9974
600     0.9965        0.9975
625     0.9967        0.9976
650     0.9968        0.9977
675     0.9969        0.9977
750     0.9972        0.9980
775     0.9973        0.9980
800     0.9974        0.9981
825     0.9975        0.9981
850     0.9975        0.9982
875     0.9976        0.9982
900     0.9977        0.9983
925     0.9977        0.9983
950     0.9978        0.9984
975     0.9978        0.9984
1000    0.9979        0.9984