1. Notes on Normality by qingyunliuliu


Do data belong to the Normal Distribution?
As you have probably figured out by now, the normal distribution plays a major role in
many types of probabilistic and statistical analyses. Some statistical procedures depend
heavily on the assumption of normality, and when the data cast doubt on this
assumption, these procedures should be avoided. It is therefore useful to have
techniques available for checking the validity of the normality assumption. This is
the objective of this short note.

The Normal Probability Plot

The following procedure helps judge qualitatively whether a sample was drawn from a
normal distribution. Here is a summary of the procedure:

   1. Place the values in the data set (X) into an ordered array. Call the smallest value
      in the ordered set X1 and the largest value Xn. Then the set becomes X1, X2… Xn.
   2. Calculate Fx/(n+1), the cumulative relative frequency for each value Xi, where Fx
      is the cumulative frequency of Xi. From the chart of the standard normal
      distribution or from Excel (using “=NORMSINV(Fx/(n+1))”), find the corresponding
      standard normal value Zi for each point in the ordered data set. In doing so we
      hypothesize that the data set was drawn from a normal distribution with some mean
      and standard deviation.
   3. Plot the pairs of points (Z, X) using the observed data values (Xi) on the vertical
      axis, and the associated Zi values on the horizontal axis.
   4. Inspect the points plotted for evidence of linearity (i.e. a straight line).
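
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not part of the original Excel workflow: the function name normal_scores and the sample values are made up, and Python's statistics.NormalDist().inv_cdf is used here in place of Excel's NORMSINV.

```python
from statistics import NormalDist

def normal_scores(data):
    """Return (ordered values, Z values) for a normal probability plot.

    Hypothetical helper: Z is the standard normal quantile of Fx/(n+1).
    """
    xs = sorted(data)                      # step 1: ordered array
    n = len(xs)
    std_normal = NormalDist()              # mean 0, standard deviation 1
    zs = []
    for x in xs:
        # Fx = cumulative frequency = number of observations <= x
        fx = sum(1 for v in xs if v <= x)
        zs.append(std_normal.inv_cdf(fx / (n + 1)))   # step 2
    return xs, zs

# Made-up sample of 19 distinct values, for illustration only
sample = [71, 48, 90, 55, 66, 52, 83, 57, 74, 61,
          69, 78, 64, 59, 86, 72, 67, 76, 63]
xs, zs = normal_scores(sample)
for x, z in zip(xs, zs):
    print(f"{x:5d} {z:9.5f}")
```

The pairs printed by the loop are the points (Z, X) of step 3. Plotting them with any charting tool and inspecting the plot for linearity completes step 4.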

Explanation:
The Z-score for any value of X is Z = (X – μ)/σ, so there is a linear relationship between
X and Z, namely X = σZ + μ. The empirical probability of observing a value no larger than
X in the sample is Fx/(n+1), where Fx is the cumulative frequency of X. If X is indeed
normally distributed, the Z value obtained from the standard normal distribution for the
cumulative probability Fx/(n+1) should be close to the Z-score of that X, which yields
a linear relationship with X. So an approximately linear relationship between X and the Z
values taken from the table supports the claim that X is normally distributed.

Example 1:
Suppose we wish to obtain the first and the second standard normal ordered values (Z1
and Z2) for a sample of 19 observations (each observation is different in value).

Obtaining Z1: Since Fx=1, P(Z<Z1) = 1/(19+1) = 1/20 = .05. Under the standard normal
distribution Z1 = -1.645 (note P(Z<-1.645) = .05).

Obtaining Z2: Since Fx=2, P(Z<Z2) = 2/(19+1) = 2/20 = .10. Under the standard normal
distribution Z2 = -1.282 (note P(Z<-1.282) = .10).

In a similar manner we complete the rest of the Zi values. Now the pairs (Zi, Xi) are
plotted, and if they are found to lie (approximately) along a straight line we can
reasonably say that the data belong to a normal distribution.

To determine whether or not there is a linear relationship between X and Z we can test the
correlation between them as follows:

H0: The data come from a normal distribution.
H1: The data do not come from a normal distribution.

Calculate the test statistic (R) as the sample correlation coefficient between X and Z.
Compare R to a critical value Rcr from a table of critical values (provided in Appendix 2;
the table was constructed from simulation results). Rcr depends on the sample size and the
significance level selected for the test. If R < Rcr there is sufficient evidence to reject H0
and conclude that the data are not normal at the alpha level of significance.
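
A minimal sketch of this decision rule in Python follows. The helper name plot_correlation and the data are assumptions made for illustration; the critical value .9479 is the Appendix 2 entry for n = 19 and alpha = .05.

```python
from math import sqrt
from statistics import NormalDist

def plot_correlation(xs_sorted, zs):
    """Sample (Pearson) correlation coefficient R between X and Z."""
    n = len(xs_sorted)
    mx = sum(xs_sorted) / n
    mz = sum(zs) / n
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs_sorted, zs))
    sxx = sum((x - mx) ** 2 for x in xs_sorted)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)

n = 19
# Z values for a sample of n distinct observations: quantiles of i/(n+1)
zs = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
# Made-up data lying exactly on the line X = 10*Z + 70 (illustration only)
xs = [10 * z + 70 for z in zs]

r = plot_correlation(xs, zs)
r_critical = 0.9479            # n = 19, alpha = .05
reject_h0 = r < r_critical     # reject normality only when R is too small
print(round(r, 4), reject_h0)
```

Because the made-up data are exactly linear in Z, R comes out at 1 and H0 is not rejected. Real data give R below 1, and the comparison with Rcr decides the test.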

Important Comment: If the Xi and Zi appear to form a linear relationship, then the line
intercept estimates the population mean (μ), and the line slope estimates the standard
deviation (σ).

Example 2
Test scores of 19 students in each of two classes were drawn. Some of the sorted scores
are shown below along with the calculated cumulative proportion from the sample
(Fx/(n+1)) and with the resulting Z values. Details can be found in the file Assess
Normal.

Partial set:
                   Order      Class I   Class II       Prob         Z value
                      1         48        47            0.05       -1.64485
                      2         52        54             0.1       -1.28155
                      3         55        58            0.15       -1.03643
                      4         57        61             0.2       -0.84162

After the Z values were derived, the following two graphs were plotted.
[Figure: normal probability plot for Class I, with Z on the horizontal axis (-2 to 2) and X on the vertical axis (40 to 90)]



Conclusion: The Class I scores appear to have been produced by a normal distribution.
From the graph it seems μ = 65 and σ = (83 – 47)/(1.645 – (-1.645)) = 10.94.
Now observe the probability plot for Class II.
[Figure: normal probability plot for Class II, with Z on the horizontal axis (-2 to 2) and X on the vertical axis (40 to 90)]


The result is unclear. Although there seems to be some curvature in the line, the
“non-normality” does not appear to be too severe. Since the sample size is only 19, one
should not judge the distribution to be non-normal from the plot alone. Let us proceed by
testing the correlation as explained above (we'll run the correlation test for both classes):

H0: The data come from a normal distribution.
H1: The data do not come from a normal distribution.

The test statistic calculated with Excel for Class I : R = .999
The test statistic calculated with Excel for Class II: R =.959
The critical value for n=19, and alpha = .05 is .9479

There is insufficient evidence to reject the normal distribution at the 5% significance level
for both classes (since .999 > .9479 and .959 > .9479). To estimate μ and σ we run a linear
regression to construct the best-fit line, which results in the equation
X = 10.89Z + 70.684. So μ ≅ 70.7 and σ ≅ 10.89 (see the Excel file).
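
The regression step can be sketched as an ordinary least-squares fit of X on Z. The full class scores are in the Excel file and are not reproduced here, so the points below are generated to lie on the reported line purely for illustration; fit_line is a hypothetical helper name.

```python
from statistics import NormalDist

def fit_line(zs, xs):
    """Least-squares line X = slope*Z + intercept; returns (slope, intercept)."""
    n = len(zs)
    mz = sum(zs) / n
    mx = sum(xs) / n
    slope = (sum((z - mz) * (x - mx) for z, x in zip(zs, xs))
             / sum((z - mz) ** 2 for z in zs))
    intercept = mx - slope * mz
    return slope, intercept

n = 19
zs = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
# Illustration only: points placed exactly on X = 10.89*Z + 70.684
xs = [10.89 * z + 70.684 for z in zs]

sigma_hat, mu_hat = fit_line(zs, xs)   # slope estimates sigma, intercept estimates mu
print(round(mu_hat, 3), round(sigma_hat, 2))
```

With real scores the points scatter around the line, and the fitted slope and intercept serve as the estimates of σ and μ.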

The following example demonstrates how to construct a probability plot when the sample
contains repeated values (which did not occur in the previous example).

Example 3
To help make a decision about an expansion plan, the president of a music company needs
to know how many CDs teenagers buy annually. Accordingly, he commissions a survey
of 250 teens, in which they are asked to report how many CDs they purchased in the
previous 12 months. Can we assume that the number of CDs bought annually by a teenager is
normally distributed?
Solution
The following table summarizes the data (see the file AssessNormal1 – the Probability
Plot sheet):

             X           f          Fx      Fx/(n+1)       Z
              6          1           1      0.003984   -2.65342
              8          1           2      0.007968   -2.41037
              9          7           9      0.035857   -1.80093
             10         10          19      0.075697   -1.43462
             11         16          35      0.139442   -1.08283
             12         26          61      0.243028     -0.6966
             13         23          84      0.334661   -0.42708
             14         25         109      0.434263   -0.16553
             15         29         138      0.549801   0.125158
             16         28         166      0.661355   0.416163
             17         26         192       0.76494   0.722285
             18         29         221      0.880478    1.17738
             19         11         232      0.924303   1.434623
             20         11         243      0.968127   1.853959
             21          4         247      0.984064   2.146006
             22          1         248      0.988048   2.258663
             23          1         249      0.992032   2.410372
             26          1         250      0.996016   2.653417
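
The Fx, Fx/(n+1) and Z columns of this table can be recomputed from the X and f columns alone. The sketch below does so in Python; it assumes statistics.NormalDist().inv_cdf as a stand-in for Excel's normsinv, and the variable names are illustrative.

```python
from statistics import NormalDist

# (X, f) pairs copied from the table above
freq_table = [(6, 1), (8, 1), (9, 7), (10, 10), (11, 16), (12, 26),
              (13, 23), (14, 25), (15, 29), (16, 28), (17, 26), (18, 29),
              (19, 11), (20, 11), (21, 4), (22, 1), (23, 1), (26, 1)]

n = sum(f for _, f in freq_table)      # sample size (250)
rows = []
cumulative = 0
for x, f in freq_table:
    cumulative += f                    # Fx: cumulative frequency up to x
    p = cumulative / (n + 1)           # Fx/(n+1)
    z = NormalDist().inv_cdf(p)        # standard normal value for p
    rows.append((x, f, cumulative, p, z))

for x, f, fx, p, z in rows:
    print(f"{x:3d} {f:4d} {fx:5d} {p:10.6f} {z:10.5f}")
```

Each distinct X value contributes one point (Z, X) to the probability plot, however many times it occurs in the sample.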

Explanations:
    • The column “X” represents the number of CDs purchased by a teenager annually.
    • The column “f” is the frequency of X (it counts how many times each number
      appears in the sample). For example, the value 11 appears 16 times (16 teenagers
      purchased 11 CDs).
    • The column “Fx” is the cumulative frequency. For example, 10 or fewer
      CDs per person appear 19 times (1+1+7+10=19).
    • The column “Fx/(n+1)” is the empirical cumulative relative frequency. For
      example, F10/(250+1) = 19/251 = .075697.
    • “Z” is found with “normsinv” as before.
Now we can draw the graph of Z against X.
[Figure: normal probability plot of the CD data]




Interpretation:
The graph raises some suspicion with regard to the normality of the CD distribution,
because the two ends are curved. Yet the amount of deviation from normality
needs to be checked formally. The correlation test used above yields the following results:
R = .990375; Rcr = .9943 (for n = 250, alpha = .05). Since R < Rcr, there is sufficient
evidence to reject normality at the 5% level of significance.

In what follows we present a few hypothesis-testing procedures designed to test the
normality of a data set analytically.

The Goodness of Fit Chi Squared Test

Example 4
Re-solve example 3 using the goodness of fit Chi square test at 5% significance level.

Solution:
First, determine Z values that comply with the rule of 5 (the expected number of
observations that fall in each interval should be at least 5). The following table
shows such a selection of Z values, along with additional information:
 i         Intervals         Probability   Expected (Ei)   Actual (Fi)
 1          (z ≤ -2)          0.02275        5.6875             2
 2       (-2 < z ≤ -1)       0.135905       33.97625           33
 3       (-1 < z ≤ 0)        0.341345       85.33625           74
 4        (0 < z ≤ 1)        0.341345       85.33625          112
 5        (1 < z ≤ 2)        0.135905       33.97625           26
 6           (z > 2)          0.02275        5.6875             3
Explanations:
    • Determine the probabilities for the ranges selected:
      P(Z ≤ -2) = .02275;
      P(-2 < Z ≤ -1) = .1359;
      Comment: The Z values (-2, -1, 0, 1, 2) were selected such that when the interval
      probabilities are calculated, the expected number of observations in each one (Ei)
      will be at least 5. See details below. A symmetrical selection of Z values is
      preferable.
    • The expected values (Ei) are calculated as follows:
      First interval:  E1 = P(Z ≤ -2)(250) = 5.6875
      Second interval: E2 = P(-2 < Z ≤ -1)(250) = 33.97625
      …and so on…
    • The actual frequency (Fi) counts the number of sample observations in each
      interval. You first need to transform the observation values Xi to their
      corresponding Z-scores using the sample mean and sample standard deviation,
      Zi = (Xi – 14.98)/3.14, and then count how many Z values belong to each interval.
      For example, in the interval Z ≤ -2 there are two Z-scores, so F1 = 2.
    • Test the following hypotheses:
      H0: The distribution is normal with μ = 14.98 and σ = 3.14
      H1: The distribution is not the above


The test is performed using a Chi-square distribution. Use Ei and Fi to calculate
the Chi-square statistic:

\[
\chi^2 = \sum_{i=1}^{k} \frac{(E_i - F_i)^2}{E_i}
       = \frac{(5.6875 - 2)^2}{5.6875} + \frac{(33.97625 - 33)^2}{33.97625}
       + \frac{(85.33625 - 74)^2}{85.33625} + \dots = 15.39
\]


The test is performed as follows: If χ² > χ²(α, k-1-L), reject H0 (where k is the number of
intervals and L is the number of parameters estimated; since we estimate both μ and σ,
L=2).
Let the significance level be .05. This rule translates to a critical value of χ²(.05, 6-1-2) =
7.8147 (a value found in the chi-square table or by using the Excel function
=chiinv(.05,3)).

Conclusion: Since 15.39 > 7.8147, there is sufficient evidence to reject H0 at the 5%
significance level. The distribution is not normal with μ = 14.98 and σ = 3.14.
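
The whole calculation of Example 4 can be retraced with a short Python sketch. It assumes the frequency table of Example 3, the sample mean 14.98 and standard deviation 3.14 quoted above, and the critical value 7.8147 taken from the text (the standard library has no chi-square quantile function). Variable names are illustrative.

```python
from statistics import NormalDist

# (X, f) pairs from the CD survey of Example 3
freq_table = [(6, 1), (8, 1), (9, 7), (10, 10), (11, 16), (12, 26),
              (13, 23), (14, 25), (15, 29), (16, 28), (17, 26), (18, 29),
              (19, 11), (20, 11), (21, 4), (22, 1), (23, 1), (26, 1)]
n = sum(f for _, f in freq_table)
mean, sd = 14.98, 3.14                 # sample estimates quoted in the text
cuts = [-2, -1, 0, 1, 2]               # interval boundaries on the Z scale

# Interval probabilities under the standard normal distribution
cum = [0.0] + [NormalDist().cdf(c) for c in cuts] + [1.0]
expected = [n * (cum[i + 1] - cum[i]) for i in range(len(cuts) + 1)]

# Observed counts: interval index = number of boundaries strictly below z
observed = [0] * (len(cuts) + 1)
for x, f in freq_table:
    z = (x - mean) / sd
    observed[sum(1 for c in cuts if z > c)] += f

chi2 = sum((e - o) ** 2 / e for e, o in zip(expected, observed))
critical = 7.8147                      # chi-square, alpha = .05, 6-1-2 = 3 d.f.
print(observed, round(chi2, 2), chi2 > critical)
```

Run with unrounded interval probabilities, the statistic comes out at roughly 15.4, in line with the 15.39 reported above, and exceeds the critical value.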

Anderson Darling Test

This is a powerful test that works well even on small samples (n ≤ 25). The test is
performed on the ordered data set (X1 ≤ X2 ≤ … ≤ Xn). It applies to any distribution.
Specifically for the normal case define:

\[
A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} \Big[ (2i - 1)\ln \Phi(z_i) + (2(n - i) + 1)\ln\big(1 - \Phi(z_i)\big) \Big]
\]

Here zi is calculated by

\[
z_i = \frac{x_i - \bar{x}}{s}
\]

where x̄ and s are the sample mean and standard deviation respectively. Also
Φ(zi) = Pr(Z < zi) under the standard normal distribution.

Now calculate the statistic (A*)², the adjustment of A² to the sample size (especially
important for small samples), by

\[
(A^*)^2 = A^2 \left( 1 + \frac{0.75}{n} + \frac{2.25}{n^2} \right)
\]

If (A*)² > A²crit the hypothesis of normality is rejected. Below you can view a few critical
values A²crit.


                        0.1            0.05           0.025           0.01
             A2 crit     0.631          0.752           0.873          1.035

Example 6
For the data used in example 3 here is a summary of the calculations:
\[
A^2 = -250 - \frac{1}{250}\Big[ (2(1)-1)\ln\Phi(z_1) + (2(250-1)+1)\ln\big(1-\Phi(z_1)\big)
      + (2(2)-1)\ln\Phi(z_2) + (2(250-2)+1)\ln\big(1-\Phi(z_2)\big) + \dots \Big] = 1.42
\]
\[
(A^*)^2 = 1.42\left(1 + \frac{0.75}{250} + \frac{2.25}{250^2}\right) = 1.43
\]
Find details in the file AssessNormal1 – Anderson Darling CD example.
A²crit for the 5% significance level = .752
Since 1.43 > .752 there is sufficient evidence at the 5% significance level to reject the null
hypothesis. The sample does not appear to come from a normal distribution.
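
A minimal Python sketch of the Anderson Darling calculation for the normal case follows. The function name and both data sets are invented for illustration (they are not the CD data), and statistics.stdev is assumed to match the sample standard deviation s used above.

```python
from math import log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """Return (A2, A2_star) for a test of normality; hypothetical helper."""
    xs = sorted(data)                        # ordered data set
    n = len(xs)
    x_bar, s = mean(xs), stdev(xs)           # sample mean and standard deviation
    phi = NormalDist().cdf                   # Phi(z) = Pr(Z < z)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        p = phi((x - x_bar) / s)
        total += (2 * i - 1) * log(p) + (2 * (n - i) + 1) * log(1 - p)
    a2 = -n - total / n
    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)   # sample-size adjustment
    return a2, a2_star

# Made-up samples: evenly spaced normal quantiles and exponential quantiles
normal_like = [NormalDist(70, 10).inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [-log(1 - (i - 0.5) / 200) for i in range(1, 201)]

a2_n, a2s_n = anderson_darling_normal(normal_like)
a2_k, a2s_k = anderson_darling_normal(skewed)
crit_05 = 0.752                              # critical value at alpha = .05
print(a2s_n > crit_05, a2s_k > crit_05)
```

The first sample is built to look normal and stays below the critical value, while the strongly skewed second sample exceeds it and normality is rejected.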

The Lilliefors Test

This hypothesis test method is known to give very strong results for samples of size
n2000. As in the normal plot approach, here too we calculate cumulative probabilities.
Yet here we compare probabilities for a known normal distribution with their sample
based empirical counterparts.
Here is a summary of the procedure:

   1. Determine the mean and standard deviation of the normal distribution under
      investigation. Set up the hypotheses:
               H0: The distribution is normal with μ and σ.
               H1: The distribution is not normal.
   2. Place the values in the data set (X) into an ordered array.
   3. Find the corresponding standard normal Zi values for each point in the ordered
      data set using the hypothesized mean and standard deviation. That is,
      Zi = (Xi – μ)/σ.
   4. Determine the cumulative normal probabilities F(Zi) = P(Z<Zi) for each Zi value
      found in part “3”.
   5. Determine the cumulative sample distribution S(Xi) = Fx/n for each point in the
      sample.
   6. Calculate the largest absolute difference (D) between F(*) and S(*).
      D = max{|F(Z1)-S(X1)|, |F(Z2)-S(X2)|…, |F(Zn)-S(Xn)|}
   7. Perform the test as follows: If D>Dcr, reject the null hypothesis. Otherwise, do not
      reject the null hypothesis. Dcr is a critical value determined by alpha and the
      sample size, and is provided by the Lilliefors table (see below).
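
Steps 2 to 6 can be sketched in Python as follows. The function name and the three-point data set are illustrative assumptions, and the critical value Dcr still has to be read from the Lilliefors table, which is not encoded here.

```python
from statistics import NormalDist

def max_cdf_distance(data, mu, sigma):
    """Largest absolute difference D between F(Zi) and S(Xi) = Fx/n."""
    xs = sorted(data)                              # step 2: ordered array
    n = len(xs)
    std_normal = NormalDist()
    d = 0.0
    for x in xs:
        z = (x - mu) / sigma                       # step 3: Z value
        f_z = std_normal.cdf(z)                    # step 4: F(Zi) = P(Z < Zi)
        s_x = sum(1 for v in xs if v <= x) / n     # step 5: S(Xi) = Fx/n
        d = max(d, abs(f_z - s_x))                 # step 6: running maximum
    return d

# Tiny made-up data set with hypothesized mean 0 and standard deviation 1
d = max_cdf_distance([-1, 0, 1], mu=0, sigma=1)
print(round(d, 4))
```

Step 7 then compares D with the tabulated Dcr for the chosen alpha and sample size.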

The Lilliefors method was applied to a data set of n = 2000, that can be found in
AssessNormal1 – Lilliefors; all the calculations were performed in Excel.

Appendix 1: The Lilliefors Table

Appendix 2

The Critical value of correlation for the probability plot normality test
N          alpha = 0.01    alpha = 0.05
     3       0.8687          0.8790
     4       0.8234          0.8666
     5       0.8240          0.8786
     6       0.8351          0.8880
     7       0.8474          0.8970
     8       0.8590          0.9043
     9       0.8689          0.9115
    10       0.8765          0.9173
    11       0.8838          0.9223
    12       0.8918          0.9267
    13       0.8974          0.9310
    14       0.9029          0.9343
    15       0.9080          0.9376
    16       0.9121          0.9405
    17       0.9160          0.9433
    18       0.9196          0.9452
    19       0.9230          0.9479
    20       0.9256          0.9498
    21       0.9285          0.9515
    22       0.9308          0.9535
    23       0.9334          0.9548
    24       0.9356          0.9564
    25       0.9370          0.9575
    26       0.9393          0.9590
    27       0.9413          0.9600
    28       0.9428          0.9615
    29       0.9441          0.9622
    30       0.9462          0.9634
    31       0.9476          0.9644
    32       0.9490          0.9652
    33       0.9505          0.9661
    34       0.9521          0.9671
    35       0.9530          0.9678
    36       0.9540          0.9686
    37       0.9551          0.9693
    38       0.9555          0.9700
    39       0.9568          0.9704
    40       0.9576          0.9712
    41       0.9589          0.9719
    42       0.9593          0.9723
    43       0.9609          0.9730
    44       0.9611          0.9734
    45       0.9620          0.9739
    46       0.9629          0.9744
    47       0.9637          0.9748
    48       0.9640          0.9753
    49       0.9643          0.9758
    50       0.9654          0.9761
    55       0.9683          0.9781
    60       0.9706          0.9797
    65       0.9723          0.9809
    70       0.9742          0.9822
    75       0.9758          0.9831
    80       0.9771          0.9841
    85       0.9784          0.9850
    90       0.9797          0.9857
    95       0.9804          0.9864
   100       0.9814          0.9869
   110       0.9830          0.9881
   120       0.9841          0.9889
   130       0.9854          0.9897
   140       0.9865          0.9904
   150       0.9871          0.9909
   160       0.9879          0.9915
   170       0.9887          0.9919
   180       0.9891          0.9923
   190       0.9897          0.9927
   200       0.9903          0.9930
   210       0.9907          0.9933
   220       0.9910          0.9936
   230       0.9914          0.9939
   240       0.9917          0.9941
   250       0.9921          0.9943
   260       0.9924          0.9945
   270       0.9926          0.9947
   280       0.9929          0.9949
   290       0.9931          0.9951
   300       0.9933          0.9952
   310       0.9936          0.9954
   320       0.9937          0.9955
   330       0.9939          0.9956
   340       0.9941          0.9957
   350       0.9942          0.9958
   360       0.9944          0.9959
   370       0.9945          0.9960
   380       0.9947          0.9961
   390       0.9948          0.9962
   400       0.9949          0.9963
   410       0.9950          0.9964
   420       0.9951          0.9965
   430       0.9953          0.9966
   440       0.9954          0.9966
   450       0.9954          0.9967
   460       0.9955          0.9968
   470       0.9956          0.9968
   480       0.9957          0.9969
   490       0.9958          0.9969
   500       0.9959          0.9970
   525       0.9961          0.9972
   550       0.9963          0.9973
   575       0.9964          0.9974
   600       0.9965          0.9975
   625       0.9967          0.9976
   650       0.9968          0.9977
   675       0.9969          0.9977
   750       0.9972          0.9980
   775       0.9973          0.9980
   800       0.9974          0.9981
   825       0.9975          0.9981
   850       0.9975          0.9982
   875       0.9976          0.9982
   900       0.9977          0.9983
   925       0.9977          0.9983
   950       0.9978          0.9984
   975       0.9978          0.9984
  1000       0.9979          0.9984

								