POSTED ON: 8/31/2010 Public Domain
SW388R7 Data Analysis & Computers II

Assumption of normality (Slide 1)

Assumption of normality
Transformations
Assumption of normality script
Practice problems

Assumption of Normality (Slide 2)

Many of the statistical methods that we will apply require the assumption that a variable or variables are normally distributed.

With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution.

Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.

Evaluating normality (Slide 3)

There are both graphical and statistical methods for evaluating normality.

Graphical methods include the histogram and normality plot.

Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0.

None of the methods is absolutely definitive.

Transformations (Slide 4)

When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.

Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation.

All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.

When transformations do not work (Slide 5)

When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e., we lose power.
We do have the option of changing the way the information in the variable is represented, e.g., substituting several dichotomous variables for a single metric variable.

Problem 1 (Slide 6)

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Use 0.01 as the level of significance.

Based on a diagnostic hypothesis test of normality, total hours spent on the Internet is normally distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

Computing “Explore” descriptive statistics (Slide 7)

To compute the statistics needed for evaluating the normality of a variable, select the Explore… command from the Descriptive Statistics menu.

Adding the variable to be evaluated (Slide 8)

First, click on the variable to be included in the analysis to highlight it.

Second, click on the right arrow button to move the highlighted variable to the Dependent List.

Selecting statistics to be computed (Slide 9)

To select the statistics for the output, click on the Statistics… command button.

Including descriptive statistics (Slide 10)

First, click on the Descriptives checkbox to select it. Clear the other checkboxes.

Second, click on the Continue button to complete the request for statistics.

Selecting charts for the output (Slide 11)

To select the diagnostic charts for the output, click on the Plots… command button.

Including diagnostic plots and statistics (Slide 12)

First, click on the None option button on the Boxplots panel, since boxplots are not as helpful as other charts in assessing normality.

Finally, after setting the remaining options, click on the Continue button to complete the request.
Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality.

Third, click on the Histogram checkbox to include a histogram in the output. You may want to examine the stem-and-leaf plot as well, though I find it less useful.

Completing the specifications for the analysis (Slide 13)

Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.

The histogram (Slide 14)

[Figure: Histogram of TOTAL TIME SPENT ON THE INTERNET; vertical axis: Frequency. Std. Dev = 15.35, Mean = 10.7, N = 93.00]

An initial impression of the normality of the distribution can be gained by examining the histogram.

In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.

The normality plot (Slide 15)

[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET; horizontal axis: Observed Value; vertical axis: Expected Normal]

The problem with the normality of this variable’s distribution is reinforced by the normality plot.

If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.

The test of normality (Slide 16)

Tests of Normality

                                    Kolmogorov-Smirnov(a)      Shapiro-Wilk
                                    Statistic  df  Sig.        Statistic  df  Sig.
TOTAL TIME SPENT ON THE INTERNET    .246       93  .000        .606       93  .000

a. Lilliefors Significance Correction

Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead.
The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (< 0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as < 0.001 instead of .000 to be clear that the probability is not really zero.)

The answer to problem 1 is false.

The assumption of normality script (Slide 17)

An SPSS script to produce all of the output that we have produced manually is available on the course web site.

After downloading the script, run it to test the assumption of normality.

Select Run Script… from the Utilities menu.

Selecting the assumption of normality script (Slide 18)

First, navigate to the folder containing your scripts and highlight the NormalityAssumptionAndTransformations.SBS script.

Second, click on the Run button to activate the script.

Specifications for normality script (Slide 19)

First, move variables from the list of variables in the data set to the Variables to Test list box.

Second, note that the default output is to do all of the transformations of the variable. To exclude some transformations from the calculations, clear the checkboxes.

Third, click on the OK button to run the script.

The test of normality (Slide 20)

Tests of Normality

                                    Kolmogorov-Smirnov(a)      Shapiro-Wilk
                                    Statistic  df  Sig.        Statistic  df  Sig.
TOTAL TIME SPENT ON THE INTERNET    .246       93  .000        .606       93  .000

a. Lilliefors Significance Correction

The script produces the same output that we computed manually; in this example, the tests of normality.
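The decision rule applied to Problem 1 can be written out as a short Python sketch. This is only an illustration of the reasoning in the slides, not part of the SPSS script: the function names are hypothetical, and the p-value of 0.0005 is an assumed stand-in for the ".000" (i.e., < 0.001) that SPSS prints.

```python
def choose_normality_test(n: int) -> str:
    """Pick the test the slides use: K-S when n > 50, Shapiro-Wilk when n <= 50."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"


def is_normal(p_value: float, alpha: float) -> bool:
    """Apply the decision rule for the null hypothesis of normality.

    The null hypothesis says the variable is normally distributed.
    We reject it when p <= alpha, and fail to reject it when p > alpha.
    """
    return p_value > alpha


# Problem 1: n = 93 and level of significance 0.01, both taken from the slides.
n = 93
alpha = 0.01
# Hypothetical stand-in: SPSS shows Sig. = .000, which only tells us p < 0.001.
p_value = 0.0005

print(choose_normality_test(n))   # which test to read from the output table
print(is_normal(p_value, alpha))  # False means the null hypothesis is rejected
```

With these inputs the sketch selects the Kolmogorov-Smirnov test and rejects normality, which matches the answer of "false" for Problem 1.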
Problem 2 (Slide 21)

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic?

Based on the rule of thumb for the allowable magnitude of skewness and kurtosis, total hours spent on the Internet is normally distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

Table of descriptive statistics (Slide 22)

Descriptives: TOTAL TIME SPENT ON THE INTERNET

                                                 Statistic   Std. Error
Mean                                             10.731      1.5918
95% Confidence Interval for Mean   Lower Bound   7.570
                                   Upper Bound   13.893
5% Trimmed Mean                                  8.295
Median                                           5.500
Variance                                         235.655
Std. Deviation                                   15.3511
Minimum                                          .2
Maximum                                          102.0
Range                                            101.8
Interquartile Range                              10.200
Skewness                                         3.532       .250
Kurtosis                                         15.614      .495

To answer problem 2, we look at the values for skewness and kurtosis in the Descriptives table.

The skewness and kurtosis for the variable both exceed the rule-of-thumb criterion of 1.0. The variable is not normally distributed.

The answer to problem 2 is false.

Problem 3 (Slide 23)

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Use 0.01 as the level of significance.

Based on a diagnostic hypothesis test of normality, "total hours spent on the Internet" is not normally distributed. A logarithmic transformation of "total hours spent on the Internet" results in a variable that is normally distributed.

1. True
2. True with caution
3. False
4. Incorrect application of a statistic

The test of normality (Slide 24)

Tests of Normality

                                        Kolmogorov-Smirnov(a)      Shapiro-Wilk
                                        Statistic  df  Sig.        Statistic  df  Sig.
Logarithm of NETIME [LG10(NETIME)]      .047       93  .200*       .994       93  .951
Square Root of NETIME [SQRT(NETIME)]    .118       93  .003        .868       93  .000
Inverse of NETIME [1/(NETIME)]          .288       93  .000        .495       93  .000

*. This is a lower bound of the true significance.
a.
Lilliefors Significance Correction

Problem 3 specifically asks about the results of the test of normality for the logarithmic transformation. Since our sample size is larger than 50, we use the Kolmogorov-Smirnov test.

The null hypothesis for the Kolmogorov-Smirnov test of normality states that the actual distribution of the transformed variable is equal to the expected distribution, i.e., the transformed variable is normally distributed. Since the probability associated with the test of normality (0.200) is greater than the level of significance (0.01), we fail to reject the null hypothesis and conclude that the logarithmic transformation of total hours spent on the Internet is normally distributed.

The answer to problem 3 is true.

Other problems on assumption of normality (Slide 25)

A problem may ask about the assumption of normality for a nominal level variable. The answer will be “Incorrect application of a statistic” since there is no expectation that a nominal variable be normal.

A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.

Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.

Steps in answering questions about the assumption of normality – question 1 (Slide 26)

The following is a guide to the decision process for answering problems about the normality of a variable:

Is the variable to be evaluated metric?
  No: Incorrect application of a statistic
  Yes: Does the statistical evidence support the normality assumption?
    No: False
    Yes: Are any of the metric variables ordinal level?
      No: True
      Yes: True with caution

Steps in answering questions about the assumption of normality – question 2 (Slide 27)

The following is a guide to the decision process for answering problems about the normality of a transformation:

Is the variable to be evaluated metric?
  No: Incorrect application of a statistic
  Yes: Does the statistical evidence support normality?
    No: Does the statistical evidence for the transformation support normality?
      No: False
      Yes: Is either variable ordinal level?
        No: True
        Yes: True with caution
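The first decision guide (question 1, normality of a single variable) can be expressed as a small Python function. This is a minimal sketch of that flowchart only; the function and parameter names are hypothetical, and the caller is assumed to have already judged the level of measurement and read the statistical evidence from the SPSS output.

```python
def answer_normality_question(is_metric: bool,
                              evidence_supports_normality: bool,
                              is_ordinal: bool) -> str:
    """Walk the question 1 decision guide and return the answer choice.

    is_metric: True if the variable is metric, or an ordinal variable
        being treated as metric; False for a nominal variable.
    evidence_supports_normality: outcome of the statistical evidence named
        in the problem (hypothesis test or skewness/kurtosis rule of thumb).
    is_ordinal: True if the variable is ordinal level.
    """
    if not is_metric:
        return "Incorrect application of a statistic"
    if not evidence_supports_normality:
        return "False"
    if is_ordinal:
        return "True with caution"
    return "True"


# Problem 1: the variable is metric, but the test rejects normality.
print(answer_normality_question(True, False, False))
```

For Problem 1 the function returns "False", in line with the worked answer. The guide for question 2 adds a second check on the transformed variable and is not covered by this sketch.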