					   SW388R7
Data Analysis &
 Computers II     Assumption of normality
    Slide 1




                     Assumption of normality

                         Transformations

                  Assumption of normality script

                        Practice problems

Slide 2: Assumption of Normality




                     Many of the statistical methods that we will apply
                      require the assumption that a variable or variables
                      are normally distributed.

                     With multivariate statistics, the assumption is that
                      the combination of variables follows a multivariate
                      normal distribution.

                     Since there is not a direct test for multivariate
                      normality, we generally test each variable
                      individually and assume that they are multivariate
                      normal if they are individually normal, though this is
                      not necessarily the case.
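
The caveat in the last point can be illustrated with a small simulation. This is only a sketch using a textbook counterexample, not part of the course materials: x is standard normal and y equals x with a randomly flipped sign, so each variable is normal on its own while the pair is not multivariate normal. The function name and seed are hypothetical.

```python
import random


def sign_flip_pairs(n, seed=0):
    """Return n pairs (x, y): x ~ N(0, 1), y = x or -x with equal chance."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        sign = rng.choice((-1.0, 1.0))
        pairs.append((x, sign * x))
    return pairs


pairs = sign_flip_pairs(10_000)

# x and y are each standard normal, but x + y is exactly 0 whenever the
# sign was flipped. A jointly normal pair could not have a sum that is
# exactly 0 about half of the time and non-zero otherwise.
share_zero_sums = sum(1 for x, y in pairs if x + y == 0.0) / len(pairs)
print("share of pairs with x + y == 0:", round(share_zero_sums, 2))
```

Testing x and y separately would pass both as normal, which is why the individual tests are a practical substitute and not a guarantee.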

Slide 3: Evaluating normality




                     There are both graphical and statistical methods for
                      evaluating normality.

                     Graphical methods include the histogram and
                      normality plot.

                     Statistical methods include diagnostic hypothesis
                      tests for normality, and a rule of thumb that says a
                      variable is reasonably close to normal if its skewness
                      and kurtosis have values between –1.0 and +1.0.

                     None of the methods is absolutely definitive.
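
The rule of thumb can be sketched in code. The formulas below are the usual bias-adjusted sample skewness and excess kurtosis; it is an assumption that they match what SPSS reports, and the function names and the sample list are hypothetical. The bounds of –1.0 and +1.0 are treated as inclusive here, which the slide does not specify.

```python
import math


def _z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def sample_skewness(values):
    """Bias-adjusted sample skewness (needs at least 3 values)."""
    n = len(values)
    return n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in _z_scores(values))


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    z4 = sum(z ** 4 for z in _z_scores(values))
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * z4 - correction


def within_rule_of_thumb(values, limit=1.0):
    """True when both skewness and kurtosis lie between -limit and +limit."""
    return (abs(sample_skewness(values)) <= limit
            and abs(sample_excess_kurtosis(values)) <= limit)


hours = [1, 1, 2, 2, 3, 50]  # hypothetical data with one very large value
print(sample_skewness(hours), sample_excess_kurtosis(hours))
print(within_rule_of_thumb(hours))
```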

Slide 4: Transformations




                     When a variable is not normally distributed, we can
                      create a transformed variable and test it for
                      normality. If the transformed variable is normally
                      distributed, we can substitute it in our analysis.

                     Three common transformations are: the logarithmic
                      transformation, the square root transformation, and
                      the inverse transformation.

                     All of these change the measuring scale on the
                      horizontal axis of a histogram to produce a
                      transformed variable that is mathematically
                      equivalent to the original variable.
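
The three transformations can be written as a minimal sketch, assuming every value is positive (the logarithm and the inverse are undefined at zero, and the logarithm and square root are undefined for negative values). The function names are hypothetical; the formulas correspond to LG10(x), SQRT(x), and 1/x.

```python
import math


def log10_transform(values):
    """Base-10 logarithm of each value; values must be greater than 0."""
    return [math.log10(v) for v in values]


def sqrt_transform(values):
    """Square root of each value; values must not be negative."""
    return [math.sqrt(v) for v in values]


def inverse_transform(values):
    """Reciprocal of each value; values must not be 0."""
    return [1.0 / v for v in values]


hours = [0.2, 5.5, 102.0]  # hypothetical values on an hours scale
print(log10_transform(hours))
print(sqrt_transform(hours))
print(inverse_transform(hours))
```

Note that the inverse reverses the ordering of the values: the largest original value becomes the smallest transformed value.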

Slide 5: When transformations do not work




                     When none of the transformations induces normality
                      in a variable, including that variable in the analysis
                      will reduce our effectiveness at identifying statistical
                      relationships, i.e. we lose power.

                     We do have the option of changing the way the
                      information in the variable is represented, e.g.
                      substitute several dichotomous variables for a single
                      metric variable.
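
One way to substitute dichotomous variables for a metric variable is to group the values into categories and create a 0/1 indicator for each category except a reference category. This is only a sketch: the cut points and the function name are hypothetical, and choosing meaningful categories is a substantive decision that the code does not make for you.

```python
def dummy_code(value, cut_points):
    """Return 0/1 indicators for a value grouped by ascending cut points.

    Values below the first cut point form the reference category and get
    all zeros; each higher category gets its own indicator.
    """
    # The category index is the number of cut points the value reaches.
    category = sum(1 for cut in cut_points if value >= cut)
    return [1 if category == k else 0
            for k in range(1, len(cut_points) + 1)]


cut_points = [5, 20]  # hypothetical: under 5, 5 up to 20, 20 or more
for hours in (2, 10, 50):
    print(hours, dummy_code(hours, cut_points))
```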

Slide 6: Problem 1




                  In the dataset GSS2000.sav, is the following
                  statement true, false, or an incorrect application of
                  a statistic? Use 0.01 as the level of significance.

                  Based on a diagnostic hypothesis test of normality,
                  total hours spent on the Internet is normally
                  distributed.

                  1.   True
                  2.   True with caution
                  3.   False
                  4.   Incorrect application of a statistic

Slide 7: Computing “Explore” descriptive statistics




                                             To compute the statistics
                                             needed for evaluating the
                                             normality of a variable, select
                                             the Explore… command from
                                             the Descriptive Statistics
                                             menu.

Slide 8: Adding the variable to be evaluated




                  First, click on the variable to be included in the
                  analysis to highlight it.

                  Second, click on the right arrow button to move the
                  highlighted variable to the Dependent List.

Slide 9: Selecting statistics to be computed




                                        To select the statistics for the
                                        output, click on the
                                        Statistics… command button.

Slide 10: Including descriptive statistics




                  First, click on the
                  Descriptives checkbox
                  to select it. Clear the
                  other checkboxes.




                                            Second, click on the
                                            Continue button to
                                            complete the request for
                                            statistics.

Slide 11: Selecting charts for the output




                                        To select the diagnostic charts
                                        for the output, click on the
                                        Plots… command button.

Slide 12: Including diagnostic plots and statistics




                  First, click on the None option button on the Boxplots
                  panel, since boxplots are not as helpful as other charts
                  in assessing normality.

                  Second, click on the Normality plots with tests checkbox
                  to include normality plots and the hypothesis tests for
                  normality.

                  Third, click on the Histogram checkbox to include a
                  histogram in the output. You may want to examine the
                  stem-and-leaf plot as well, though I find it less useful.

                  Finally, click on the Continue button to complete the
                  request.

Slide 13: Completing the specifications for the analysis




                                                Click on the OK button to
                                                complete the specifications
                                                for the analysis and request
                                                SPSS to produce the
                                                output.

Slide 14: The histogram




                  [Figure: histogram of TOTAL TIME SPENT ON THE INTERNET.
                  Vertical axis: Frequency. Std. Dev = 15.35, Mean = 10.7,
                  N = 93.]

                  An initial impression of the normality of the distribution
                  can be gained by examining the histogram.

                  In this example, the histogram shows a substantial
                  violation of normality caused by an extremely large value
                  in the distribution.

Slide 15: The normality plot




                  [Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE
                  INTERNET. Horizontal axis: Observed Value. Vertical axis:
                  Expected Normal.]

                  The problem with the normality of this variable’s
                  distribution is reinforced by the normality plot.

                  If the variable were normally distributed, the red dots
                  would fit the green line very closely. In this case, the
                  red points in the upper right of the chart indicate the
                  severe skewing caused by the extremely large data values.
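
The points of a normal Q-Q plot can be computed as a rough sketch: each sorted observation is paired with the standard normal value expected at its rank. The plotting position used here (Blom's formula) is an assumption about how the expected values are derived, and the function name is hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair each sorted observation with its expected standard normal value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, observed in enumerate(ordered, start=1):
        # Blom plotting position: an assumed convention, not taken
        # from the slides.
        proportion = (rank - 0.375) / (n + 0.25)
        points.append((observed, std_normal.inv_cdf(proportion)))
    return points


hours = [0.5, 1.0, 2.0, 3.5, 6.0, 10.0, 102.0]  # hypothetical, one extreme value
for observed, expected in normal_qq_points(hours):
    print(f"{observed:7.1f}  {expected:6.2f}")
```

For normally distributed data the pairs fall close to a straight line; an extreme value such as 102.0 lands far to the right of where the line would place it.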

Slide 16: The test of normality



Tests of Normality

                                    Kolmogorov-Smirnov (a)      Shapiro-Wilk
                                    Statistic   df    Sig.      Statistic   df    Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93    .000      .606        93    .000

a. Lilliefors Significance Correction




                  Problem 1 asks about the results of the test of normality. Since the sample
                  size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample
                  size were 50 or less, we would use the Shapiro-Wilk statistic instead.

                  The null hypothesis for the test of normality states that the actual
                  distribution of the variable is equal to the expected distribution, i.e., the
                  variable is normally distributed. Since the probability associated with the
                  test of normality (< 0.001) is less than or equal to the level of significance
                  (0.01), we reject the null hypothesis and conclude that total hours spent on
                  the Internet is not normally distributed. (Note: we report the probability as
                  <0.001 instead of .000 to be clear that the probability is not really zero.)

                  The answer to problem 1 is false.
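
The decision rule described above can be sketched as a small function. The cutoff of 50 cases and the "less than or equal to" comparison come from the slide; the function name is hypothetical, and because SPSS prints .000 for very small probabilities, the value 0.001 is passed in as an upper bound for the true probability.

```python
def normality_decision(n, ks_sig, sw_sig, alpha=0.01):
    """Pick the test by sample size and decide whether to reject normality.

    Returns (test name, True if the null hypothesis of normality is rejected).
    """
    if n > 50:
        test, sig = "Kolmogorov-Smirnov", ks_sig
    else:
        test, sig = "Shapiro-Wilk", sw_sig
    # Reject when the probability is less than or equal to alpha.
    return test, sig <= alpha


# Values from the Tests of Normality table: n = 93, both Sig. shown as .000.
test, rejected = normality_decision(n=93, ks_sig=0.001, sw_sig=0.001)
print(test, "-> not normally distributed" if rejected else "-> normal")
```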

Slide 17: The assumption of normality script




                  An SPSS script to produce all
                  of the output that we have
                  produced manually is
                  available on the course web
                  site.

                  After downloading the script,
                  run it to test the assumption
                  of normality.
                                                  Select Run Script…
                                                  from the Utilities
                                                  menu.

Slide 18: Selecting the assumption of normality script




                                       First, navigate to the folder containing your
                                       scripts and highlight the
                                       NormalityAssumptionAndTransformations.SBS
                                       script.




                                 Second, click on
                                 the Run button to
                                 activate the script.

Slide 19: Specifications for normality script




                  First, move variables from the list of variables in the
                  data set to the Variables to Test list box.

                  The default output is to do all of the transformations of
                  the variable. To exclude some transformations from the
                  calculations, clear the checkboxes.

                  Third, click on the OK button to run the script.

Slide 20: The test of normality




Tests of Normality

                                    Kolmogorov-Smirnov (a)      Shapiro-Wilk
                                    Statistic   df    Sig.      Statistic   df    Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93    .000      .606        93    .000

a. Lilliefors Significance Correction


                                        The script produces the same output that we
                                        computed manually; in this example, that is the
                                        tests of normality.

Slide 21: Problem 2




                  In the dataset GSS2000.sav, is the following
                  statement true, false, or an incorrect application of
                  a statistic?

                  Based on the rule of thumb for the allowable
                  magnitude of skewness and kurtosis, total hours
                  spent on the Internet is normally distributed.

                  1.   True
                  2.   True with caution
                  3.   False
                  4.   Incorrect application of a statistic

Slide 22: Table of descriptive statistics




                  To answer problem 2, we look at the values for skewness
                  and kurtosis in the Descriptives table.

                  Descriptives: TOTAL TIME SPENT ON THE INTERNET

                                                                   Statistic   Std. Error
                  Mean                                             10.731      1.5918
                  95% Confidence Interval for Mean, Lower Bound    7.570
                  95% Confidence Interval for Mean, Upper Bound    13.893
                  5% Trimmed Mean                                  8.295
                  Median                                           5.500
                  Variance                                         235.655
                  Std. Deviation                                   15.3511
                  Minimum                                          .2
                  Maximum                                          102.0
                  Range                                            101.8
                  Interquartile Range                              10.200
                  Skewness                                         3.532       .250
                  Kurtosis                                         15.614      .495



                      The skewness (3.532) and kurtosis (15.614) for the variable both
                      exceed the rule of thumb criterion of 1.0. The variable is not
                      normally distributed.

                      The answer to problem 2 is false.
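
The figures in the Descriptives table can be checked with a short sketch. The standard error formulas below are the usual ones for sample skewness and kurtosis; it is an assumption that SPSS uses exactly these, although with n = 93 they reproduce the .250 and .495 shown in the table. The function names are hypothetical.

```python
import math


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n = 93
skewness, kurtosis = 3.532, 15.614  # values from the Descriptives table

print(round(se_skewness(n), 3), round(se_kurtosis(n), 3))

# Rule of thumb: both statistics should lie between -1.0 and +1.0.
meets_rule = abs(skewness) <= 1.0 and abs(kurtosis) <= 1.0
print("meets rule of thumb:", meets_rule)
```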

Slide 23: Problem 3



                  In the dataset GSS2000.sav, is the following
                  statement true, false, or an incorrect application of
                  a statistic? Use 0.01 as the level of significance.
                  Based on a diagnostic hypothesis test of normality,
                  "total hours spent on the Internet" is not normally
                  distributed. A logarithmic transformation of "total
                  hours spent on the Internet" results in a variable that
                  is normally distributed.

                  1.   True
                  2.   True with caution
                  3.   False
                  4.   Incorrect application of a statistic

Slide 24: The test of normality


Tests of Normality

                                        Kolmogorov-Smirnov (a)      Shapiro-Wilk
                                        Statistic   df    Sig.      Statistic   df    Sig.
Logarithm of NETIME [LG10(NETIME)]      .047        93    .200*     .994        93    .951
Square Root of NETIME [SQRT(NETIME)]    .118        93    .003      .868        93    .000
Inverse of NETIME [1/(NETIME)]          .288        93    .000      .495        93    .000

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

                                           Problem 3 specifically asks about the results of the test of
                                           normality for the logarithmic transformation. Since our sample
                                           size is larger than 50, we use the Kolmogorov-Smirnov test.

                                           The null hypothesis for the Kolmogorov-Smirnov test of normality
                                           states that the actual distribution of the transformed variable is
                                           equal to the expected distribution, i.e., the transformed variable
                                           is normally distributed. Since the probability associated with the
                                           test of normality (0.200) is greater than the level of significance
                                           (0.01), we fail to reject the null hypothesis and conclude that the
                                           logarithmic transformation of total hours spent on the Internet is
                                           normally distributed.

                                           The first part of the statement, that the untransformed variable
                                           is not normally distributed, was established in problem 1.

                                           The answer to problem 3 is true.
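
The same comparison can be applied to all three transformations at once. This sketch copies the Kolmogorov-Smirnov probabilities from the table above as printed (so .000 is entered as 0.000 and .200 is a lower bound); the variable names are hypothetical.

```python
# Kolmogorov-Smirnov Sig. values from the Tests of Normality table.
ks_sig = {
    "LG10(NETIME)": 0.200,
    "SQRT(NETIME)": 0.003,
    "1/(NETIME)": 0.000,
}
alpha = 0.01  # level of significance given in problem 3

# A transformation is treated as normal when we fail to reject the null
# hypothesis, i.e. when its probability is greater than alpha.
normal_candidates = [name for name, sig in ks_sig.items() if sig > alpha]
print(normal_candidates)
```

Only the logarithmic transformation passes at the 0.01 level; the square root and inverse transformations are both rejected.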

Slide 25: Other problems on assumption of normality



                 A problem may ask about the assumption of normality
                  for a nominal level variable. The answer will be
                  “Incorrect application of a statistic” since there is
                  no expectation that a nominal variable be normal.

                 A problem may ask about the assumption of normality
                  for an ordinal level variable. If the variable or
                  transformed variable is normal, the correct answer to
                  the question is “True with caution” since we may be
                  required to defend treating an ordinal variable as
                  metric.

                 Questions will specify a level of significance to use and
                  the statistical evidence upon which you should base
                  your answer.

Slide 26: Steps in answering questions about the assumption of normality – question 1



                  The following is a guide to the decision process for answering
                  problems about the normality of a variable:

                  1.   Is the variable to be evaluated metric?
                         No:  Incorrect application of a statistic
                         Yes: go to step 2
                  2.   Does the statistical evidence support the normality assumption?
                         No:  False
                         Yes: go to step 3
                  3.   Are any of the metric variables ordinal level?
                         No:  True
                         Yes: True with caution
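
The decision process for a single variable can be sketched as a function that returns one of the four answer choices. The function and parameter names are hypothetical; the branches follow the guide above.

```python
def answer_normality_question(is_metric, evidence_supports_normality,
                              is_ordinal):
    """Return the answer choice for a question about a variable's normality."""
    if not is_metric:
        return "Incorrect application of a statistic"
    if not evidence_supports_normality:
        return "False"
    if is_ordinal:
        # Ordinal variables treated as metric must be defended.
        return "True with caution"
    return "True"


# Problem 1: a metric variable whose test of normality rejects normality.
print(answer_normality_question(is_metric=True,
                                evidence_supports_normality=False,
                                is_ordinal=False))
```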

Slide 27: Steps in answering questions about the assumption of normality – question 2



                        The following is a guide to the decision process for answering
                        problems about the normality of a transformation:

                  1.   Is the variable to be evaluated metric?
                         No:  Incorrect application of a statistic
                         Yes: go to step 2
                  2.   Does the statistical evidence support normality?
                         No:  go to step 3
                  3.   Does the statistical evidence for the transformation support normality?
                         No:  False
                         Yes: go to step 4
                  4.   Is either variable ordinal level?
                         No:  True
                         Yes: True with caution
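
The decision process for a transformation can be sketched the same way. The function and parameter names are hypothetical. The guide does not show what happens when the untransformed variable already supports normality, so this sketch returns None for that case instead of guessing an answer.

```python
def answer_transformation_question(is_metric, original_supports_normality,
                                   transformed_supports_normality,
                                   either_ordinal):
    """Return the answer choice for a question about a transformed variable."""
    if not is_metric:
        return "Incorrect application of a statistic"
    if original_supports_normality:
        # This branch is not shown in the guide; no answer is assumed.
        return None
    if not transformed_supports_normality:
        return "False"
    if either_ordinal:
        return "True with caution"
    return "True"


# Problem 3: the original variable is not normal, the logarithm is.
print(answer_transformation_question(is_metric=True,
                                     original_supports_normality=False,
                                     transformed_supports_normality=True,
                                     either_ordinal=False))
```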