              Regression Analysis with SPSS
             Robert A. Yaffee, Ph.D.
 Statistics, Mapping and Social Science Group
         Academic Computing Services
        Information Technology Services
              New York University
         Office: 75 Third Ave Level C3
                Tel: 212.998.3402
             E-mail: yaffee@nyu.edu
                   February 2004




                                                1
               Outline
1. Conceptualization
2. Schematic diagrams of linear regression processes
3. Using SPSS to plot and test relationships for linearity
4. Transforming nonlinear relationships to linear ones
5. The General Linear Model
6. Derivation of the sums of squares and ANOVA
7. Derivation of the intercept and regression coefficients
8. The prediction interval and its derivation
9. Model assumptions
    1. Explanation
    2. Testing
    3. Assessment
10. Alternatives when assumptions are unfulfilled

                                            2
  Conceptualization of
  Regression Analysis
• Hypothesis testing
• Path Analytical Decomposition
  of effects




                                  3
     Hypothesis Testing

• For example, hypothesis 1: X is
  statistically significantly related to
  Y.
   – The relationship is positive (as X
     increases, Y increases) or negative
     (as X increases, Y decreases).
   – The magnitude of the relationship is
     small, medium, or large.
     If the magnitude is small, then a unit
     change in X is associated with only a
     small change in Y.


                                          4
          Regression Analysis
Have a clear notion of what you can and
  cannot do with regression analysis

• Conceptualization
  – A Path Model of a Regression
    Analysis

         Path Diagram of a Linear Regression Analysis

   [Diagram: predictors X1, X2, and X3 each send an arrow into Y;
    an error term also feeds into Y]

                 Y_i = k + b_1 x_1 + b_2 x_2 + b_3 x_3 + e_i

                                                       5
                            A Path Analysis
              Decomposition of Effects into Direct,
              Indirect, Spurious, and Total Effects

   [Path diagram: exogenous variable X1; endogenous variables X2, Y1, Y2, and Y3,
    each with its own error term; labeled paths A–F connect X1 and X2, through
    Y1 and Y2, to the final outcome Y3]
      Direct effects: paths C, E, F
      Indirect effects: paths AC, BE, DF
      Total effects: sum of direct and indirect effects
      Spurious effects are due to common (antecedent) causes




In a path analysis, Yi is endogenous. It is
the outcome of several paths.
Direct effects on Y3: C, E, F
Indirect effects on Y3: BF, BDF
Total effects = direct + indirect effects

                                                                       6
        Interaction Analysis




   [Diagram: X1 and X2 each send a direct arrow into Y (paths A and B);
    the product X1*X2 carries the interaction path C into Y]

            Y = K + A X1 + B X2 + C X1*X2




Interaction coefficient: C
X1 and X2 must be in model for
interaction to be properly specified.
                                           7
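The interaction specification can also be checked outside SPSS. Below is a minimal sketch in Python/statsmodels (synthetic data; not part of the original SPSS workflow) of the model Y = K + A X1 + B X2 + C X1*X2:

# Minimal sketch (Python/statsmodels, synthetic data): fitting the interaction
# model Y = K + A*X1 + B*X2 + C*X1*X2.  Both main effects stay in the model so
# the interaction coefficient C is properly specified.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (1.0 + 0.5 * df.x1 + 0.3 * df.x2
           + 0.8 * df.x1 * df.x2 + rng.normal(scale=0.5, size=n))

# 'x1 * x2' expands to x1 + x2 + x1:x2, so the main effects enter automatically.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)              # Intercept ~ K, x1 ~ A, x2 ~ B, x1:x2 ~ C
print(fit.pvalues["x1:x2"])    # significance test for the interaction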
 A Precursor to Modeling
     with Regression
• Data Exploration: Run a
  scatterplot matrix and search
  for linear relationships with the
  dependent variable.




                                      8
Click on graphs and
   then on scatter




                      9
   When the scatterplot
dialog box appears, select
          Matrix




                         10
A Matrix of Scatterplots
      will appear




Search for distinct linear relationships

                                           11
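For readers working outside SPSS, a rough pandas equivalent of the matrix scatterplot follows (synthetic data; the names salbegin, jobtime, and salary only mimic the employee.sav variables):

# Rough Python equivalent (synthetic data) of the SPSS matrix scatterplot:
# plot every pair of variables and look for distinct linear relationships
# with the dependent variable.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({"salbegin": rng.normal(17000, 5000, n),
                   "jobtime": rng.normal(80, 10, n)})
df["salary"] = 2 * df["salbegin"] + 100 * df["jobtime"] + rng.normal(0, 3000, n)

scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()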
Decomposition of the
 Sums of Squares




                       14
    Graphical Decomposition
           of Effects

   [Figure: scatter of Y against X showing the fitted line ŷ = a + bx and the
    mean line Ȳ. For a point Y_i, the vertical distances are labeled:
        Y_i − Ŷ_i = error (residual) effect
        Ŷ_i − Ȳ   = regression (model) effect
        Y_i − Ȳ   = total effect]




                                                                              15
         Decomposition of the
           sum of squares

Y − Ȳ = (Y − Ŷ) + (Ŷ − Ȳ)
total effect = error effect + regression (model) effect

Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)    per case i

Summing the squares over all cases (the cross-product term vanishes):

Σ(Y_i − Ȳ)² = Σ(Y_i − Ŷ_i)² + Σ(Ŷ_i − Ȳ)²    for the data set




                                                                                 16
Decomposition of the sum
      of squares
• Total SS = model SS + error SS
   and if we divide by df

   Σ(Y_i − Ȳ)²/(n − 1),    Σ(Y_i − Ŷ_i)²/(n − k − 1),    Σ(Ŷ_i − Ȳ)²/k


• This yields the Variance Decomposition:
  We have the total variance= model
  variance + error variance




                                                    17
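The decomposition can be verified numerically. A minimal numpy sketch (synthetic data) of SS_total = SS_model + SS_error and the corresponding mean squares:

# Numerical check (numpy, synthetic data) of SS_total = SS_model + SS_error
# and the mean squares obtained by dividing each sum of squares by its df.
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 2                              # n cases, k predictors
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=1.5, size=n)

X1 = np.column_stack([np.ones(n), X])      # add the intercept column
b = np.linalg.lstsq(X1, y, rcond=None)[0]  # OLS estimates
yhat = X1 @ b

ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((yhat - y.mean()) ** 2)
ss_error = np.sum((y - yhat) ** 2)
print(np.isclose(ss_total, ss_model + ss_error))   # True

ms_total = ss_total / (n - 1)              # total variance
ms_model = ss_model / k                    # model variance
ms_error = ss_error / (n - k - 1)          # error variance
print(ms_total, ms_model, ms_error)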
F test for significance and
R2 for magnitude of effect
• R² = model SS / total SS (the proportion of variance explained):

      R² = Σ(Ŷ_i − Ȳ)² / Σ(Y_i − Ȳ)²

• F test for model significance
    = model variance / error variance:

      F(k, n−k−1) = [R²/k] / [(1 − R²)/(n − k − 1)]
                  = [Σ(Ŷ_i − Ȳ)²/k] / [Σ(Y_i − Ŷ_i)²/(n − k − 1)]
                                         18
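A short numpy/statsmodels sketch (synthetic data) checking these formulas for R², adjusted R², and F against a fitted model:

# Sketch (numpy + statsmodels, synthetic data): R-squared as model SS over
# total SS, adjusted R-squared, and F = (R^2/k) / ((1-R^2)/(n-k-1)),
# checked against statsmodels' own values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, k = 120, 3
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.0, 0.0, -0.7]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
yhat = fit.fittedvalues

r2 = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
F = (r2 / k) / ((1 - r2) / (n - k - 1))

print(np.isclose(r2, fit.rsquared),
      np.isclose(adj_r2, fit.rsquared_adj),
      np.isclose(F, fit.fvalue))            # True True True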
ANOVA tests the
significance of the
Regression Model




                      19
     The Multiple
  Regression Equation
• We proceed to the derivation of its
  components:
  – The intercept: a
  – The regression parameters, b1 and b2



   Yi  a  b1 x1  b2 x2  ei




                                        20
Derivation of the Intercept
  y  a  bx  e
  e  y  a  bx
    n                     n                      n                      n

  e
  i 1
         i            y a
                      i 1
                                     i
                                                i 1
                                                       i     b xi
                                                                       i 1
                                                                  n
  Because by definition  ei  0
                                                                i 1
             n                       n                      n
  0      y a
         i 1
                      i
                                   i 1
                                            i      b xi
                                                           i 1

   n          n                n

   ai   yi  b  xi
  i 1       i 1             i 1

                  n                        n
  na   yi  b  xi
              i 1                        i 1

  a  y  bx
                                                                              21
  Derivation of the
Regression Coefficient
  Given : yi  a  b xi  ei
  ei  yi  a  b xi
   n                       n

  e
  i 1
              i       (y
                        i 1
                                       i      a  b xi )
   n                           n

   ei 
  i 1
                  2
                            ( yi  a  b xi ) 2
                           i 1
         n
    ei 2                                        n                n
       i 1
                            2 xi  ( yi )  2b  xi xi
         b                                     i 1             i 1
                                            n                n
  0                    2 xi  ( yi )  2b  xi xi
                                           i 1             i 1
                       n

                      x y         i   i
  b                  i 1
                         n

                       xi 2
                      i 1                                              22
• If we recall that the formula for
  the correlation coefficient can
  be expressed as follows:




                                  23
   r = Σ x_i y_i / √( Σ x_i² · Σ y_i² )

   where
   x = x_i − x̄
   y = y_i − ȳ

   b_j = Σ x_i y_i / Σ x_i²

from which it can be seen that the regression coefficient b_j
is a function of r:

   b_j = r · (sd_y / sd_x)
                                                                    24
Extending the bivariate case
To the Multiple linear regression case




                                         25
                ryx1  ryx2 rx1x2         sd y
 yx . x                             *           (6)
    1       2
                   1 r   2
                              x1 x2       sd x

                ryx2  ryx1 rx1x2         sd y
 yx . x                             *          (7)
   2    1
                  1 r   2
                             x1 x2        sd x



  It is also easy to extend the bivariate intercept
  to the multivariate case as follows.



 a  Y  b1 x1  b2 x2 (8)




                                                        26
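A numpy sketch (synthetic data, two predictors) checking equations (6)-(8) against the direct least-squares solution:

# Check (numpy, synthetic data) of equations (6)-(8): the two-predictor slopes
# built from the pairwise correlations and standard deviations, and the
# intercept a = ybar - b1*x1bar - b2*x2bar.
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.4 * x1 + rng.normal(size=n)          # correlated predictors
y = 2.0 + 1.2 * x1 - 0.8 * x2 + rng.normal(size=n)

r = np.corrcoef(np.column_stack([y, x1, x2]), rowvar=False)
ry1, ry2, r12 = r[0, 1], r[0, 2], r[1, 2]
sy, s1, s2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)

b1 = (ry1 - ry2 * r12) / (1 - r12 ** 2) * (sy / s1)   # eq. (6)
b2 = (ry2 - ry1 * r12) / (1 - r12 ** 2) * (sy / s2)   # eq. (7)
a = y.mean() - b1 * x1.mean() - b2 * x2.mean()        # eq. (8)

# compare with the direct least-squares solution
X = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.lstsq(X, y, rcond=None)[0], (a, b1, b2))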
Significance Tests for the
 Regression Coefficients
1. We find the significance of the
   parameter estimates by using the
   F or t test.

2. The R2 is the proportion of
   variance explained.
3. Adjusted R² = 1 − (1 − R²) · (n − 1)/(n − p − 1)

   where n = sample size
         p = number of parameters in the model


                                       27
   F and T tests for
significance for overall
         model
   F = model variance / error variance

     = (R²/p) / [ (1 − R²)/(n − p − 1) ]

   where
   p = number of parameters
   n = sample size

   t = √F

     = √[ (n − 2) r² / (1 − r²) ]    (bivariate case)

                                    28
     Significance tests

• If we are using a Type II sum of
  squares, we are dealing with
  the ballantine (the Venn diagram of
  overlapping variance). DV variance
  explained = a + b.




                                     29
    Significance tests

T tests for statistical significance



                  0
            t
                  sea
               b0
            t
                seb



                                   30
            Significance tests

  Standard error of the intercept:

  SE_a = √[ ( Σ(Y_i − Ŷ_i)² / (n − 2) ) · ( Σ x_i² / ( n Σ(x_i − x̄)² ) ) ]

  Standard error of the regression coefficient:

  SE_b = σ̂ / √( Σ(x_i − x̄)² )

  where σ̂ = std dev of the residuals:

  σ̂² = Σ e_i² / (n − 2)
                                                                    31
Programming Protocol
After invoking SPSS, proceed to File, Open, Data




                                                   32
Select a Data Set (we
choose employee.sav)
  and click on open




                        33
We open the data set




                       34
 To inspect the variable
formats, click on variable
  view on the lower left




                             35
    Because gender is a
string variable, we need to
    recode gender into a
       numeric format




                          36
We autorecode gender by
clicking on transform and
     then autorecode




                        37
We select gender and
move it into the variable
   box on the right




                            38
Give the variable a new
name and click on add
       new name




                          39
       Click on ok and the
     numeric variable sex is
             created




It has values 1 for female and 2 for male, and those value labels
are inserted.

                                                          40
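Outside SPSS, a rough pandas equivalent of this autorecode step is sketched below (illustrative only; it mimics AUTORECODE's alphabetical 1, 2, ... coding):

# Rough pandas equivalent (toy data) of SPSS AUTORECODE: turn the string
# variable gender into a numeric variable sex, assigning 1, 2, ... to the
# sorted distinct string values (f -> 1, m -> 2).
import pandas as pd

df = pd.DataFrame({"gender": ["m", "f", "f", "m", "f"]})
mapping = {value: code + 1
           for code, value in enumerate(sorted(df["gender"].unique()))}
df["sex"] = df["gender"].map(mapping)
print(df)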
To invoke Regression
      analysis,
  Click on Analyze




                       41
Click on Regression
   and then linear




                      42
 Select the dependent
variable: Current Salary




                       43
    Enter it in the
dependent variable box




                     44
 Entering independent
       variables
• These variables are entered in
  blocks. First come the potentially
  confounding covariates that have
  to be entered.
• We enter time on job,
  beginning salary, and previous
  experience.




                               45
  After entering the
covariates, we click on
         next




                          46
    We now enter the
  hypotheses we wish to
          test
• We are testing for minority or
  sex differences in salary after
  controlling for the time on job,
  previous experience, and
  beginning salary.
• We enter minority and numeric
  gender (sex)




                                 47
After entering these
 variables, click on
      statistics




                       48
 We select the following
statistics from the dialog
box and click on continue




                         49
Click on plots to obtain
 the plots dialog box




                       50
 We click on OK to run
the regression analysis




                      51
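For comparison, here is a statsmodels sketch of the same two-block strategy. The data are simulated and the variable names (salary, salbegin, jobtime, prevexp, minority, sex) merely mirror employee.sav; statsmodels has no entry "blocks", so the second block is simply a second, larger model:

# Sketch (statsmodels, synthetic data): enter the control covariates first,
# then add the hypothesis variables (minority, sex) and compare.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 474
df = pd.DataFrame({
    "salbegin": rng.normal(17000, 8000, n),
    "jobtime": rng.integers(60, 100, n),
    "prevexp": rng.integers(0, 400, n),
    "minority": rng.integers(0, 2, n),
    "sex": rng.integers(1, 3, n),          # 1 = female, 2 = male
})
df["salary"] = (-12000 + 1.8 * df.salbegin + 160 * df.jobtime
                - 20 * df.prevexp + 2800 * df.sex + rng.normal(0, 5000, n))

block1 = smf.ols("salary ~ salbegin + jobtime + prevexp", data=df).fit()
block2 = smf.ols("salary ~ salbegin + jobtime + prevexp + minority + sex",
                 data=df).fit()
print(block1.rsquared, block2.rsquared)    # R-square change across blocks
print(block2.summary())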
Navigation window (left)
and output window (right)
This shows that SPSS is reading the variables
correctly




                                                52
Variables Entered and
  Model Summary




                        53
     Omnibus ANOVA
Significance Tests for the Model at each stage of the
analysis




                                                        54
           Full Model
           Coefficients




CurSal   12036.3  1.83BeginSal
          165.17Jobtime  23.64 Exper
          2882.84 gender  1419.7 Minority




                                              55
    We omit insignificant variables and
   rerun the analysis to obtain trimmed
            model coefficients




CurSal   12126.5  1.85BeginSal
          163.20Jobtime  24.36 Exper
          2694.30 gender
                                         56
       Beta weights

• These are standardized
  regression coefficients, used to
  compare the relative contribution of
  each predictor to explaining the
  variance of the dependent variable
  within the model.




                                57
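A short numpy/statsmodels sketch (synthetic data) of how a beta weight rescales the raw coefficient by sd(x)/sd(y):

# Sketch (numpy + statsmodels, synthetic data): a standardized (beta) weight is
# the raw coefficient times sd(x)/sd(y); it puts predictors with very different
# raw scales on a comparable footing.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
X = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])   # very different scales
y = 3.0 + 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
betas = fit.params[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(betas)    # comparable contributions despite unequal raw scales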
   T tests and signif.

• These are the tests of
  significance for each
  parameter estimate.

• The significance levels have to
  be less than .05 for the
  parameter to be statistically
  significant.




                                58
Assumptions of the Linear
   Regression Model
1.   Linear Functional form
2.   Fixed independent variables
3.   Independent observations
4.   Representative sample and proper
     specification of the model (no
     omitted variables)
5.   Normality of the residuals or errors
6.   Equality of variance of the errors
     (homogeneity of residual variance)
7.   No multicollinearity
8.   No autocorrelation of the errors
9.   No outlier distortion
                                            59
          Explanation of the
            Assumptions
1.   Linear functional form
     1. A linear model does not detect curvilinear relationships
2.   Independent observations
     1. Representative samples
     2. Autocorrelation inflates the t, r, and F statistics and
        warps the significance tests
3.   Normality of the residuals
     1. Permits proper significance testing
4.   Equality of variance
     1. Heteroskedasticity precludes generalization and
        external validity
     2. It also warps the significance tests
5.   Multicollinearity prevents proper parameter
     estimation. It may also preclude computation of the
     parameter estimates completely if it is serious enough.
6.   Outlier distortion may bias the results: if outliers
     have high influence and the sample is not large
     enough, they may seriously bias the parameter
     estimates.




                                                                    60
 Diagnostic Tests for the
 Regression Assumptions
1.   Linearity tests: regression curve fitting
     1. No level shifts: one regime
2.   Independence of observations: runs test
3.   Normality of the residuals: Shapiro-Wilk or
     Kolmogorov-Smirnov test
4.   Homogeneity of variance of the residuals: White's
     general specification test
5.   No autocorrelation of residuals: Durbin-Watson test, or
     ACF or PACF of the residuals
6.   Multicollinearity: correlation matrix of the independent
     variables; condition index or condition number
7.   No serious outlier influence: tests of additive outliers:
     pulse dummies
     1.   Plot residuals and look for high-leverage residuals
     2.   Lists of standardized residuals
     3.   Lists of studentized residuals
     4.   Cook's distance or leverage statistics




                                                                   61
       Explanation of
        Diagnostics
1. Plots show linearity or
   nonlinearity of relationship
2. Correlation matrix shows
   whether the independent
   variables are collinear and
   correlated.
3. A representative sample is obtained
   with probability sampling




                                   62
       Explanation of
        Diagnostics
Tests for normality of the
 residuals. The residuals are
 saved and then subjected to
 either of:
  Kolmogorov-Smirnov test: compares
   the empirical cumulative distribution of
   the residuals against the theoretical
   cumulative normal distribution.
  Nonparametric Tests →
   1-Sample K-S test



                                    63
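A scipy sketch (simulated residuals standing in for the saved residuals) of the one-sample K-S test against a normal distribution; with estimated parameters the Lilliefors variant in statsmodels.stats.diagnostic gives more accurate p-values:

# Sketch (scipy, simulated residuals): one-sample Kolmogorov-Smirnov test of
# the residuals against a normal distribution with the residuals' own mean
# and standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
residuals = rng.normal(0, 2.0, 300)        # stand-in for saved regression residuals

stat, p = stats.kstest(residuals, "norm",
                       args=(residuals.mean(), residuals.std(ddof=1)))
print(stat, p)    # large p: no evidence against normality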
  Collinearity Diagnostics

Tolerance = 1 − R²
   (R² from regressing that predictor on the other predictors)

   Small tolerances imply problems.

Variance Inflation Factor (VIF) = 1 / Tolerance

Small intercorrelations among the independent variables
          mean VIF ≈ 1.
VIF > 10 signifies problems.




                                      64
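A statsmodels sketch (synthetic data with one nearly collinear pair of predictors) of tolerance and VIF:

# Sketch (statsmodels, synthetic data): tolerance and VIF for each predictor.
# VIF near 1 means little intercorrelation; VIF > 10 signals trouble.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for j, name in enumerate(["x1", "x2", "x3"], start=1):   # skip the constant
    vif = variance_inflation_factor(X, j)
    print(name, "VIF =", round(vif, 1), "tolerance =", round(1 / vif, 3))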
       More Collinearity
         Diagnostics
Condition number
   = maximum eigenvalue / minimum eigenvalue

   If condition numbers are between
   100 and 1000, there is moderate
   to strong collinearity.

   Condition index = √(condition number)

If the condition index > 30, there is strong collinearity.


                                                             65
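A numpy sketch (synthetic data) of the condition number and condition index from the eigenvalues of the scaled X'X matrix:

# Sketch (numpy, synthetic data): condition number = largest / smallest
# eigenvalue of the column-scaled X'X matrix, and condition index =
# sqrt(condition number); an index above 30 suggests strong collinearity.
import numpy as np

rng = np.random.default_rng(11)
n = 300
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

Xs = X / np.linalg.norm(X, axis=0)          # scale columns to unit length
eig = np.linalg.eigvalsh(Xs.T @ Xs)
cond_number = eig.max() / eig.min()
cond_index = np.sqrt(cond_number)
print(cond_number, cond_index)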
 Outlier Diagnostics
1. Residuals.
  1. The actual value minus the predicted
     value. This is otherwise known as the
     error.
2. Studentized Residuals
  1. the residuals divided by their
     standard errors without the ith
     observation
3. Leverage, called the Hat diag
  1. This is the measure of influence of
     each observation
4. Cook’s Distance:
  1. the change in the statistics that
     results from deleting the observation.
     Watch this if it is much greater than
     1.0.


                                           66
     Outlier detection

• Outlier detection involves
  determining whether the
  residual (error = actual −
  predicted) is an extreme negative
  or positive value.
• After running the regression, we
  may plot the residuals against the
  fitted values to determine which
  errors are large.


                                   67
  Create Standardized
       Residuals
• A standardized residual is one
  divided by its standard deviation.


                     yi  yi
                     ˆ
resid standardized 
                        s
where s  std dev of residuals




                                       68
 Limits of Standardized
        Residuals
If the standardized residuals
   have values in excess of 3.5
   or below −3.5, they are outliers.
If the absolute values are less
   than 3.5, as these are, then
   there are no outliers.
While outliers by themselves
   only distort mean prediction
   when the sample size is small
   enough, it is important to
   gauge the influence of outliers.
                                  69
     Outlier Influence

• Suppose we had a different
  data set with two outliers.
• We tabulate the standardized
  residuals and obtain the
  following output:




                                 70
Outlier a does not distort
  and outlier b does.




                             71
   Studentized Residuals

• Alternatively, we could form
  studentized residuals. These are
  distributed as a t distribution with
  df=n-p-1, though they are not
  quite independent. Therefore, we
  can approximately determine if
  they are statistically significant or
  not.
• Belsley et al. (1980)
  recommended the use of
  studentized residuals.

                                      72
 Studentized Residual


  e_i(s) = e_i / √( s²_(i) (1 − h_i) )

  where
  e_i(s) = studentized residual
  s_(i)  = standard deviation with the ith observation deleted
  h_i    = leverage statistic

These are useful in estimating the statistical significance
of a particular observation, of which a dummy variable
indicator is formed. The t value of the studentized residual
will indicate whether or not that observation is a significant
outlier.
In Stata, the command to generate studentized residuals, called rstudt, is:
predict rstudt, rstudent
                                                             73
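A statsmodels sketch (synthetic data with one planted outlier) of externally studentized residuals:

# Sketch (statsmodels, synthetic data): externally studentized residuals,
# i.e. each residual divided by s(i)*sqrt(1 - h_i) with the ith case left out
# of s.  Values beyond roughly +/-2 to +/-3 flag candidate outliers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 8.0                                  # plant an outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
rstudent = fit.get_influence().resid_studentized_external
print(np.argmax(np.abs(rstudent)), rstudent[0])   # case 0 stands out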
     Influence of Outliers

 1. Leverage is measured by the
    diagonal components of the hat
    matrix.
 2. The hat matrix comes from the
    formula for the regression of Y.

  Ŷ = Xβ = X(X'X)⁻¹X'Y

  where X(X'X)⁻¹X' = the hat matrix, H

  Therefore,

  Ŷ = HY


                                          74
     Leverage and the Hat
           matrix
1.   The hat matrix transforms Y into the
     predicted scores.
2.   The diagonals of the hat matrix indicate
     which values will be outliers or not.
3.   The diagonals are therefore measures of
     leverage.
4.   Leverage is bounded by two limits: 1/n and
     1. The closer the leverage is to unity, the
     more leverage the value has.
5.   The trace of the hat matrix = the number of
     parameters (p) estimated in the model.
6.   When the leverage > 2p/n, there is high
     leverage according to Belsley et al. (1980),
     cited in Long, J.F., Modern Methods of
     Data Analysis (p. 262). For smaller samples,
     Velleman and Welsch (1981) suggested that
     3p/n is the criterion.


                                              75
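A statsmodels sketch (synthetic data with one extreme x value) of the hat diagonals and the 2p/n screening rule:

# Sketch (statsmodels, synthetic data): hat-matrix diagonals (leverage) and
# the 2p/n screening rule, where p counts the estimated parameters including
# the intercept.  The trace of the hat matrix equals p.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 50
x = rng.normal(size=n)
x[0] = 6.0                                   # an extreme x value -> high leverage
y = 1.0 + 0.5 * x + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
h = fit.get_influence().hat_matrix_diag
p = 2                                        # intercept + one slope
print(np.isclose(h.sum(), p))                # trace of H equals p
print(np.where(h > 2 * p / n)[0])            # high-leverage cases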
                  Cook’s D

1. Another measure of influence.
2. This is a popular one. The
   formula for it is:

  Cook's D_i = (1/p) · [ h_i / (1 − h_i) ] · [ e_i² / ( s² (1 − h_i) ) ]

 Cook and Weisberg (1982) suggested that values of
 D that exceed the 50th percentile of the F distribution (df = p, n − p)
 are large.




                                                           76
    Using Cook’s D in
         SPSS
• Cook's distance is requested with the COOK
  keyword in the regression Save dialog (the /SAVE subcommand)
• Finding the influential outliers:
• List cases for which Cook's D > 4/n
• Belsley suggests 4/(n − k − 1) as a cutoff




                                           77
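A statsmodels sketch (synthetic data with one influential point) of Cook's distance and the 4/n screening cutoff mentioned above:

# Sketch (statsmodels, synthetic data): Cook's distance for every case and the
# rough 4/n screening cutoff; D much greater than 1 is the classic warning level.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(14)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
x[5], y[5] = 4.0, -10.0                      # plant an influential point

fit = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = fit.get_influence().cooks_distance
print(np.where(cooks_d > 4 / n)[0])          # cases flagged by the 4/n rule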
             DFbeta

• One can use the DFbetas to
  ascertain the magnitude of
  influence that an observation has
  on a particular parameter estimate
  if that observation is deleted.
               b j  b(i ) j u j
DFbeta j 
              u
                         2
                     j
                             (1  h j )
where u j  residuals of
regression of x on remaining xs.
                                          78
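A statsmodels sketch (synthetic data) of DFBETAS, the standardized change in each coefficient when a case is deleted; |DFBETAS| > 2/√n is a common screening cutoff:

# Sketch (statsmodels, synthetic data): DFBETAS, one row per case and one
# column per coefficient, flagging cases whose deletion shifts the slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(15)
n = 80
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + rng.normal(scale=0.8, size=n)
x[3], y[3] = 5.0, -5.0                       # point that drags the slope down

fit = sm.OLS(y, sm.add_constant(x)).fit()
dfbetas = fit.get_influence().dfbetas        # rows: cases, columns: coefficients
print(np.where(np.abs(dfbetas[:, 1]) > 2 / np.sqrt(n))[0])   # slope-influential cases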
 Programming Diagnostic
          Tests
Testing homoskedasticity
Select histogram and normal probability plot,
          and insert *zresid in Y
             and *zpred in X




    Then click on continue

                                             79
Click on Save to obtain
 the Save dialog box




                      80
We select the following




 Then we click on continue, go back to the Main
 Regression Menu and click on OK
                                                  81
     Check for linear
     Functional Form
• Run a matrix plot of the
  dependent variable against
  each independent variable to
  be sure that the relationship is
  linear.




                                     82
 Move the variables to be graphed
into the box on the upper right, and
            click on OK




                                   83
      Residual
Autocorrelation check


Durbin  Watson d
tests first  order
autocorrelation of residuals

d 
      n
            et  et 1  2
    i 1          et

   See significance tables for this
   statistic


                                      84
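A numpy/statsmodels sketch (simulated AR(1) residuals) computing d from its definition and checking it against the packaged function:

# Sketch (numpy + statsmodels, simulated autocorrelated errors): Durbin-Watson
# d from its definition, checked against statsmodels.  d near 2 means no
# first-order autocorrelation; d well below 2 means positive autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(16)
n = 200
e = np.zeros(n)
for t in range(1, n):                        # AR(1) errors with rho = 0.7
    e[t] = 0.7 * e[t - 1] + rng.normal()

d_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(d_manual, durbin_watson(e))            # both well below 2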
Run the autocorrelation function from
the Trends Module for a better analysis




                                          85
Testing for Homogeneity of variance




                                      86
Normality of residuals can be visually
inspected from the histogram with the
superimposed normal curve.
Here we check the skewness for
symmetry and the kurtosis for
peakedness




                                         87
Kolmogorov-Smirnov test: an
objective test of normality




                              88
Multicollinearity test with the
correlation matrix




                                  91
 Alternatives to Violations
      of Assumptions
• 1. Nonlinearity: transform to linearity if there is
  nonlinearity, or run a nonlinear regression.
• 2. Nonnormality: run a least absolute deviations
  regression or a median regression (available in
  other packages), or generalized linear models
  (S-PLUS glm, Stata glm, or SAS PROC MODEL or
  PROC GENMOD).
• 3. Heteroskedasticity: weighted least squares
  regression (SPSS) or the White estimator (SAS,
  Stata, S-PLUS). One can use a robust regression
  procedure (SAS, Stata, or S-PLUS) to downweight
  the effect of outliers in the estimation.
• 4. Autocorrelation: run AREG in the SPSS Trends
  module, or either the Prais or Newey-West procedure
  in Stata.
• 5. Multicollinearity: principal components regression,
  ridge regression, or proxy variables; 2SLS in SPSS,
  ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.




                                                   94
      Model Building
       Strategies
• Specific to General: Cohen
  and Cohen
• General to Specific: Hendry
  and Richard
• Extreme Bounds analysis: E.
  Leamer.




                                95
       Nonparametric
        Alternatives
1. If there is nonlinearity, transform
   to linearity first.
2. If there is heteroskedasticity, use
   robust standard errors with
   STATA or SAS or SPLUS.
3. If there is non-normality, use
   quantile regression with
   bootstrapped standard errors in
   STATA or SPLUS.
4. If there is autocorrelation of the
   residuals, use Newey-West
   standard errors, or a first-order
   autocorrelation correction with
   AREG. If there is higher-order
   autocorrelation, use Box-Jenkins
   ARIMA modeling.

                                         96

								