

                 Influence Statistics, Outliers, and Collinearity Diagnostics

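   Standardized Residuals – Residuals divided by their estimated standard errors (like t-statistics).
    Observations with values larger than $t_{\alpha/(2n),\, n-p'}$ in absolute value are considered outliers,
    where p' is the number of regression coefficients, including the intercept.

    $$ r_i = \frac{e_i}{s\sqrt{1 - v_{ii}}}, \qquad s = \sqrt{MS_{Residual}}, \qquad v_{ii} = i\text{th diagonal element of } P = X(X'X)^{-1}X' $$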
   Studentized Residuals – Similar to standardized residuals, except that $s_{(i)} = \sqrt{MS_{Residual(i)}}$ is
    computed from the regression fit to the remaining n-1 cases. Observations with values larger than
    $t_{\alpha/(2n),\, n-p'-1}$ in absolute value are considered outliers.

    $$ r_i^* = \frac{e_i}{s_{(i)}\sqrt{1 - v_{ii}}}, \qquad s_{(i)} = \sqrt{MS_{Residual(i)}}, \qquad v_{ii} = i\text{th diagonal element of } P $$

    These are labelled as RSTUDENT by SAS.
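The two residual definitions can be compared side by side in a minimal sketch. The data below are hypothetical, and a single predictor is assumed, so p' = 2 and the leverage takes its simple-regression closed form. The deleted-case variance comes from the identity this handout states alongside DFFITS, so no refitting is needed.

```python
import math

# Hypothetical data with one predictor, so p' = 2 (intercept + slope).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
n, p_prime = len(x), 2

# Least-squares fit of y = b0 + b1*x.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # residuals e_i
v = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # v_ii, simple-regression closed form
s2 = sum(ei ** 2 for ei in e) / (n - p_prime)       # s^2 = MS_Residual

r_std, r_stud = [], []
for ei, vi in zip(e, v):
    # s_(i)^2 from (n - p' - 1) s_(i)^2 = (n - p') s^2 - e_i^2 / (1 - v_ii).
    s2_del = ((n - p_prime) * s2 - ei ** 2 / (1 - vi)) / (n - p_prime - 1)
    r_std.append(ei / math.sqrt(s2 * (1 - vi)))       # standardized residual r_i
    r_stud.append(ei / math.sqrt(s2_del * (1 - vi)))  # studentized residual r_i*

for i, (r, r_star) in enumerate(zip(r_std, r_stud), start=1):
    print(f"case {i}: r = {r:+.3f}, r* = {r_star:+.3f}")
```

The Bonferroni t cutoffs are not computed here because the Python standard library has no t quantile function; a statistics package would supply them.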


   Leverage Values (Hat Diag) – Measure of how far an observation is from the others in terms of the levels
    of the independent variables (not the dependent variable). Observations with values larger than 2p’/n are
    considered to be potentially highly influential.

    $$ v_{ii} = i\text{th diagonal element of } P $$
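As a small illustration of the 2p'/n rule, the sketch below computes leverages for hypothetical predictor values. It assumes a single predictor (p' = 2), for which the hat diagonal reduces to 1/n + (x_i - x̄)²/S_xx; a model with several predictors would need the full matrix P.

```python
# Hypothetical predictor values; the last case sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
n, p_prime = len(x), 2  # p' = 2: intercept + one slope

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Closed form of the hat diagonal for simple linear regression (an assumption of this sketch).
v = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

cutoff = 2 * p_prime / n
flagged = [i + 1 for i, vi in enumerate(v) if vi > cutoff]
print("leverages:", [round(vi, 3) for vi in v])
print("cutoff 2p'/n =", round(cutoff, 3), "flagged cases:", flagged)
```

The leverages depend only on x, which matches the statement that leverage ignores the dependent variable. They also sum to p'.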
   DFFITS – Measure of how much an observation has affected its fitted value from the regression model.
    Values larger than $2\sqrt{p'/n}$ in absolute value are considered highly influential.

    $$ DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{s_{(i)}\sqrt{v_{ii}}} = \left(\frac{v_{ii}}{1 - v_{ii}}\right)^{1/2} \frac{e_i}{s_{(i)}\sqrt{1 - v_{ii}}}, \qquad (n - p' - 1)\, s_{(i)}^2 = (n - p')\, s^2 - \frac{e_i^2}{1 - v_{ii}} $$

   Note that $DFFITS_i$ measures the number of standard errors by which the fitted value for the ith case shifts
    when that case is not used in the regression fit.
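The definition can be checked by brute force: refit without each case in turn and scale the change in that case's fitted value. This is only a sketch on hypothetical data with one predictor (p' = 2); `fit_line` is a helper introduced here for illustration.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data with one predictor, so p' = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
n, p_prime = len(x), 2

b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
v = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages, simple-regression form

dffits = []
for i in range(n):
    # Refit without case i and compute s_(i) from the deleted fit.
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0_del, b1_del = fit_line(x_del, y_del)
    sse_del = sum((yk - (b0_del + b1_del * xk)) ** 2 for xk, yk in zip(x_del, y_del))
    s_del = math.sqrt(sse_del / (n - 1 - p_prime))
    shift = (b0 + b1 * x[i]) - (b0_del + b1_del * x[i])  # Y-hat_i minus Y-hat_i(i)
    dffits.append(shift / (s_del * math.sqrt(v[i])))

cutoff = 2 * math.sqrt(p_prime / n)
for i, d in enumerate(dffits, start=1):
    print(f"case {i}: DFFITS = {d:+.3f}", "(influential)" if abs(d) > cutoff else "")
```

Refitting n times is wasteful on real data; the closed form in terms of e_i, v_ii, and s_(i) gives the same values from a single fit.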

   DFBETAS – Measure of how much an observation has affected the estimate of a regression coefficient
    (there is one DFBETA for each regression coefficient, including the intercept). Values larger than
    $2/\sqrt{n}$ in absolute value are considered highly influential.

    $$ DFBETAS_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{s_{(i)}\sqrt{c_{jj}}}, \qquad c_{jj} = (j+1)\text{st diagonal element of } (X'X)^{-1} $$

   Note that $DFBETAS_{j(i)}$ measures the number of standard errors by which the regression coefficient for the
    jth predictor variable shifts when the ith case is not used in the regression fit.
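A leave-one-out sketch of DFBETAS follows, again on hypothetical data with one predictor. With design columns [1, x], the diagonal of (X'X)^{-1} has the closed forms used below: c_00 = 1/n + x̄²/S_xx for the intercept and c_11 = 1/S_xx for the slope. The helper `fit_line` is illustrative, not part of any package.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data with one predictor, so p' = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
n, p_prime = len(x), 2

b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
# Diagonal of (X'X)^{-1}: c[0] for the intercept, c[1] for the slope.
c = [1 / n + xbar ** 2 / sxx, 1 / sxx]

cutoff = 2 / math.sqrt(n)
dfbetas = []
for i in range(n):
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0_del, b1_del = fit_line(x_del, y_del)
    sse_del = sum((yk - (b0_del + b1_del * xk)) ** 2 for xk, yk in zip(x_del, y_del))
    s_del = math.sqrt(sse_del / (n - 1 - p_prime))  # s_(i) from the deleted fit
    dfbetas.append(((b0 - b0_del) / (s_del * math.sqrt(c[0])),
                    (b1 - b1_del) / (s_del * math.sqrt(c[1]))))

for i, (d0, d1) in enumerate(dfbetas, start=1):
    print(f"case {i}: intercept {d0:+.3f}, slope {d1:+.3f}  (cutoff {cutoff:.3f})")
```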
   Cook's D – Measure of the aggregate impact of each observation on the group of regression coefficients, as
    well as the group of fitted values. Values larger than $F_{.50,\, p',\, n-p'}$ are considered highly influential.

    $$ D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'(X'X)(\hat{\beta}_{(i)} - \hat{\beta})}{p' s^2} = \frac{r_i^2}{p'}\left(\frac{v_{ii}}{1 - v_{ii}}\right) = \frac{(\hat{Y}_{(i)} - \hat{Y})'(\hat{Y}_{(i)} - \hat{Y})}{p' s^2} $$

   Note that $D_i$ is like an F-statistic used for testing $K'\beta = m$ where K' = I. It can also be thought of as the
    shift in a $100(1-\alpha)\%$ confidence ellipse: $D_i = F_{\alpha,\, p',\, n-p'}$ means that the coefficient estimates move
    to the boundary of that ellipse when the ith case is not used to fit the regression model.
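The equality of the standardized-residual form and the fitted-value form of D_i can be seen numerically. The sketch below assumes hypothetical data with one predictor (p' = 2) and uses an illustrative helper, `fit_line`. The F percentile for the cutoff is not computed, since the Python standard library has no F quantile function.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data with one predictor, so p' = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
n, p_prime = len(x), 2

b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
fitted = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]
v = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages, simple-regression form
s2 = sum(ei ** 2 for ei in e) / (n - p_prime)     # s^2 = MS_Residual

d_short, d_refit = [], []
for i in range(n):
    # Shortcut form: (r_i^2 / p') * v_ii / (1 - v_ii).
    r2 = e[i] ** 2 / (s2 * (1 - v[i]))
    d_short.append(r2 / p_prime * v[i] / (1 - v[i]))
    # Fitted-value form: refit without case i, then compare all n fitted values.
    b0_del, b1_del = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift2 = sum((b0_del + b1_del * xk - fk) ** 2 for xk, fk in zip(x, fitted))
    d_refit.append(shift2 / (p_prime * s2))

for i, (a, b) in enumerate(zip(d_short, d_refit), start=1):
    print(f"case {i}: D = {a:.4f} (shortcut), {b:.4f} (refit)")
```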


   COVRATIO – Measure of the impact of each observation on the variances (and standard errors) of the
    regression coefficients and their covariances. Values outside the interval $1 \pm 3p'/n$ are considered
    highly influential.

    $$ COVRATIO_i = \frac{\det\left(s_{(i)}^2 (X_{(i)}' X_{(i)})^{-1}\right)}{\det\left(s^2 (X'X)^{-1}\right)} $$
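The determinant ratio can be sketched directly for the one-predictor case, where X'X is 2 x 2 and det(X'X) = n S_xx. The data and the helper `line_fit_stats` are hypothetical; the sketch relies on det(c A^{-1}) = c^{p'} / det(A) for a p' x p' matrix A.

```python
def line_fit_stats(x, y):
    """Return (s^2, det(X'X)) for the least-squares line with design columns [1, x]."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (n - 2), n * sxx  # det(X'X) = n * Sxx for this design

# Hypothetical data with one predictor, so p' = 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
n, p_prime = len(x), 2
s2, det_full = line_fit_stats(x, y)

low, high = 1 - 3 * p_prime / n, 1 + 3 * p_prime / n
covratio = []
for i in range(n):
    s2_del, det_del = line_fit_stats(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    num = s2_del ** p_prime / det_del  # det(s_(i)^2 (X_(i)'X_(i))^{-1})
    den = s2 ** p_prime / det_full     # det(s^2 (X'X)^{-1})
    covratio.append(num / den)

for i, cr in enumerate(covratio, start=1):
    note = "" if low <= cr <= high else "(outside interval)"
    print(f"case {i}: COVRATIO = {cr:.4f}", note)
```

With only six cases the interval 1 ± 3p'/n is very wide, so the flags here are illustrative rather than meaningful.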


   Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is with
    the other predictors in the model. Values larger than 10 for a predictor imply large inflation of the standard
    errors of the regression coefficients due to this variable being in the model.

    $$ VIF_k = \frac{1}{1 - R_k^2} $$

    where $R_k^2$ is the coefficient of multiple determination when $X_k$ is regressed on the p-1 remaining
    independent variables.
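For a model with exactly two predictors, R_k² is just the squared correlation between them, which makes the VIF easy to sketch by hand. The predictor values below are hypothetical, and `r_squared` is an illustrative helper; with more than two predictors, each R_k² would come from a multiple regression instead.

```python
def r_squared(x, y):
    """R^2 from the simple regression of y on x (the squared correlation)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical predictors: x2 is nearly a linear function of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

r2 = r_squared(x2, x1)  # R_1^2: x1 regressed on the one remaining predictor
vif = 1 / (1 - r2)      # identical for x1 and x2 when p = 2
print(f"R^2 = {r2:.5f}, VIF = {vif:.1f}", "(exceeds 10)" if vif > 10 else "")
```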

                     Obtaining Influence Statistics and Studentized Residuals in SPSS

A. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and set of Independent
   variables from your model of interest (possibly having been chosen via an automated model selection
   method).
B. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics, and All Cases. Select
   CONTINUE.
C. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM. These give a plot of
   studentized residuals versus standardized predicted values, and a histogram of standardized residuals
   (residual/sqrt(MSE)). Select CONTINUE.
D. Under SAVE, select Studentized Residuals, Cook’s, Leverage Values, Covariance Ratio, Standardized
   DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to your original data
   worksheet.

                      Obtaining Influence Statistics and Studentized Residuals in SAS
PROC REG;
   MODEL Y = X1 X2 … XP / R INFLUENCE VIF;
RUN;

                      Obtaining Influence Statistics and Studentized Residuals in R

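# Fit the regression model (predictors x1 through xp).
reg1.reg <- lm(y ~ x1 + … + xp)
# Studentized (deleted) residuals, the analogue of RSTUDENT in SAS.
reg1.rstudent <- rstudent(reg1.reg)
# DFBETAS, DFFITS, COVRATIO, Cook's D, and hat (leverage) values for each case.
reg1.inf <- influence.measures(reg1.reg)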

								