Weighted least squares Heteroscedasticity that is related to the by qok10781

VIEWS: 393 PAGES: 20

									                                 Weighted least squares

     Heteroscedasticity that is related to the level of the target (y) variable can be handled
using variance–stabilizing transformations. Sometimes, nonconstant variance is driven by
other characteristics in the data. In that case, weighted least squares is the appropriate
     The idea behind weighted least squares (WLS) is that least squares is still a good
thing to do if the target and predicting variables are transformed to give a model with
errors with constant variance. Say V (εi ) = σi . To keep things simple, consider simple
regression. The regression model is

                                    yi = β0 + β1 xi + εi .

If we divide both sides of this equation by σi we get

                              yi         1            xi         εi
                                 = β0         + β1           +      .
                              σi         σi           σi         σi

This can be rewritten
                                  yi = β0 z1i + β1 z2i + δi ,

where yi , z1i , z2i , and δi are the obvious substitutions from the previous equation and
V (δi ) = 1 for all i. Thus, ordinary least squares (OLS) estimation (without an intercept
term) of y ∗ on z1 and z2 gives fully efficient estimates of β0 and β1 . Note that using a
constant multiple of σi works just as well, since the only requirement is that V (δi ) be
constant for all i.
     The value WTi = 1/σi is the ith value of the weighting variable. Ordinary least squares

is a special case of WLS with WTi = 1 for all i (and, in fact, most regression packages only
include code for WLS, with OLS the default special case). The problem is that σi is
unknown, and must be estimated.
     It is important to recognize the reasons behind the use of weighted least squares. The
goal is not to improve measures of fit like R2 or F ; rather, the goal is to analyze the
data in an appropriate fashion. There are several advantages to addressing nonconstant

c 2009, Jeffrey S. Simonoff                                                                   1
(1) The estimates of β are more efficient. That is, on average, the WLS estimates should
    be closer to the true regression coefficients than the OLS estimates are.
(2) Predictions are more sensible. If the underlying variability of a certain type of ob-
    servation is larger than that for another type of observation, the prediction interval
    should reflect that. This is not done under OLS, but it is under WLS.
    Weighted least squares can be formulated more generally using matrix notation, since
it is just a special case of generalized least squares. The model is

                                        y = Xβ + ε,

with E(ε) = 0 and V ar(ε) = V σ 2 , where V is the n×n matrix with σi /σ 2 on the diagonal
and zeroes elsewhere. Matrix manipulations are then as follows:

                             ⇒ V −1/2 y = V −1/2 Xβ + V −1/2 ε

                                  ⇔ y∗ = Zβ + δ.

                                β = (Z ′ Z)−1 Z ′ y∗

                                   = (X ′ V −1 X)−1 X ′ V −1 y.

               Addressing nonconstant variance — practical issues

    Consider a regression analysis based on regressing y on x. Examination of residual
plots (including plots of residuals versus each of the predictors xj ) suggests potential
nonconstant variance (heteroscedasticity). What should you do? The following steps
summarize a systematic approach to the problem. As we go along, we’ll examine a real
data set. The data have as a target variable the price of a 2003 model automobile, while the
predicting variables are the horsepower of the auto’s engine and the number of cylinders
in the engine (source: Consumer Reports, April 2003; the values given refer to the top–of–
the–line model).
    A scatter plot of the variables immediately suggests heteroscedasticity. Since the
target variable here is long right–tailed and represents dollars, a reasonable correction is a
variance–stabilizing transformation (taking logs, or a semilog model).

c 2009, Jeffrey S. Simonoff                                                                   2
    The logged plot looks much better, but there are still signs of nonconstant variance,
particularly in the plot on horsepower.

c 2009, Jeffrey S. Simonoff                                                              3
    Here are regression results.

    Regression Analysis: Logged price versus Horsepower, Cylinders

    The regression equation is
    Logged price = 3.91 + 0.00184 Horsepower + 0.0218 Cylinders

    Predictor           Coef         SE Coef             T         P         VIF
    Constant         3.90626         0.03955         98.76     0.000
    Horsepow       0.0018361       0.0001590         11.55     0.000         1.8
    Cylinder        0.021834        0.008750          2.50     0.015         1.8

    S = 0.08007         R-Sq = 79.7%           R-Sq(adj) = 79.2%

    Analysis of Variance

    Source                DF             SS             MS          F          P
    Regression             2         2.0616         1.0308     160.77      0.000
    Residual Error        82         0.5258         0.0064

    The regression fits well (R2 = 79.7%), and each variable is highly statistically signifi-
cant. Here are residual plots:

c 2009, Jeffrey S. Simonoff                                                                4
c 2009, Jeffrey S. Simonoff   5
    These plots constitute step 1 in the analysis of potential heteroscedasticity:

(1) Save the standardized residuals from the regression (SRES, say). Plot them
    versus any variables that make sense.

    In this case there appears to be heteroscedasticity related to each of the variables.
Note, by the way, that there are potential outliers in the data (standardized residuals
around ±3), but they might not really be outliers — they might be far from the regression
line because of higher variability off the line.

(2) Now, create two new variables: the absolute values of the standardized
    residuals (ABSRES, say), and the natural logarithm of the squares of the
    standardized residuals (LGSRESSQ, say). The latter variable can be formed
    in MINITAB using the transformation Let ’LGSRESSQ’ = LN(SRES*SRES).

    A simple, but reasonably effective, test for nonconstant variance is Levene’s test.

(3) Regress ABSRES on any variables you have that might be related to the non-
    constant variance. This could include the xj ’s, as well as other variables;
    indicator (0/1) variables are particularly appropriate if nonconstant vari-
    ance is linked to group membership. You should avoid including variables

c 2009, Jeffrey S. Simonoff                                                                6
    that are not related to the nonconstant variance if possible, since their
    inclusion makes it harder to detect nonconstant variance if it exists. A
    significant F –statistic for Levene’s test suggests the presence of noncon-
    stant variance. Minitab includes a version of Levene’s test (under ANOVA / Test
    for Equal Variances) if the predictors being used are only categorical; we’ll talk more
    about that when we discuss analysis of variance models. It’s a good idea to look
    at scatter plots of the absolute residuals versus the Levene’s test predic-
    tors, to make sure that you’re using the right predictors, and that unusual
    observations aren’t unduly affecting the test (plots using the square root
    of the absolute residuals on the vertical axis are sometimes easier to inter-
    pret). DON’T include the target (y) variable as a predictor in the Levene’s
    Here are scatter plots of the absolute standardized residuals versus the two possible
heteroscedasticity predictors. They show that the absolute residuals are larger on aver-
age (more variability off the regression line) for higher horsepower values, although the
relationship with the number of cylinders is less clear.

c 2009, Jeffrey S. Simonoff                                                                7
   Here is the Levene’s test output:

   Regression Analysis: absres versus Horsepower, Cylinders

   The regression equation is
   absres = 0.502 + 0.00256 Horsepower - 0.0529 Cylinders

   Predictor          Coef        SE Coef            T         P       VIF
   Constant         0.5015         0.2977         1.68     0.096
   Horsepow       0.002560       0.001197         2.14     0.035       1.8
   Cylinder       -0.05294        0.06587        -0.80     0.424       1.8

   S = 0.6028         R-Sq = 6.0%           R-Sq(adj) = 3.8%

   Analysis of Variance

   Source               DF            SS             MS           F       P
   Regression            2        1.9174         0.9587        2.64   0.078
   Residual Error       82       29.7957         0.3634
   Total                84       31.7130

c 2009, Jeffrey S. Simonoff                                                     8
    The test is significant at a .10 level; in particular, it seems that horsepower alone
accounts for the heteroscedasticity (p = .035):

    Regression Analysis: absres versus Horsepower

    The regression equation is
    absres = 0.341 + 0.00192 Horsepower

    Predictor           Coef         SE Coef               T              P
    Constant          0.3413          0.2207            1.55          0.126
    Horsepow       0.0019169       0.0008889            2.16          0.034

    S = 0.6015           R-Sq = 5.3%           R-Sq(adj) = 4.2%

    Analysis of Variance

    Source                 DF            SS             MS                  F       P
    Regression              1        1.6827         1.6827               4.65   0.034
    Residual Error         83       30.0304         0.3618
    Total                  84       31.7130

    Say nonconstant variance is indicated (this could be from the plots, or the Levene’s
test, or both; don’t refer only to the formal test, as it is not guaranteed to always have
enough power to identify nonconstant variance). The variances need to be estimated. An
exponential/linear model for them is often used, whose parameters can be estimated from
the data. The model is
                                                                 

                            var(εi ) = σi = σ 2 exp 
                                                            λj zij  ,                    (1)

where zij is the value of the j th variance predictor for the ith case (these would presumably
be the predictors that were used in part (3) for the Levene’s test).
(4) The coefficient (λj ) values can be estimated using regression. Perform
    a regression of LGSRESSQ on the variance predictor variables, and record
    the fitted regression coefficients (don’t worry about measures of fit for this
    regression). DON’T include the target (y) variable as a potential predictor
    here. Save the fitted values from this regression.

c 2009, Jeffrey S. Simonoff                                                                   9
    Here are the results for the auto price data, using horsepower as the predictor:

    Regression Analysis: lgsressq versus Horsepower

    The regression equation is
    lgsressq = - 2.50 + 0.00548 Horsepower

    Predictor          Coef        SE Coef             T          P
    Constant        -2.5020         0.7204         -3.47      0.001
    Horsepow       0.005477       0.002901          1.89      0.063

    S = 1.963          R-Sq = 4.1%           R-Sq(adj) = 3.0%

    Analysis of Variance

    Source               DF            SS             MS           F          P
    Regression            1        13.734         13.734        3.56      0.063
    Residual Error       83       319.915          3.854
    Total                84       333.649

(5) Create a weight variable for use in the weighted least squares analysis.
    The weights are estimates of the inverse of the variance of the errors for
    each observation. They have the form WT = 1/ exp(FITS1), where FITS1 is
    the variable with fitted values from the regression in step 4.
    So, for the auto price data, the weight variable has the form

                       WT = 1/ exp(−2.50 + .0055 ∗ Horsepower),

based on the regression output.
(6) Perform a weighted least squares regression, specifying WT as the weighting
    variable (click on Options, and enter WT under Weights:). You should redo
    steps 1–3 to make sure that the nonconstant variance has been corrected.
    Here is the regression output for the WLS regression:

    Regression Analysis: Logged price versus Horsepower, Cylinders

    Weighted analysis using weights in wt

c 2009, Jeffrey S. Simonoff                                                              10
   The regression equation is
   Logged price = 3.87 + 0.00205 Horsepower + 0.0202 Cylinders

   Predictor         Coef      SE Coef             T        P     VIF
   Constant       3.86649      0.03763        102.74    0.000
   Horsepow     0.0020550    0.0001834         11.20    0.000     1.8
   Cylinder      0.020231     0.008575          2.36    0.021     1.8

   S = 0.1412        R-Sq = 78.3%        R-Sq(adj) = 77.7%

   Analysis of Variance

   Source              DF          SS             MS         F       P
   Regression           2      5.8836         2.9418    147.60   0.000
   Residual Error      82      1.6343         0.0199
   Total               84      7.5179

c 2009, Jeffrey S. Simonoff                                                11
c 2009, Jeffrey S. Simonoff   12
    The nonconstant variance has apparently been addressed. This is supported by the
Levene’s test (note that you should redo the Levene’s test based on the standardized
residuals after doing WLS to check this, and remember that the Levene’s test regression
does not use weights):

    Regression Analysis: absres versus Horsepower, Cylinders

    The regression equation is
    absres = 0.902 + 0.00135 Horsepower - 0.0717 Cylinders

    Predictor          Coef        SE Coef            T         P         VIF
    Constant         0.9025         0.3010         3.00     0.004
    Horsepow       0.001349       0.001210         1.11     0.268         1.8
    Cylinder       -0.07170        0.06660        -1.08     0.285         1.8

    S = 0.6095           R-Sq = 1.7%         R-Sq(adj) = 0.0%

    Analysis of Variance

    Source                 DF          SS             MS           F        P
    Regression              2      0.5355         0.2677        0.72    0.489
    Residual Error         82     30.4600         0.3715
    Total                  84     30.9955

c 2009, Jeffrey S. Simonoff                                                           13
    There is apparently an outlier (the Saab 9-3, an expensive car with moderate horse-
power and only 4 cylinders), but omitting it changes little. Overall, given the number
of cylinders, larger engine cars cost more (each additional horsepower multiplying the ex-
pected price by 10.00206 = 1.005, or a .5% increase), and given horsepower, each additional
cylinders cost more (two additional cylinders multiply expected price by 10.0404 = 1.098,
or a 9.8% increase).
    The R2 value given by Minitab is not
                                            Residual SS
                                             Total SS
in this case, because of the weighting. In fact, there is no single universally accepted
definition of R2 for weighted least squares fitting. Proposed measures use relationships
between R2 and other statistics in OLS. One such measure is based on the closeness of the
observed and fitted y values,

                Rw1 = corr(Fitted values, Target variable values)2 .

So, the fitted values from the WLS should be saved, and then the correlation coefficient

    Correlations (Pearson)

    Correlation of Logged price and FITS1 = 0.892

Thus, Rw1 = .8922 = .796, which can be compared to the OLS value of .797. Another
common measure (the one Minitab reports) uses the connection between F and R2 in
ordinary least squares regression,

                                  2              pF
                                 Rw2 =                     .
                                            pF + n − p − 1
Thus, the R2 for the WLS fit above is

                                2         (2)(147.6)
                               Rw2 =                   = .783.
                                       (2)(147.6) + 82

c 2009, Jeffrey S. Simonoff                                                               14
The two measures differ in that Rw2 downweights the effect of observations with higher
variability on the measure of the strength of the fit, while Rw1 does not.
    Another possible use of this model is to use it to make predictions of prices from
horsepower and size values. For example, what would we have predicted the price for an
Audi A6 to have been (this car is not in the sample)? Its horsepower was 455, with an
8–cylinder engine; here is Minitab output asking for a prediction interval based on these
horsepower and cylinder values:

  * WARNING * The prediction interval output assumes a weight of 1.
            * An adjustment must be made if a weight other than 1 is used.

  New Obs       Fit        SE Fit            95.0% CI                  95.0% PI
  1          4.9634        0.0353     (   4.8932, 5.0335)       (   4.6739, 5.2528)

  Values of Predictors for New Observations

  New Obs    Horsepow    Cylinder
  1               455        8.00

    There are two things about this output that deserve comment. First, we must remem-
ber that the model was in semilog form, so the intervals given are in terms of logged price.
That is, we must antilog the prediction interval to obtain the version in terms of dollars.
    There is another problem, however, that is highlighted by the warning message that
Minitab gives you. Minitab does not allow a weight to be given for the prediction being
made, which means that unless the appropriate weight turns out to be 1, the prediction
interval is not correct (the standard error of the fitted value and confidence interval are
still correct). We wish to use the weight that the Audi A6 would have been assigned if it
had been included in the sample; from the model used to determine the weights, that is
WT = 1/ exp(−2.50 + .0055 ∗ Horsepower), or 1/ exp(−2.50 + .0055 ∗ 455) = 1.025.
    The standard error of the fitted value (used for confidence intervals) is given correctly,
but we need to calculate the standard error of the predicted value. This equals

             (Standard error of f itted value)2 + (Residual M S)/(W eight).

c 2009, Jeffrey S. Simonoff                                                                15
Then, the prediction interval is

             P redicted value ± tn−p−1 (Standard error of predicted value).

The standard error of the predicted value in this case is    .03532 + .0199/1.025 = .1437.
For n = 85 the appropriate critical value for a 95% interval is 1.989, giving prediction
                         4.9634 ± (1.989)(.1437) = (4.678, 5.249).

Finally, antilogging gives the prediction interval for the Audi A6 price, ($47643, $177509)
[no, that’s not a misprint!], with a point estimate of the price being $91918 (the actual
price was $58700, a good price for such a powerful car).
    The importance of using the WLS model is obvious when this prediction interval is
compared to the one from the OLS model:

  Predicted Values for New Observations

  New Obs     Fit    SE Fit         95.0% CI            95.0% PI
  1       4.91635   0.02762   ( 4.86141, 4.97129) ( 4.74785, 5.08485) X
  X denotes a row with X values away from the center

The point estimate for price is not very different from that from the WLS model ($82480
rather than $91918), but antilogging the interval gives ($55956, $121577), which is slightly
more than half the width of the WLS interval (a better comparison is that it is 40% nar-
rower in the log scale). On the other hand, a prediction interval for a low horsepower auto
would probably be narrower than that from OLS, correctly reflecting the lower variability
off the regression line there. Note also that the Audi A6 is considered very unusual in the
OLS prediction, because of its high horsepower, but this is not true in the WLS prediction,
since it is downweighted so strongly.
    Note that this same type of correction is needed even for rough prediction intervals.
The value of s from the WLS output is not in any way comparable to that from an OLS
fit. In an OLS fit, the rough prediction interval for any future observation is ±2ˆ , and s is

c 2009, Jeffrey S. Simonoff                                                                16
        ˆ                                                                                  σ
used as σ . In a WLS fit, the rough prediction interval for the ith future observation is ±2ˆi ,
requiring a (potentially) different estimate of the standard deviation for each prediction.
This corresponds to ±2s/ wi , where wi is the appropriate weight (this corresponds to
the exact interval constructed above, except that 2 is used instead of the exact t-based
critical value, and the standard error of the fitted value is ignored). Thus, s does not have
any physical interpretation in a WLS model unless it is paired with an appropriate weight
    Steps (4) and (5) are more straightforward in the special case where the nonconstant
variance is being driven by the existence of two or more subgroups in the data, as follows:
(4′ ) Create a variable that defines the subgroups by appropriate codes (e.g.,
    1, 2, 3, etc.); call this variable GROUP. Now, construct descriptive statistics
    of the standardized residuals variable SRES, entering GROUP under the By
    variable option.
(5′ ) The weight variable for the weighted least squares analysis contains esti-
    mates of 1 over the variance of the errors for each observation. These are
    obtained from the output in (4′ ) by looking under StDev for each group.
    So, for example, say there were three groups in the data coded 1, 2 and
    3, and for group 1 the entry under StDev is 1.5, for group 2 the entry un-
    der StDev is 1.1, and for group 3 the entry under StDev is .8. The weight
    variable then equals 1/1.52 for all observations in group 1, 1/1.12 for all
    observations in group 2, and 1/.82 for all observations in group 3.
    Say we decided to pursue the heteroscedasticity based on the number of cylinders of
the auto. Here are standardized residuals from the OLS regression separated by cylinders:

  Descriptive Statistics: SRES3 by Cylinders

  Variable      Cylinder          N        Mean        Median        TrMean          StDev
  SRES3         4                21      -0.057        -0.331        -0.146          1.088
                6                47       0.049        -0.085         0.027          0.860
                8                17      -0.070        -0.251        -0.052          1.314

 c 2009, Jeffrey S. Simonoff                                                                   17
  Variable      Cylinder     SE Mean       Minimum       Maximum            Q1            Q3
  SRES3         4              0.238        -1.271         2.845        -0.996         0.621
                6              0.125        -1.659         2.535        -0.551         0.712
                8              0.319        -2.836         2.423        -1.032         0.996

    The weight variable would be constructed using Minitab as follows:

        Let WT = (Cylinders = 4)/(1.088 ∗ 1.088) + (Cylinders = 6)/(.86 ∗ .86)

                    +(Cylinders = 8)/(1.314 ∗ 1.314)

Here is WLS output using this weight variable:

    Regression Analysis: Logged price versus Horsepower, Cylinders

    Weighted analysis using weights in wt

    The regression equation is
    Logged price = 3.90 + 0.00192 Horsepower + 0.0197 Cylinders

    Predictor            Coef          SE Coef             T        P            VIF
    Constant          3.90139          0.04519         86.33    0.000
    Horsepow        0.0019177        0.0001609         11.92    0.000            1.5
    Cylinder         0.019680         0.009389          2.10    0.039            1.5

    S = 0.07866          R-Sq = 76.7%            R-Sq(adj) = 76.1%

    Analysis of Variance

    Source                 DF             SS              MS         F          P
    Regression              2        1.66869         0.83434    134.84      0.000
    Residual Error         82        0.50740         0.00619
    Total                  84        2.17609

    The results are very similar to the earlier WLS model, although the heteroscedasticity
isn’t taken care of quite as well:

c 2009, Jeffrey S. Simonoff                                                                      18
    Note, by the way, that using standard deviations this way avoids the need for the
assumption of a specific model for the variances, as in equation (1), which is a good thing.
However, this method can only be used if the variable that determines the variances is
categorical (more generally, this method only works if all variables that determine the
variances are categorical). If any of the variables that determine variances are numerical,
you must use the regression–based method of estimating weights, either using the regres-
sion program, or (if any of the variables are categorical) the general linear model program
that we will be discussing later.

    There is another mechanism by which variances of errors in a regression model can
be different. Say that the response variable at the level of an individual follows the usual
regression model,

                             yi = β0 + β1 xi1 + · · · + βp xpi + εi ,

with εi ∼ N (0, σ 2 ). Imagine, however, that the ith observed response is actually an average
y i for a sample of size ni with the observed predictor values {x1i , . . . , xpi }. The model is

c 2009, Jeffrey S. Simonoff                                                                     19
                              y i = β0 + β1 xi1 + · · · + βp xpi + εi ,

                              V (˜i ) = V (y i |{x1i , . . . , xpi }) =      .
An example of this kind of situation could be as follows. Say you were interested in
modeling the relationship between student test scores and (among other things) income.
While it might be possible to obtain test scores at the level of individual students, it would
be impossible to get incomes at that level because of privacy issues. On the other hand,
average incomes at the level of census tract or school district might be available, and could
be used to predict average test scores at that same level.
       This is just a standard heteroscedasticity model, and WLS is used to fit it. In fact,
this is a particularly simple case, since the weights do not need to be estimated at all;
since V (˜i ) = σ 2 /ni , the weight for the ith observation is just ni . That is, quite naturally,
observations based on larger samples are weighted more heavily in estimating the regression
       It should be noted, however, that this sort of aggregation is not without problems.
Inferences made from aggregated data about individuals are called ecological inferences,
and they can be very misleading. As is always the case, we must be aware of confounding
effects of missing predictors; for example, if school districts with wealthier residents also
have lower proportions of non-native English speakers, a positive slope for income could be
reflecting an English speaker effect, rather than an income effect. In addition, ecological
inferences potentially suffer from aggregation bias, whereby the information lost when
aggregating (as it is clear that some information will be lost) is different for some individuals
than for others (for example, if there is more variability in incomes in some school districts
compared to others, more information is lost in those school districts), resulting in biased

c 2009, Jeffrey S. Simonoff                                                                       20

To top