Weighted least squares

Heteroscedasticity that is related to the level of the target (y) variable can be handled using variance-stabilizing transformations. Sometimes, however, nonconstant variance is driven by other characteristics in the data. In that case, weighted least squares is the appropriate solution. The idea behind weighted least squares (WLS) is that least squares is still a good thing to do if the target and predicting variables are transformed to give a model with errors that have constant variance. Say V(ε_i) = σ_i^2. To keep things simple, consider simple regression. The regression model is

    y_i = β_0 + β_1 x_i + ε_i.

If we divide both sides of this equation by σ_i we get

    y_i/σ_i = β_0 (1/σ_i) + β_1 (x_i/σ_i) + ε_i/σ_i.

This can be rewritten

    y_i* = β_0 z_1i + β_1 z_2i + δ_i,

where y_i*, z_1i, z_2i, and δ_i are the obvious substitutions from the previous equation and V(δ_i) = 1 for all i. Thus, ordinary least squares (OLS) estimation (without an intercept term) of y* on z_1 and z_2 gives fully efficient estimates of β_0 and β_1. Note that using a constant multiple of σ_i works just as well, since the only requirement is that V(δ_i) be constant for all i. The value WT_i = 1/σ_i^2 is the ith value of the weighting variable. Ordinary least squares is a special case of WLS with WT_i = 1 for all i (and, in fact, most regression packages only include code for WLS, with OLS the default special case). The problem is that σ_i^2 is unknown, and must be estimated.

It is important to recognize the reasons behind the use of weighted least squares. The goal is not to improve measures of fit like R^2 or F; rather, the goal is to analyze the data in an appropriate fashion. There are several advantages to addressing nonconstant variance:

© 2009, Jeffrey S. Simonoff

(1) The estimates of β are more efficient. That is, on average, the WLS estimates should be closer to the true regression coefficients than the OLS estimates are.

(2) Predictions are more sensible.
If the underlying variability of a certain type of observation is larger than that for another type of observation, the prediction interval should reflect that. This is not done under OLS, but it is under WLS.

Weighted least squares can be formulated more generally using matrix notation, since it is just a special case of generalized least squares. The model is

    y = Xβ + ε,

with E(ε) = 0 and Var(ε) = V σ^2, where V is the n×n matrix with σ_i^2/σ^2 on the diagonal and zeroes elsewhere. Matrix manipulations are then as follows:

    y = Xβ + ε
      ⇒ V^{-1/2} y = V^{-1/2} Xβ + V^{-1/2} ε
      ⇔ y* = Zβ + δ.

Thus,

    β̂ = (Z′Z)^{-1} Z′y* = (X′V^{-1}X)^{-1} X′V^{-1} y.

Addressing nonconstant variance: practical issues

Consider a regression analysis based on regressing y on x. Examination of residual plots (including plots of residuals versus each of the predictors x_j) suggests potential nonconstant variance (heteroscedasticity). What should you do? The following steps summarize a systematic approach to the problem. As we go along, we'll examine a real data set. The data have as a target variable the price of a 2003 model automobile, while the predicting variables are the horsepower of the auto's engine and the number of cylinders in the engine (source: Consumer Reports, April 2003; the values given refer to the top-of-the-line model). A scatter plot of the variables immediately suggests heteroscedasticity. Since the target variable here is long right-tailed and represents dollars, a reasonable correction is a variance-stabilizing transformation (taking logs, or a semilog model).

The logged plot looks much better, but there are still signs of nonconstant variance, particularly in the plot on horsepower.

Here are regression results.
Regression Analysis: Logged price versus Horsepower, Cylinders

The regression equation is
Logged price = 3.91 + 0.00184 Horsepower + 0.0218 Cylinders

Predictor       Coef    SE Coef      T      P    VIF
Constant     3.90626    0.03955  98.76  0.000
Horsepow   0.0018361  0.0001590  11.55  0.000    1.8
Cylinder    0.021834   0.008750   2.50  0.015    1.8

S = 0.08007   R-Sq = 79.7%   R-Sq(adj) = 79.2%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       2  2.0616  1.0308  160.77  0.000
Residual Error  82  0.5258  0.0064

The regression fits well (R^2 = 79.7%), and each variable is highly statistically significant. Here are residual plots:

These plots constitute step 1 in the analysis of potential heteroscedasticity:

(1) Save the standardized residuals from the regression (SRES, say). Plot them versus any variables that make sense. In this case there appears to be heteroscedasticity related to each of the variables. Note, by the way, that there are potential outliers in the data (standardized residuals around ±3), but they might not really be outliers; they might be far from the regression line because of higher variability off the line.

(2) Now, create two new variables: the absolute values of the standardized residuals (ABSRES, say), and the natural logarithm of the squares of the standardized residuals (LGSRESSQ, say). The latter variable can be formed in MINITAB using the transformation Let 'LGSRESSQ' = LN(SRES*SRES). A simple, but reasonably effective, test for nonconstant variance is Levene's test.

(3) Regress ABSRES on any variables you have that might be related to the nonconstant variance. This could include the x_j's, as well as other variables; indicator (0/1) variables are particularly appropriate if nonconstant variance is linked to group membership. You should avoid including variables that are not related to the nonconstant variance if possible, since their inclusion makes it harder to detect nonconstant variance if it exists. A significant F-statistic for Levene's test suggests the presence of nonconstant variance. Minitab includes a version of Levene's test (under ANOVA / Test for Equal Variances) if the predictors being used are only categorical; we'll talk more about that when we discuss analysis of variance models. It's a good idea to look at scatter plots of the absolute residuals versus the Levene's test predictors, to make sure that you're using the right predictors, and that unusual observations aren't unduly affecting the test (plots using the square root of the absolute residuals on the vertical axis are sometimes easier to interpret). DON'T include the target (y) variable as a predictor in the Levene's test.

Here are scatter plots of the absolute standardized residuals versus the two possible heteroscedasticity predictors. They show that the absolute residuals are larger on average (more variability off the regression line) for higher horsepower values, although the relationship with the number of cylinders is less clear.

Here is the Levene's test output:

Regression Analysis: absres versus Horsepower, Cylinders

The regression equation is
absres = 0.502 + 0.00256 Horsepower - 0.0529 Cylinders

Predictor      Coef   SE Coef      T      P    VIF
Constant     0.5015    0.2977   1.68  0.096
Horsepow   0.002560  0.001197   2.14  0.035    1.8
Cylinder   -0.05294   0.06587  -0.80  0.424    1.8

S = 0.6028   R-Sq = 6.0%   R-Sq(adj) = 3.8%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       2   1.9174  0.9587  2.64  0.078
Residual Error  82  29.7957  0.3634
Total           84  31.7130
The test is significant at a .10 level; in particular, it seems that horsepower alone accounts for the heteroscedasticity (p = .035):

Regression Analysis: absres versus Horsepower

The regression equation is
absres = 0.341 + 0.00192 Horsepower

Predictor       Coef    SE Coef     T      P
Constant      0.3413     0.2207  1.55  0.126
Horsepow   0.0019169  0.0008889  2.16  0.034

S = 0.6015   R-Sq = 5.3%   R-Sq(adj) = 4.2%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       1   1.6827  1.6827  4.65  0.034
Residual Error  83  30.0304  0.3618
Total           84  31.7130

Say nonconstant variance is indicated (this could be from the plots, or the Levene's test, or both; don't refer only to the formal test, as it is not guaranteed to always have enough power to identify nonconstant variance). The variances need to be estimated. An exponential/linear model for them is often used, whose parameters can be estimated from the data. The model is

    var(ε_i) = σ_i^2 = σ^2 exp(Σ_j λ_j z_ij),                    (1)

where z_ij is the value of the jth variance predictor for the ith case (these would presumably be the predictors that were used in part (3) for the Levene's test).

(4) The coefficient (λ_j) values can be estimated using regression. Perform a regression of LGSRESSQ on the variance predictor variables, and record the fitted regression coefficients (don't worry about measures of fit for this regression). DON'T include the target (y) variable as a potential predictor here. Save the fitted values from this regression.
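Steps (1), (2), and (4) amount to a pair of auxiliary regressions. The following numpy sketch illustrates them on synthetic data generated from model (1); the variable names and simulation settings are illustrative (this is not the auto data), and plain residuals stand in for standardized residuals to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(50, 300, n)                      # variance predictor (think horsepower)
sigma = np.exp(-1.0 + 0.004 * x)                 # true error s.d.; log-variance slope is 0.008
y = 4.0 + 0.002 * x + sigma * rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients and fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

# Step 1: fit OLS and save the residuals.
X = np.column_stack([np.ones(n), x])
_, fit = ols(X, y)
resid = y - fit

# Steps 2 and 4: form the log squared residuals and regress them on the
# variance predictor; the fitted values estimate log(sigma_i^2) up to a constant.
lam, log_var_fit = ols(X, np.log(resid**2))
# lam[1] estimates the lambda coefficient, here 2 * 0.004 = 0.008
```

The intercept of this auxiliary regression is biased (the expected log of a squared standardized error is not zero), but that only rescales all the variance estimates by a constant, which, as noted above, does not matter for the weights.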
Here are the results for the auto price data, using horsepower as the predictor:

Regression Analysis: lgsressq versus Horsepower

The regression equation is
lgsressq = - 2.50 + 0.00548 Horsepower

Predictor      Coef   SE Coef      T      P
Constant    -2.5020    0.7204  -3.47  0.001
Horsepow   0.005477  0.002901   1.89  0.063

S = 1.963   R-Sq = 4.1%   R-Sq(adj) = 3.0%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       1   13.734  13.734  3.56  0.063
Residual Error  83  319.915   3.854
Total           84  333.649

(5) Create a weight variable for use in the weighted least squares analysis. The weights are estimates of the inverse of the variance of the errors for each observation. They have the form

    WT = 1/exp(FITS1),

where FITS1 is the variable with fitted values from the regression in step 4. So, for the auto price data, the weight variable has the form

    WT = 1/exp(-2.50 + .0055 * Horsepower),

based on the regression output.

(6) Perform a weighted least squares regression, specifying WT as the weighting variable (click on Options, and enter WT under Weights:). You should redo steps 1-3 to make sure that the nonconstant variance has been corrected.

Here is the regression output for the WLS regression:

Regression Analysis: Logged price versus Horsepower, Cylinders

Weighted analysis using weights in wt

The regression equation is
Logged price = 3.87 + 0.00205 Horsepower + 0.0202 Cylinders

Predictor       Coef    SE Coef       T      P    VIF
Constant     3.86649    0.03763  102.74  0.000
Horsepow   0.0020550  0.0001834   11.20  0.000    1.8
Cylinder    0.020231   0.008575    2.36  0.021    1.8

S = 0.1412   R-Sq = 78.3%   R-Sq(adj) = 77.7%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       2  5.8836  2.9418  147.60  0.000
Residual Error  82  1.6343  0.0199
Total           84  7.5179

The nonconstant variance has apparently been addressed.
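Steps (5) and (6) do not require a special weights option: WLS is just OLS after each row of the regression is multiplied by the square root of its weight, as in the matrix formulation earlier. A self-contained numpy sketch on synthetic data (names and settings illustrative, not the auto data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(50, 300, n)
sigma = np.exp(-1.0 + 0.004 * x)                 # true nonconstant error s.d.
y = 4.0 + 0.002 * x + sigma * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Step 4: regress the log squared OLS residuals on the variance predictor.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
lam, *_ = np.linalg.lstsq(X, np.log(resid**2), rcond=None)

# Step 5: weights are reciprocals of the estimated error variances,
# WT = 1 / exp(FITS1).
wt = 1.0 / np.exp(X @ lam)

# Step 6: WLS = OLS on rows scaled by sqrt(weight); equivalently
# beta_hat = (X' V^-1 X)^-1 X' V^-1 y.
sw = np.sqrt(wt)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
# beta_wls targets the true coefficients (4.0, 0.002)
```

Scaling the rows by √wt is exactly the transformation to y* and Z used in the derivation at the start of these notes.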
This is supported by the Levene's test (note that you should redo the Levene's test based on the standardized residuals after doing WLS to check this, and remember that the Levene's test regression does not use weights):

Regression Analysis: absres versus Horsepower, Cylinders

The regression equation is
absres = 0.902 + 0.00135 Horsepower - 0.0717 Cylinders

Predictor      Coef   SE Coef      T      P    VIF
Constant     0.9025    0.3010   3.00  0.004
Horsepow   0.001349  0.001210   1.11  0.268    1.8
Cylinder   -0.07170   0.06660  -1.08  0.285    1.8

S = 0.6095   R-Sq = 1.7%   R-Sq(adj) = 0.0%

Analysis of Variance

Source          DF       SS      MS     F      P
Regression       2   0.5355  0.2677  0.72  0.489
Residual Error  82  30.4600  0.3715
Total           84  30.9955

There is apparently an outlier (the Saab 9-3, an expensive car with moderate horsepower and only 4 cylinders), but omitting it changes little. Overall, given the number of cylinders, larger engine cars cost more (each additional horsepower multiplying the expected price by 10^0.00206 = 1.005, a .5% increase), and given horsepower, cars with more cylinders cost more (two additional cylinders multiply the expected price by 10^0.0404 = 1.098, a 9.8% increase).

The R^2 value given by Minitab is not

    1 - (Residual SS)/(Total SS)

in this case, because of the weighting. In fact, there is no single universally accepted definition of R^2 for weighted least squares fitting. Proposed measures use relationships between R^2 and other statistics in OLS. One such measure is based on the closeness of the observed and fitted y values,

    R^2_w1 = corr(Fitted values, Target variable values)^2.

So, the fitted values from the WLS fit should be saved, and then the correlation coefficient calculated:

Correlations (Pearson)

Correlation of Logged price and FITS1 = 0.892

Thus, R^2_w1 = .892^2 = .796, which can be compared to the OLS value of .797. Another common measure (the one Minitab reports) uses the connection between F and R^2 in ordinary least squares regression,

    R^2_w2 = pF / (pF + n - p - 1).

Thus, the R^2 for the WLS fit above is

    R^2_w2 = (2)(147.6) / [(2)(147.6) + 82] = .783.

The two measures differ in that R^2_w2 downweights the effect of observations with higher variability on the measure of the strength of the fit, while R^2_w1 does not.

Another possible use of this model is to make predictions of prices from horsepower and cylinder values. For example, what would we have predicted the price for an Audi A6 to have been (this car is not in the sample)? Its horsepower was 455, with an 8-cylinder engine; here is Minitab output asking for a prediction interval based on these horsepower and cylinder values:

* WARNING * The prediction interval output assumes a weight of 1.
*           An adjustment must be made if a weight other than 1 is used.

New Obs     Fit  SE Fit        95.0% CI            95.0% PI
      1  4.9634  0.0353  ( 4.8932, 5.0335)  ( 4.6739, 5.2528)

Values of Predictors for New Observations

New Obs  Horsepow  Cylinder
      1       455      8.00

There are two things about this output that deserve comment. First, we must remember that the model was in semilog form, so the intervals given are in terms of logged price. That is, we must antilog the prediction interval to obtain the version in terms of dollars. There is another problem, however, that is highlighted by the warning message that Minitab gives you. Minitab does not allow a weight to be given for the prediction being made, which means that unless the appropriate weight turns out to be 1, the prediction interval is not correct (the standard error of the fitted value and confidence interval are still correct). We wish to use the weight that the Audi A6 would have been assigned if it had been included in the sample; from the model used to determine the weights, that is WT = 1/exp(-2.50 + .0055 * Horsepower), or 1/exp(-2.50 + .0055 * 455) = 1.025. The standard error of the fitted value (used for confidence intervals) is given correctly, but we need to calculate the standard error of the predicted value. This equals

    sqrt[(Standard error of fitted value)^2 + (Residual MS)/(Weight)].

Then, the prediction interval is

    Predicted value ± t_{α/2, n-p-1} (Standard error of predicted value).

The standard error of the predicted value in this case is sqrt(.0353^2 + .0199/1.025) = .1437. For n = 85 the appropriate critical value for a 95% interval is 1.989, giving prediction interval 4.9634 ± (1.989)(.1437) = (4.678, 5.249). Finally, antilogging gives the prediction interval for the Audi A6 price, ($47643, $177509) [no, that's not a misprint!], with a point estimate of the price being $91918 (the actual price was $58700, a good price for such a powerful car). The importance of using the WLS model is obvious when this prediction interval is compared to the one from the OLS model:

Predicted Values for New Observations

New Obs      Fit   SE Fit        95.0% CI            95.0% PI
      1  4.91635  0.02762  ( 4.86141, 4.97129)  ( 4.74785, 5.08485) X

X denotes a row with X values away from the center

The point estimate for price is not very different from that from the WLS model ($82480 rather than $91918), but antilogging the interval gives ($55956, $121577), which is slightly more than half the width of the WLS interval (a better comparison is that it is 40% narrower in the log scale). On the other hand, a prediction interval for a low horsepower auto would probably be narrower than that from OLS, correctly reflecting the lower variability off the regression line there. Note also that the Audi A6 is considered very unusual in the OLS prediction, because of its high horsepower, but this is not true in the WLS prediction, since it is downweighted so strongly.

Note that this same type of correction is needed even for rough prediction intervals. The value of s from the WLS output is not in any way comparable to that from an OLS fit. In an OLS fit, the rough prediction interval for any future observation is ±2σ̂, and s is used as σ̂.
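The weight-corrected prediction interval arithmetic is easy to reproduce. A short script using the numbers reported in the Minitab WLS output (the weight 1.025 is the value derived from the variance model in step 5):

```python
import math

# Values from the WLS output for the Audi A6 (455 hp, 8 cylinders).
fit = 4.9634       # fitted logged price
se_fit = 0.0353    # standard error of the fitted value
res_ms = 0.0199    # residual mean square from the WLS fit
wt = 1.025         # weight the Audi A6 would have received
t_crit = 1.989     # t critical value for 95% coverage, 82 df

# Standard error of the predicted value: fitted-value error plus the
# residual variance divided by the weight.
se_pred = math.sqrt(se_fit**2 + res_ms / wt)

# Interval on the log10 scale, then antilogged into dollars.
lo, hi = fit - t_crit * se_pred, fit + t_crit * se_pred
dollars = (10**lo, 10**hi)
```

This reproduces the standard error .1437 and, after antilogging, approximately the ($47643, $177509) interval given above (small differences come from rounding the interval endpoints before antilogging).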
In a WLS fit, the rough prediction interval for the ith future observation is ±2σ̂_i, requiring a (potentially) different estimate of the standard deviation for each prediction. This corresponds to ±2s/√w_i, where w_i is the appropriate weight (this corresponds to the exact interval constructed above, except that 2 is used instead of the exact t-based critical value, and the standard error of the fitted value is ignored). Thus, s does not have any physical interpretation in a WLS model unless it is paired with an appropriate weight value.

Steps (4) and (5) are more straightforward in the special case where the nonconstant variance is being driven by the existence of two or more subgroups in the data, as follows:

(4′) Create a variable that defines the subgroups by appropriate codes (e.g., 1, 2, 3, etc.); call this variable GROUP. Now, construct descriptive statistics of the standardized residuals variable SRES, entering GROUP under the By variable option.

(5′) The weight variable for the weighted least squares analysis contains estimates of 1 over the variance of the errors for each observation. These are obtained from the output in (4′) by looking under StDev for each group. So, for example, say there were three groups in the data coded 1, 2 and 3, and for group 1 the entry under StDev is 1.5, for group 2 the entry under StDev is 1.1, and for group 3 the entry under StDev is .8. The weight variable then equals 1/1.5^2 for all observations in group 1, 1/1.1^2 for all observations in group 2, and 1/.8^2 for all observations in group 3.

Say we decided to pursue the heteroscedasticity based on the number of cylinders of the auto. Here are standardized residuals from the OLS regression separated by cylinders:

Descriptive Statistics: SRES3 by Cylinders

Variable  Cylinder   N    Mean  Median  TrMean  StDev
SRES3            4  21  -0.057  -0.331  -0.146  1.088
                 6  47   0.049  -0.085   0.027  0.860
                 8  17  -0.070  -0.251  -0.052  1.314
Variable  Cylinder  SE Mean  Minimum  Maximum      Q1     Q3
SRES3            4    0.238   -1.271    2.845  -0.996  0.621
                 6    0.125   -1.659    2.535  -0.551  0.712
                 8    0.319   -2.836    2.423  -1.032  0.996

The weight variable would be constructed using Minitab as follows:

Let WT = (Cylinders = 4)/(1.088 * 1.088) + (Cylinders = 6)/(.86 * .86)
       + (Cylinders = 8)/(1.314 * 1.314)

Here is WLS output using this weight variable:

Regression Analysis: Logged price versus Horsepower, Cylinders

Weighted analysis using weights in wt

The regression equation is
Logged price = 3.90 + 0.00192 Horsepower + 0.0197 Cylinders

Predictor       Coef    SE Coef      T      P    VIF
Constant     3.90139    0.04519  86.33  0.000
Horsepow   0.0019177  0.0001609  11.92  0.000    1.5
Cylinder    0.019680   0.009389   2.10  0.039    1.5

S = 0.07866   R-Sq = 76.7%   R-Sq(adj) = 76.1%

Analysis of Variance

Source          DF       SS       MS       F      P
Regression       2  1.66869  0.83434  134.84  0.000
Residual Error  82  0.50740  0.00619
Total           84  2.17609

The results are very similar to the earlier WLS model, although the heteroscedasticity isn't taken care of quite as well.

Note, by the way, that using standard deviations this way avoids the need for the assumption of a specific model for the variances, as in equation (1), which is a good thing. However, this method can only be used if the variable that determines the variances is categorical (more generally, this method only works if all variables that determine the variances are categorical). If any of the variables that determine variances are numerical, you must use the regression-based method of estimating weights, either using the regression program, or (if any of the variables are categorical) the general linear model program that we will be discussing later.

There is another mechanism by which the variances of errors in a regression model can be different. Say that the response variable at the level of an individual follows the usual regression model,

    y_i = β_0 + β_1 x_1i + ... + β_p x_pi + ε_i,

with ε_i ~ N(0, σ^2).
Imagine, however, that the ith observed response is actually an average ȳ_i for a sample of size n_i with the observed predictor values {x_1i, ..., x_pi}. The model is thus

    ȳ_i = β_0 + β_1 x_1i + ... + β_p x_pi + ε̃_i,

where

    V(ε̃_i) = V(ȳ_i | {x_1i, ..., x_pi}) = σ^2/n_i.

An example of this kind of situation could be as follows. Say you were interested in modeling the relationship between student test scores and (among other things) income. While it might be possible to obtain test scores at the level of individual students, it would be impossible to get incomes at that level because of privacy issues. On the other hand, average incomes at the level of census tract or school district might be available, and could be used to predict average test scores at that same level.

This is just a standard heteroscedasticity model, and WLS is used to fit it. In fact, this is a particularly simple case, since the weights do not need to be estimated at all; since V(ε̃_i) = σ^2/n_i, the weight for the ith observation is just n_i. That is, quite naturally, observations based on larger samples are weighted more heavily in estimating the regression coefficients.

It should be noted, however, that this sort of aggregation is not without problems. Inferences made from aggregated data about individuals are called ecological inferences, and they can be very misleading. As is always the case, we must be aware of confounding effects of missing predictors; for example, if school districts with wealthier residents also have lower proportions of non-native English speakers, a positive slope for income could be reflecting an English-speaker effect, rather than an income effect.
In addition, ecological inferences potentially suffer from aggregation bias, whereby the information lost when aggregating (as it is clear that some information will be lost) is different for some individuals than for others (for example, if there is more variability in incomes in some school districts compared to others, more information is lost in those school districts), resulting in biased inferences.
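Returning to the averaged-response model itself: it is easy to check by simulation that WLS on the group means, with the known weights n_i, recovers the individual-level coefficients. A minimal numpy sketch (synthetic data; the names and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 300                                # number of aggregated units (districts)
n_i = rng.integers(5, 200, size=m)     # individuals averaged in each district
x = rng.uniform(30, 120, size=m)       # district-level predictor (think mean income)

# Individual-level model y = 10 + 0.5 x + eps with eps ~ N(0, 2^2), so the
# observed district mean has variance 2^2 / n_i.
ybar = 10 + 0.5 * x + 2.0 * rng.normal(size=m) / np.sqrt(n_i)

# WLS with known weights n_i: scale each row by sqrt(n_i), then run OLS.
X = np.column_stack([np.ones(m), x])
sw = np.sqrt(n_i)
beta, *_ = np.linalg.lstsq(X * sw[:, None], ybar * sw, rcond=None)
# beta targets the individual-level coefficients (10, 0.5)
```

Note that the simulation recovers the coefficients only because the individual-level model is the one generating the data; it does not protect against the confounding and aggregation-bias problems described above.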