Lecture 9: Multiple Regression
I. The Functional Form of the Relationship
Now we’ll be looking at relationships in which there is more than one explanatory
variable. The usual hypothesized relationship is this:
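$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \varepsilon$$
In order to have all the coefficients look the same, sometimes we use this instead:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \varepsilon$$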
But the idea is the same regardless: we are trying to estimate an intercept term plus the
coefficient on each explanatory variable.
As with single regression, the functional form here assumes a linear relationship. And
the same assumptions about the error term apply.
The best-fit line, which is our predicted relationship, is like so:
$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots$$
There is a formula for the OLS estimate for the parameters, but we won’t learn it here. It
is similar to the one for simple regression, but more complicated.
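Although we skip the formula, it may help to see what the computer does with it. The following is a minimal sketch, assuming NumPy is available and using made-up data (the coefficients 1, 2, -3, and 0.5 are chosen purely for illustration). It builds a matrix holding a column of ones plus the explanatory variables and asks for the least-squares solution, which gives the intercept a and the slopes b1, b2, b3 of the best-fit relationship.

```python
import numpy as np

# Hypothetical data: y = 1 + 2*x1 - 3*x2 + 0.5*x3 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_intercept = 1.0
true_slopes = np.array([2.0, -3.0, 0.5])
y = true_intercept + X @ true_slopes + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones (for the intercept) plus the x columns
design = np.column_stack([np.ones(n), X])

# Least-squares solution: minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coef[0], coef[1:]

# Predicted values from the best-fit relationship
y_hat = design @ coef

print("intercept a:", a)
print("slopes b1..b3:", b)
```

Because the noise is small here, the estimates should land close to the values used to generate the data.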
II. The Regression Output
The regression output can be interpreted in the same way as with simple regression. We
can perform hypothesis tests and create confidence intervals in the same way.
As with the simple regression, we can look at R-squared to find out how much of the
variation in the dependent variable is explained by variation in the independent variables.
However, you should note that the R-squared never decreases (and almost always
increases) when we add more explanatory variables, even irrelevant ones, so you should
not assume that a higher R-squared means the explanatory variables you’ve added are
necessarily important ones. The adjusted R-squared, also given in the regression output,
takes this effect into account by penalizing each added variable. Even so, you need to
look at the significance of each variable to see its effect.
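To see this behavior of R-squared concretely, here is a small sketch, again assuming NumPy and made-up data. The helper name r_squared and the variable junk are hypothetical, introduced only for this illustration. It fits one regression with a useful variable and another that also includes a variable unrelated to y, and computes R-squared and adjusted R-squared for each using the usual textbook definitions.

```python
import numpy as np

def r_squared(design, y):
    """Return (R^2, adjusted R^2) for an OLS fit; design includes the ones column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = np.sum(resid ** 2)              # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    n, p = design.shape                      # p counts the intercept column
    k = p - 1                                # number of explanatory variables
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                    # unrelated to y by construction
y = 3 + 2 * x1 + rng.normal(size=n)

ones = np.ones(n)
r2_small, adj_small = r_squared(np.column_stack([ones, x1]), y)
r2_big, adj_big = r_squared(np.column_stack([ones, x1, junk]), y)

print("without junk:", r2_small, adj_small)
print("with junk:   ", r2_big, adj_big)
```

The R-squared of the larger model cannot be lower than that of the smaller one, while the adjusted R-squared may go either way depending on how much the extra variable happens to help.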
There is another hypothesis test you might be interested in. What if you want to know
whether the whole predicted relationship (not just the effect of one variable) is
significant? The null hypothesis is that all the slope coefficients (every coefficient
other than the intercept) are equal to zero; the alternative is that at least one of them is
not equal to zero. The statistic to consider is called the F-statistic. We won’t discuss
how this statistic is constructed. But it works much the same as a t-statistic; the larger
its value, the greater the significance. The “Significance F” in the regression output
tells you the F-statistic’s equivalent of the t-statistic’s p-value. It tells you the smallest
significance level (alpha, the probability of a Type I error) at which you could still
reject the null hypothesis.
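For the curious, the standard textbook construction compares explained variation per explanatory variable with unexplained variation per remaining degree of freedom. The sketch below assumes NumPy and made-up data with k = 3 explanatory variables. It computes the F-statistic two equivalent ways: from the sums of squares, and from R-squared. Turning the statistic into a "Significance F" requires the F distribution with k and n - k - 1 degrees of freedom, which is left to the regression software here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 60, 3                                  # observations, explanatory variables
X = rng.normal(size=(n, k))
# Hypothetical truth: the second variable has no effect at all
y = 1 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef

ss_res = np.sum(resid ** 2)                   # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation
ss_reg = ss_tot - ss_res                      # explained variation

# F = (explained per variable) / (unexplained per residual degree of freedom)
f_stat = (ss_reg / k) / (ss_res / (n - k - 1))

# Same statistic written in terms of R-squared
r2 = 1 - ss_res / ss_tot
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print("F-statistic:", f_stat)
```

A large F here reflects that at least one slope is far from zero, even though one of the three variables has no effect by construction.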
Example: The publicexpend data set gives the amount of public expenditures per capita
in each of the lower-48 states, as well as various possible explanatory variables. Run a
regression of expenditures on the economic ability index, metropolitan population
percentage, population growth rate, youth population percentage, and elderly population
percentage.
III. Dummy Variables
Some data – what we have called nominal data – does not take numerical form. Yet we
might think such variables have an important effect on the dependent variable. How can
we take these into account?
We use what is called a “dummy” variable. The dummy is set equal to 1 if the
observation has a particular characteristic, and it’s set equal to 0 otherwise. For example,
if your observations are people, you could have a FEMALE dummy that equals 1 for a
female observation and 0 for a male observation.
What you’re essentially doing is treating the 0-type as the default group. The regression
results tell you the relationship for that default group. The coefficient on the dummy tells
you the effect that membership in the other (1-type) group has on the dependent variable,
holding the other explanatory variables fixed. You can think of this as the increase in
the intercept for that group. That is, the coefficient on “intercept” is the vertical
intercept for the default group; to get the intercept for the other group, add the
coefficient on the dummy to the “intercept” coefficient.
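The intercept-shift interpretation can be checked on made-up data. In this sketch (assuming NumPy; the numbers 20, 3, and 5 are hypothetical), both groups share the same slope on x, but observations with FEMALE = 1 have an intercept that is 5 higher. The regression should recover an "intercept" near 20 for the default group and a dummy coefficient near 5.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 10, size=n)
female = rng.integers(0, 2, size=n)           # 1 = female, 0 = male (default group)

# Hypothetical truth: same slope for everyone, intercept 5 higher when FEMALE = 1
y = 20 + 3 * x + 5 * female + rng.normal(scale=0.5, size=n)

# Columns: intercept, x, FEMALE dummy
design = np.column_stack([np.ones(n), x, female])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b_x, b_female = coef

intercept_default = a                         # intercept for the 0-type group
intercept_female = a + b_female               # add the dummy coefficient

print("default-group intercept:", intercept_default)
print("female-group intercept: ", intercept_female)
print("slope on x:             ", b_x)
```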
Example: The publicexpend data set again. Look at the residuals from the previous
regression; you might notice that Western states seem more likely to have positive
residuals. So we create a dummy variable, WEST, that codes for Western states. Then
we run the regression with this as an additional variable. The coefficient on WEST is
35.47, meaning that Western states tend to spend $35.47 more per capita than non-
Western states (controlling for the other variables we’re looking at).
You can have multiple dummies. For instance, you might deal with race by coding for
BLACK and ASIAN. If an observation is neither black nor Asian, the value of both
dummies is 0.
Note 1: You must pick one group as your default. It does not matter which. But your
regression will not work if you create a dummy for every group, because the dummies
would then add up to 1 for every observation and be perfectly collinear with the
intercept.
Note 2: For m different groups, you need m – 1 dummies. You can’t do it with a single
variable that takes several values. If, for example, you create one race variable coded as
1 for blacks and 2 for Asians, you’re effectively assuming that the effect of being Asian
is exactly “twice” the effect of being black (whatever that means).
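One way to see why a dummy for every group fails is to look at the columns the regression would have to work with. The sketch below assumes NumPy and uses eight made-up observations in three groups (the label "other" for the default group is a hypothetical choice). With m - 1 dummies plus the intercept column, the columns are linearly independent. Adding a dummy for the default group as well makes the dummies sum to the intercept column, so the matrix loses full rank and the coefficients are no longer uniquely determined.

```python
import numpy as np

# Hypothetical group labels for eight observations, three groups (m = 3)
groups = np.array(["other", "black", "asian", "black",
                   "other", "asian", "other", "black"])
ones = np.ones(len(groups))

black = (groups == "black").astype(float)
asian = (groups == "asian").astype(float)
other = (groups == "other").astype(float)     # dummy for the default group

# m - 1 = 2 dummies plus the intercept column: fine
ok_design = np.column_stack([ones, black, asian])

# A dummy for every group plus the intercept column: redundant
trap_design = np.column_stack([ones, black, asian, other])

rank_ok = np.linalg.matrix_rank(ok_design)
rank_trap = np.linalg.matrix_rank(trap_design)

print("columns:", ok_design.shape[1], "rank:", rank_ok)
print("columns:", trap_design.shape[1], "rank:", rank_trap)
```

When the rank is smaller than the number of columns, one column carries no new information, which is why regression software either refuses to run or drops a variable.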
IV. Quadratic Functional Forms
While the functional form given earlier is linear, we can actually use it to estimate some
non-linear relationships. All we have to do is transform the explanatory variables
appropriately. For instance, consider the quadratic function:
$$y = 2x^2 + 10x + 5$$
This is not linear. But what if we just think of $x^2$ as another variable? We could rewrite
the above like so:
$$y = 2x_1 + 10x_2 + 5$$
where $x_1$ is the squared value of x and $x_2$ is just the unmodified value of x. Notice that
this is a linear relationship. Similarly, we can do the same thing with a quadratic function
whose coefficients we don’t know, and use OLS to estimate those coefficients.
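As a quick check of this idea, the sketch below (assuming NumPy) generates points that lie exactly on a quadratic with coefficients 2, 10, and 5, chosen here for illustration. It then pretends not to know them, creates the squared variable, and runs an ordinary linear least-squares fit on the two variables.

```python
import numpy as np

x = np.linspace(-5, 5, 21)
y = 2 * x**2 + 10 * x + 5                     # exact quadratic, no noise

x1 = x**2                                     # new variable: x squared
x2 = x                                        # the unmodified x

# Linear regression of y on x1 and x2 (plus an intercept column)
design = np.column_stack([np.ones_like(x), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b1, b2 = coef

print("intercept:", a)
print("coefficient on x squared:", b1)
print("coefficient on x:", b2)
```

Since the data contain no noise, the fit should return the generating coefficients up to rounding error. With real data the estimates would come with standard errors like any other regression.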
Example: The mileage1 data set. This data set has the miles per gallon and weight of 38
vehicles. Run a simple regression first. We get a statistically significant and negative
relationship. But looking at the residuals and line-fit plots, we observe a U-shape, which
implies the possible existence of a quadratic relationship. How can we estimate this
relationship? Create a new variable that is the weight squared. Run a new regression on
both weight and weight-squared. The results are significant for both coefficients, and the
line-fits and residuals don’t display as much of a pattern. (However, be careful in your
interpretation and extension of the results. The coefficient on weight squared is positive,
which means for high enough weight values, it could appear that weight increases miles
per gallon – hard to believe. Notice that we don’t actually have any values for weight in
the range that would allegedly produce this result. The quadratic form is a good fit for
the data range we have, but probably not outside that range.)
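To find where the fitted curve would start to turn upward, write the fitted relationship with w for weight, and assume (as the discussion above suggests) that the coefficient on weight is negative and the coefficient on weight squared is positive:

$$\widehat{mpg} = a + b_1 w + b_2 w^2, \qquad b_1 < 0, \; b_2 > 0$$

The predicted change in miles per gallon for a small increase in weight is the slope of this curve:

$$\frac{d\,\widehat{mpg}}{dw} = b_1 + 2 b_2 w$$

Setting the slope equal to zero gives the turning point:

$$w^* = -\frac{b_1}{2 b_2}$$

For weights below $w^*$ the fitted curve slopes downward; above $w^*$ it slopes upward. Plugging in the estimated coefficients and comparing $w^*$ with the largest weight in the data is a simple way to check whether the implausible upward-sloping region lies outside the range we actually observe.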
Example: The publicexpend data set again. Look at the residuals and line-fit plots for
MET. Note the U-shape. It appears that public spending is highest in states with very
low and very high metropolitanization, and lowest in states with moderate
metropolitanization. (Why might this be? Maybe there are economies and diseconomies
of scale in dealing with metropolitan populations. Or maybe states with both urban and
rural populations are less likely to reach legislative consensus that will lead to greater
public spending.) How can we estimate this relationship? We create a new variable that
is MET squared and run the regression again. In the results, notice that both MET and
MET-squared have significant coefficients, and the residuals and line-fit plots don’t
display as much of a pattern.