Multiple Regression 1
Alternate Types of Regression: Multiple Regression
Jason Parrott
5-26-03
AP Statistics
Simple regression techniques use two variables to predict a line of best fit for a set of data. This is not always reliable, as there are often many conditions that should be accounted for when drawing a line of best fit. A good example is the size of a deer population. The X variable, average temperature in degrees Celsius, can be related to the response variable Y, the number of deer in a given area. Using these two variables to create a line of best fit may provide an accurate method for predicting the number of deer given an average temperature. However, there are many other relevant variables, such as the amount of grass in the area, how harsh the winter is, the number of predators in the area, and so on. Multiple regression can be used to account for these extra variables. The advantages of using multiple regression are numerous: if more relevant data is included, a more accurate and reliable prediction can be made.

The phrase “multiple regression coefficients” first appeared in the 1903 Biometrika paper “The Law of Ancestral Heredity” by Karl Pearson. From around 1895, Pearson and G. U. Yule had worked on multiple regression, and the phrase “double regression” appeared in Pearson’s paper “Mathematical Contributions to the Theory of Evolution.” He defined the coefficient of regression as the ratio of the mean deviation of the fraternity from the mean offspring to the deviation of the parentage from the mean parent. Other contributions to statistical mathematics by Karl Pearson include the standard deviation and the chi-square test.
Assumptions, Limitations, Practical Considerations
Multiple regression shares all the assumptions of other types of correlation analysis. The relationships between variables must be linear. Other types of regression can be used to model quadratic, exponential, cubic, and other forms, but these are beyond the scope of this paper. Next, the level of relationship must be the same throughout the range of the independent variable. Also, the data must be interval or near-interval, and its range must not be truncated. Finally, each variable should be chosen to fit the model being tested. The exclusion of important contributing variables or the inclusion of extraneous variables can change the line of best fit and produce erroneous predictions.
The major conceptual limitation of all regression techniques is that only relationships can be determined; a specific cause of the shape of the line of best fit cannot be found. For example, one would find a strong positive correlation between the height of a flood and the number of workers helping to stop the water. Would it be reasonable to conclude that the workers cause the flood waters to rise? Of course not; the most likely explanation is that more workers join to help as the height of the water, and the likelihood of damage, increases. The likelihood of damage is an external variable that was omitted in this example. Even though this explanation is fairly obvious, it cannot be determined by multiple regression alone.
Two practical considerations that should be noted are the choice of the number of variables and the importance of residual analysis. The number of variables used can significantly affect the outcome of the regression model. Each variable must be chosen carefully to make certain that the variable is not extraneous. Also, as more variables are added, more observations need to be obtained. It is generally agreed that one should have at least 10 to 20 times as many observations as variables; otherwise, the estimates of the regression line are probably unstable and unlikely to replicate if the study were repeated.
Statisticians interested in testing multiple independent variables sometimes perform several regressions based on the following models:
y = ß0 + ß1x1 + e
y = ß0 + ß2x2 + e
y = ß0 + ß3x3 + e

Here, y represents the variable to be forecast, or the response variable. The x variables represent the variables on which the predictions are based; these are often called the explanatory variables. Beta (ß), in this case, is the coefficient before each of the explanatory variables.
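As a sketch, one of these single-explanatory-variable models can be fit in C++ (the language of the attached source code). The function name and sample data below are illustrative and are not taken from the attached program:

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Fit y = b0 + b1*x by ordinary least squares for one explanatory variable.
// Returns the pair {b0, b1} (intercept, slope).
std::pair<double, double> fitSimple(const std::vector<double>& x,
                                    const std::vector<double>& y)
{
    const double n = static_cast<double>(x.size());
    const double meanX = std::accumulate(x.begin(), x.end(), 0.0) / n;
    const double meanY = std::accumulate(y.begin(), y.end(), 0.0) / n;

    double sxy = 0.0, sxx = 0.0;  // sums of cross-products and squared deviations
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - meanX) * (y[i] - meanY);
        sxx += (x[i] - meanX) * (x[i] - meanX);
    }
    const double b1 = sxy / sxx;           // slope
    const double b0 = meanY - b1 * meanX;  // intercept
    return {b0, b1};
}
```

For data lying exactly on a line, such as y = 1 + 2x, the function recovers the intercept 1 and slope 2.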
It is possible for the above models to be used, but it is also possible that the independent variables could obscure each other’s effects. For example, an animal’s mass could be a function of both age and diet. The age effect might override the diet effect, leading to a regression for diet which would not appear very interesting. One possible solution is the model for multiple regression, which allows for the simultaneous testing and modeling of multiple independent variables. The standard notation for expressing linear relationships among more than two variables takes the form:
y = ß0 + ß1x1 + ß2x2 + …+ ßkxk + e
We wish to estimate ß0, ß1, ß2, etc. by obtaining:
ŷ = b0 + b1x1 + b2x2 + b3x3 + ...
The b’s are termed the “regression coefficients.” Instead of fitting a line to the data, this can be thought of as fitting a plane (for two independent variables) or a higher-dimensional surface (for three or more independent variables). Figure 1 shows how a plane can be used to describe a set of data in a three-dimensional space. The estimation can still be done according to the principles of linear least squares.
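Finding the least-squares estimates for several explanatory variables amounts to solving the “normal equations” (X′X)b = X′y. A minimal C++ sketch of this, using Gaussian elimination (the function name and elimination routine are illustrative, not taken from the attached program):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve the normal equations (X'X) b = X'y for the least-squares
// coefficients b0..bk. Each row of `x` holds one observation's explanatory
// values x1..xk; a leading 1 for the intercept is added internally.
std::vector<double> fitMultiple(const std::vector<std::vector<double>>& x,
                                const std::vector<double>& y)
{
    const std::size_t n = x.size();         // number of observations
    const std::size_t p = x[0].size() + 1;  // coefficients incl. intercept

    // Build the augmented matrix [X'X | X'y] directly from the data.
    std::vector<std::vector<double>> a(p, std::vector<double>(p + 1, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<double> row(p, 1.0);  // row = (1, x1, ..., xk)
        for (std::size_t j = 1; j < p; ++j) row[j] = x[i][j - 1];
        for (std::size_t r = 0; r < p; ++r) {
            for (std::size_t c = 0; c < p; ++c) a[r][c] += row[r] * row[c];
            a[r][p] += row[r] * y[i];     // augmented column X'y
        }
    }

    // Gauss-Jordan elimination with partial pivoting.
    for (std::size_t col = 0; col < p; ++col) {
        std::size_t piv = col;
        for (std::size_t r = col + 1; r < p; ++r)
            if (std::abs(a[r][col]) > std::abs(a[piv][col])) piv = r;
        std::swap(a[col], a[piv]);
        for (std::size_t r = 0; r < p; ++r) {
            if (r == col) continue;
            const double f = a[r][col] / a[col][col];
            for (std::size_t c = col; c <= p; ++c) a[r][c] -= f * a[col][c];
        }
    }

    std::vector<double> b(p);
    for (std::size_t r = 0; r < p; ++r) b[r] = a[r][p] / a[r][r];
    return b;  // (b0, b1, ..., bk)
}
```

For data generated exactly by y = 1 + 2x1 + 3x2, the returned coefficients are (1, 2, 3).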
The least-squares criterion says that the sum of [y – (b0 + b1x1 + b2x2 + b3x3 + ...)]² over all observations should be made as small as possible. This can be thought of as trying to make each deviation of an observed y value from the fitted value ŷ as small as possible. However, this minimization no longer takes place along a single line; it must be thought of in several dimensions. Here is where the computer comes in handy.
The fit of a least-squares regression can be described with the coefficient of multiple determination. This coefficient has essentially the same meaning as the coefficient of determination in simple regression. If it were .974, about 97.4% of the variation of the response variable can be explained by the least-squares equation and the corresponding joint variation of the explanatory variables taken together. The remaining 2.6% of the variation is due to random chance or to variables not included in the regression equation.
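The coefficient of multiple determination can be computed as R² = 1 – SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares about the mean of y. A small C++ sketch (function name illustrative, not from the attached program):

```cpp
#include <cstddef>
#include <vector>

// Coefficient of multiple determination: R^2 = 1 - SSE/SST, where SSE is
// the sum of squared residuals (y - yhat) and SST is the sum of squared
// deviations of y about its mean.
double rSquared(const std::vector<double>& y, const std::vector<double>& yhat)
{
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= static_cast<double>(y.size());

    double sse = 0.0, sst = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        sse += (y[i] - yhat[i]) * (y[i] - yhat[i]);
        sst += (y[i] - meanY) * (y[i] - meanY);
    }
    return 1.0 - sse / sst;
}
```

A perfect fit (yhat equal to y) gives R² = 1; the closer the fitted values track the data, the closer R² is to 1.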
One can test a coefficient for significance using a simple t-test, specifically testing the null hypothesis that the regression coefficient is zero (H0: ßi = 0). The test statistic is:

t = (bi – ßi) / Si
where bi is the coefficient found, ßi is the theoretical coefficient, and Si is the standard error of the coefficient. The Student’s t distribution has degrees of freedom d.f. = n – k – 1, where n is the number of data points and k is the number of explanatory variables in the least-squares equation.
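Under the null hypothesis the theoretical ßi is zero, so the statistic reduces to bi / Si. As a direct transcription of the formulas above (function names illustrative, not from the attached program):

```cpp
// t statistic for testing H0: beta_i = 0, given the fitted coefficient bi
// and its standard error si. With beta_i hypothesized to be zero, the
// numerator (bi - beta_i) is simply bi.
double tStat(double bi, double si) { return bi / si; }

// Degrees of freedom for the Student's t distribution: n data points,
// k explanatory variables, so d.f. = n - k - 1.
int degreesOfFreedom(int n, int k) { return n - k - 1; }
```

For example, a coefficient of 2.0 with a standard error of 0.5 gives t = 4.0; with 30 observations and 3 explanatory variables, the test uses 26 degrees of freedom.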
Confidence intervals are found in much the same way:

bi – tSi < ßi < bi + tSi
Using the degrees of freedom described above to find the t value, a c% confidence interval can be found.
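The interval computation itself is a single formula once the critical t value has been looked up. A minimal sketch (function name illustrative, not from the attached program; the t value is assumed to come from a table or library):

```cpp
#include <utility>

// c% confidence interval for a coefficient: bi - t*Si < beta_i < bi + t*Si,
// where t is the critical value of the Student's t distribution with
// n - k - 1 degrees of freedom, passed in by the caller.
std::pair<double, double> confidenceInterval(double bi, double t, double si)
{
    return {bi - t * si, bi + t * si};
}
```

For example, a coefficient of 2.0 with standard error 0.5 and critical value t = 2.0 gives the interval (1.0, 3.0).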
Attached is a copy of the source code for finding the variables required to compute the regression model. Without understanding the C++ programming language, it is difficult to explain how the code works. However, the comments (denoted by a prefix of “//” or enclosed by “/*” and “*/”) provide the code with internal documentation. Compiling and running the code produces output in this format:
Regression Statistics
-------------------------------------------------
Multiple R        =
R Square          =
Adjusted R Square =
Standard Error    =
Durbin-Watson     =

ANOVA TABLE
                 df    SS    MS    F
-----------------------------------------------------
Regression
Residual
Total

                 Coefficient    Standard Error    t-stat
----------------------------------------------------------
Intercept
Beta [i]
The code is easy to read, provides a functional output, and is much more efficient than performing the computations by hand. A copy of the compiled program can be obtained by e-mailing firstname.lastname@example.org.
One can see that using multiple regression provides an accurate equation to represent a set of data. Using several relevant variables decreases the chance for error and increases the reliability of the equation itself. Assumptions need to be checked before utilizing this tool, and there are several considerations that should be taken into account before designing an experiment that will use this type of regression. Computing the regression equation is often difficult by hand and is best left to a computer. When more relevant data is included, a more accurate and reliable prediction can be made.
References
Brase, Charles. (1995). Understandable Statistics. Massachusetts: D. C. Heath and Company.
Chun, Ka. (1999). C++ Source Code. University of Hong Kong. 26 May 2003. <http://web.hku.hk/~h9802783/cpp/regression.html>.
Garson, Dave. (2003). Multiple Regression. North Carolina State University. 26 May 2003. <http://www2.chass.ncsu.edu/garson/pa765/regress.htm>.
Osborne, Jason W. (2000). Prediction in multiple regression. Practical Assessment, Research & Evaluation, 7(2). 26 May 2003. <http://ericae.net/pare/getvn.asp?v=7&n=2>.
Palmer, Michael. (2001). Multiple Regression. Oklahoma State University. 26 May 2003. <http://www.okstate.edu/artsci/botany/ordinate/MULTIPLE.htm>.
StatSoft, Inc. (2003). Multiple Regression. 26 May 2003. <http://www.statsoftinc.com/textbook/stmulreg.html>.