POSTED ON: 2/29/2012
Regression Analysis

Chapter 10
Regression and Correlation

Techniques that are used to establish whether there is a mathematical relationship between two or more variables, so that the behavior of one variable can be used to predict the behavior of others. Applicable to “Variables” data only.

• “Regression” provides a functional relationship (Y = f(X)) between the variables; the function represents the “average” relationship.
• “Correlation” tells us the direction and the strength of the relationship.

The analysis starts with a scatter plot of Y vs X.

Simple Linear Regression

What is it?
Determines if Y depends on X and provides a math equation for the relationship (continuous data).

Examples:
• Process conditions and product properties
• Sales and advertising budget

[Figure: scatter plot of Y vs X. Does Y depend on X? Which line is correct?]

Simple Linear Regression

A simple linear relationship can be described mathematically by

Y = mX + b

where
m = slope = rise / run
b = Y intercept = the Y value at the point where the line intersects the Y axis

Simple Linear Regression

[Figure: a line plotted for X from 0 to 10, with rise and run marked]

slope = rise / run = (6 - 3) / (10 - 4) = 1/2
intercept = 1

Y = 0.5X + 1

Simple regression example

An agent for a residential real estate company in a large city would like to predict the monthly rental cost for apartments based on the size of the apartment as defined by square footage. A sample of 25 apartments in a particular residential neighborhood was selected to gather the information.

The data on size and rent for the 25 apartments will be analyzed in EXCEL.

Size (square feet)   Rent ($)
 850                  950
1450                 1600
1085                 1200
1232                 1500
 718                  950
1485                 1700
1136                 1650
 726                  935
 700                  875
 956                 1150
1100                 1400
1285                 1650
1985                 2300
1369                 1800
1175                 1400
1225                 1450
1245                 1100
1259                 1700
1150                 1200
 896                 1150
1361                 1600
1040                 1650
 755                 1200
1000                  800
1200                 1750

Scatter plot

[Figure: scatter plot of Rent (vertical axis) vs Size (horizontal axis)]

The scatter plot suggests that there is a ‘linear’ relationship between Rent and Size.

Interpreting EXCEL output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.85
R Square            0.72
Adjusted R Square   0.71
Standard Error      194.60
Observations        25

ANOVA
             df   SS            MS            F             Significance F
Regression   1    2268776.545   2268776.545   59.91376452   7.51833E-08
Residual     23   870949.4547   37867.3676
Total        24   3139726

            Coefficients   Standard Error   t Stat   P-value       Lower 95%   Upper 95%
Intercept   177.121        161.004          1.100    0.282669853   -155.942    510.184
Size        1.065          0.138            7.740    7.51833E-08   0.780       1.350

Regression Equation
Rent = 177.121 + 1.065*Size

Interpretation of the regression coefficient

What does the coefficient of Size mean?
For every additional square foot, Rent goes up by $1.065 on average.

Using regression for prediction

Predict monthly rent when apartment size is 1000 square feet:
Regression Equation: Rent = 177.121 + 1.065*Size
Thus, when Size = 1000,
Rent = 177.121 + 1.065*1000 = $1242 (rounded)

Using regression for prediction: Caution!

• The regression equation is valid only over the range over which it was estimated. We should interpolate.
• Do not use the equation to predict Y when the X values are not within the range of data used to develop the equation. Extrapolation can be risky.
• Thus, we should not use the equation to predict rent for an apartment whose size is 500 square feet, since this value is not in the range of size values used to create the regression equation (700 to 1985 square feet).

Why extrapolation is risky

[Figure: sample data with the extrapolated relationship and the true relationship]

In this figure, we fit our regression model using sample data, but the linear relation implicit in our regression model does not hold outside our sample! By extrapolating, we are making erroneous estimates!
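The fit and the prediction above can be reproduced outside EXCEL. The following is a minimal sketch in plain Python, assuming the standard least-squares formulas (slope = Sxy / Sxx, intercept = mean(Y) - slope * mean(X)). The function names fit_line and predict_rent are illustrative and do not come from the slides. The range check is one possible way to enforce the "interpolate only" caution.

```python
# Size (square feet) and monthly Rent ($) for the 25 sampled apartments.
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
         1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
         1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: co-variation of X and Y; Sxx: variation of X about its mean.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_rent(size, intercept, slope, x_min, x_max):
    """Predict rent, refusing to extrapolate outside the fitted range of X."""
    if not x_min <= size <= x_max:
        raise ValueError(
            f"size {size} is outside the fitted range [{x_min}, {x_max}]"
        )
    return intercept + slope * size


intercept, slope = fit_line(sizes, rents)
rent_1000 = predict_rent(1000, intercept, slope, min(sizes), max(sizes))

print(f"Rent = {intercept:.3f} + {slope:.3f} * Size")
print(f"Predicted rent at 1000 square feet: ${rent_1000:.0f}")
```

Calling predict_rent with a size of 500 raises a ValueError in this sketch, because 500 is below the smallest apartment in the sample (700 square feet).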
Correlation (r)

The “correlation coefficient”, r, is a measure of the strength and the direction of the linear relationship between two variables. Values of r range from +1 (perfect direct relationship), through 0 (no linear relationship), to -1 (perfect inverse relationship). It measures the degree of scatter of the points around the “Least Squares” regression line.

Coefficient of correlation from EXCEL

In the EXCEL summary output, under Regression Statistics:
Multiple R   0.85

The sign of r is the same as that of the coefficient of X (Size) in the regression equation (in our case the sign is positive). Also, if you look at the scatter plot, you will note that the sign should be positive. R = 0.85 suggests a fairly ‘strong’ correlation between size and rent.

Coefficient of determination (r²)

The “coefficient of determination”, r-squared (sometimes R-squared), is the proportion of the variation in Y that is attributable to variation in X.

Getting r² from EXCEL

In the EXCEL summary output, under Regression Statistics:
R Square   0.72

It is important to remember that r-squared is never negative. It is the square of the coefficient of correlation r, so it lies between 0 and 1.
In our case, r² = 0.72 suggests that 72% of the variation in Rent is explained by the variation in Size. The higher the value of r², the better the simple regression model fits the data.

Standard error (SE)

Standard error measures the variability or scatter of the observed values around the regression line.

[Figure: plot of Rent ($) against Size (square feet)]

Getting the standard error (SE) from EXCEL

In the EXCEL summary output, under Regression Statistics:
Standard Error   194.60

In our example, the standard error associated with estimating rent is $194.60.

Is the simple regression model statistically valid?

It is important to test whether the regression model developed from sample data is statistically valid. For simple regression, we can use 2 approaches to test whether the coefficient of X is equal to zero:
1. using the t-test
2. using ANOVA

Is the coefficient of X equal to zero?

In both cases, the hypotheses we test are:
H0: Slope = 0
H1: Slope ≠ 0

What could we say about the linear relationship between X and Y if the slope were zero?
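The three summary measures discussed above (r, r-squared and the standard error) can be recomputed from the 25 data points. This is a sketch in plain Python, assuming the usual textbook formulas: r = Sxy / sqrt(Sxx * Syy), and standard error = sqrt(SSE / (n - 2)), where n - 2 = 23 is the residual degrees of freedom shown in the EXCEL output. The variable names are illustrative.

```python
import math

# Size (square feet) and monthly Rent ($) for the 25 sampled apartments.
sizes = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
         1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rents = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
         1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(rents) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in sizes)
syy = sum((y - mean_y) ** 2 for y in rents)   # total SS in the ANOVA table
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rents))

# Correlation coefficient and coefficient of determination.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Residual sum of squares around the fitted line, then the standard error.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(sizes, rents))
std_error = math.sqrt(sse / (n - 2))

print(f"r = {r:.2f}, r-squared = {r_squared:.2f}, standard error = {std_error:.2f}")
```

Because the slope is positive here, r comes out positive, which agrees with the sign rule stated for the correlation coefficient.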
Using coefficient information for testing if slope = 0

From the Coefficients table of the EXCEL summary output:

            Coefficients   Standard Error   t Stat   P-value
Size        1.065          0.138            7.740    7.51833E-08

P-value = 7.52E-08 = 7.52 × 10^-8 = 0.0000000752

t-stat = 7.740 and P-value = 7.52E-08. The P-value is very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. If α = 0.05, we reject the null and conclude that the slope is not zero. The same result holds at α = 0.01 because the P-value is smaller than 0.01. Thus, at the 0.05 (or 0.01) level, we conclude that the slope is NOT zero, implying that our model is statistically valid.

Using ANOVA for testing if slope = 0 in EXCEL

From the ANOVA table of the EXCEL summary output:

             df   SS            MS            F             Significance F
Regression   1    2268776.545   2268776.545   59.91376452   7.51833E-08
Residual     23   870949.4547   37867.3676
Total        24   3139726

F = 59.91376 and P-value = 7.51833E-08. The P-value is again very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. Thus, at the 0.05 (or 0.01) level, the slope is NOT zero, implying that our model is statistically valid. This is the same conclusion we reached using the t-test.
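The two tests agree because, in simple regression with one X variable, the F statistic is the square of the slope's t statistic. The sketch below checks this using only the numbers printed in the EXCEL output. It assumes the standard ANOVA definitions MS = SS / df and F = MS(regression) / MS(residual). It does not compute the P-value itself, which would need a t or F distribution function, and simply compares the reported P-value with α.

```python
import math

# Values copied from the ANOVA table in the EXCEL summary output.
ss_regression = 2268776.545
ss_residual = 870949.4547
df_regression = 1
df_residual = 23

# Mean squares and the F statistic.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# With a single X variable, sqrt(F) equals the absolute t statistic of the slope.
t_from_f = math.sqrt(f_stat)

# Decision rule: reject H0 (slope = 0) when the P-value is below alpha.
p_value = 7.51833e-08  # reported by EXCEL for both the t-test and the F-test
for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha}: {decision}")

print(f"F = {f_stat:.5f}, sqrt(F) = {t_from_f:.3f}")
```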
Confidence interval for the slope of Size

From the Coefficients table of the EXCEL summary output:

            Coefficients   Standard Error   Lower 95%   Upper 95%
Size        1.065          0.138            0.780       1.350

The 95% CI tells us that, with 95% confidence, each 1 square foot increase in apartment Size is associated with an average increase in Rent of between $0.78 and $1.35.

Summary

• Simple regression is a statistical tool that attempts to fit a straight line relationship between X (independent variable) and Y (dependent variable).
• The scatter plot gives us a visual clue about the nature of the relationship between X and Y.
• EXCEL, or other statistical software, is used to ‘fit’ the model; a good model will be statistically valid, and will have a reasonably high R-squared value.
• A good model is then used to make predictions; when making predictions, be sure to confine them within the domain of the X values used to fit the model (i.e. interpolate); we should avoid extrapolation.
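The 95% confidence limits reported for the coefficients can be reproduced by hand as coefficient ± t_crit * standard error. The sketch below assumes a critical value of 2.0687, the two-tailed 95% value of the t distribution with 23 degrees of freedom taken from a standard t table. That number does not appear in the EXCEL output. Because the inputs are the rounded values printed by EXCEL, the results match the reported limits only to about two decimal places.

```python
# Two-tailed 95% critical value of t with 23 degrees of freedom.
# Assumption: looked up in a standard t table, not shown in the EXCEL output.
T_CRIT_95_DF23 = 2.0687


def confidence_interval(coefficient, standard_error, t_crit=T_CRIT_95_DF23):
    """Return (lower, upper) limits: coefficient +/- t_crit * standard_error."""
    margin = t_crit * standard_error
    return coefficient - margin, coefficient + margin


# Rounded coefficients and standard errors as printed in the EXCEL output.
slope_low, slope_high = confidence_interval(1.065, 0.138)
icpt_low, icpt_high = confidence_interval(177.121, 161.004)

print(f"Slope:     {slope_low:.3f} to {slope_high:.3f}")
print(f"Intercept: {icpt_low:.3f} to {icpt_high:.3f}")

# The slope interval excludes zero, which agrees with rejecting H0: slope = 0.
slope_interval_excludes_zero = slope_low > 0 or slope_high < 0
```

The intercept interval, by contrast, includes zero, which is consistent with the intercept's large P-value (0.28) in the same output.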