CHAPTER 16
MULTIPLE REGRESSION AND CORRELATION

SECTION EXERCISES

16.1 d/p/e Simple linear regression involves only one independent variable; multiple regression involves two or more independent variables. Multiple regression analysis is preferred whenever two or more variables impact upon the dependent variable.

16.2 d/p/e As with simple regression analysis, multiple regression analysis is used in determining and interpreting the linear relationship between the dependent and independent variables. Correlation analysis measures the strength of the relationship.

16.3 d/p/m Many variables could affect the annual household expenditure for auto maintenance and repair: the number of cars owned, the number of miles driven each year, the age(s) of the car(s), and the make(s) of the car(s). These are just a few of the many variables that could have a notable effect.

16.4 d/p/m The director may wish to examine the personnel file for the following variables: the number of vacation days taken last year, the number of personal days taken last year, the number of times late to work last year, the number of conferences scheduled with the employee's superior, and the number of days called in sick the previous year.

16.5 d/p/e The multiple regression model is:
yi = β0 + β1x1i + β2x2i + ... + βkxki + εi
where yi = a value of the dependent variable, y
β0 = a constant
x1i, x2i, ..., xki = values of the independent variables x1, x2, ..., xk
β1, β2, ..., βk = partial regression coefficients for the independent variables x1, x2, ..., xk
εi = random error, or residual

16.6 d/p/m In terms of the residual component of the model, the assumptions underlying multiple regression are:
1. For any given set of values for the independent variables, the population of residuals will be normally distributed with a mean of zero and a standard deviation of σε.
2. The standard deviation of the error terms is the same regardless of the combination of values taken on by the independent variables.
3.
The error terms are statistically independent of each other.

16.7 d/p/m When there are two independent variables, the regression equation can be thought of in terms of a geometric plane. When there are three or more independent variables, the regression equation becomes a mathematical entity called a hyperplane; it is impossible to visually summarize a regression with three or more independent variables because it would be in four or more dimensions.

16.8 c/a/e
a. The y-intercept, or constant term, is 100. The partial regression coefficient for x1 is 20; for x2, -3; and for x3, 120.
b. The estimated value of y is ŷ = 100 + 20(12) - 3(5) + 120(10) = 1525.
c. If x3 were to increase by 4, the value of ŷ would increase by 480. To offset this increase, x2 would have to increase by 160, or 480/3.

16.9 p/a/e
a. The y-intercept is 300; the partial regression coefficients are 7 for x1 and 13 for x2.
b. If 3 people live in a 6-room home, the estimated bill is ŷ = 300 + 7(3) + 13(6) = 399.

16.10 p/a/e
a. The y-intercept or constant term is -0.1; this is the estimated total operating cost (in millions of dollars) when there is no labor cost and no power cost. (Note: it is very unlikely that a plant ever operates without incurring either labor or power costs; this estimate is very suspect. We must be careful when making estimates based on x values that lie beyond the range of the underlying data.) The partial regression coefficient for the labor cost is 1.1; this indicates that, for a given level of electric power cost, the estimated operating cost will increase by $1.10 for each additional $1 incurred in labor costs. The partial regression coefficient for the electric power cost is 2.8; this indicates that, for a given level of labor cost, the estimated operating cost will increase by $2.80 for each additional $1 increase in electric power cost.
b.
If labor costs $6 million and electric power costs $0.3 million, the estimated annual cost to operate the plant is: ŷ = -0.1 + 1.1(6) + 2.8(0.3) = $7.34 million.

16.11 p/c/m The Minitab printout is shown below.

Regression Analysis: Visitors versus AdSize, Discount
The regression equation is
Visitors = 10.7 + 2.16 AdSize + 0.0416 Discount

Predictor       Coef   SE Coef      T      P
Constant      10.687     3.875   2.76  0.040
AdSize        2.1569    0.6281   3.43  0.019
Discount     0.04157   0.04380   0.95  0.386

S = 3.375    R-Sq = 71.6%    R-Sq(adj) = 60.3%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       2  143.92  71.96  6.32  0.043
Residual Error   5   56.95  11.39
Total            7  200.87

Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI          95.0% PI
1        24.59    1.74  (20.12, 29.06)  (14.83, 34.35)

Values of Predictors for New Observations
New Obs  AdSize  Discount
1          5.00      75.0

a. The regression equation is Visitors = 10.687 + 2.1569*AdSize + 0.04157*Discount.
b. The y-intercept indicates that about 10 or 11 visitors (10.687) would come to the clubs if there were neither ads nor discounts. The partial regression coefficient for the ad data indicates that, holding the level of the discount constant, increasing the ad size by one column-inch will bring in about 2 new visitors (2.1569). Finally, the partial regression coefficient for the discount data indicates that, holding the size of the ad constant, an additional $1 discount will add 0.04157 to the number of visitors.
c. If the size of the ad is 5 column-inches and a $75 discount is offered, the estimated number of new visitors to the club is 24.59. See the "Fit" column in the printout.

The corresponding Excel multiple regression printout is shown below.
Visitors  Col-Inches  Discount
      23           4       100
      30           7        20
      20           3        40
      26           6        25
      20           2        50
      18           5        30
      17           4        25
      31           8        80

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8465
R Square              0.7165
Adjusted R Square     0.6031
Standard Error        3.3749
Observations               8

ANOVA
            df       SS      MS      F  Significance F
Regression   2  143.924  71.962  6.318           0.043
Residual     5   56.951  11.390
Total        7  200.875

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept         10.687           3.875   2.758    0.040      0.726     20.648
Col-Inches         2.157           0.628   3.434    0.019      0.542      3.771
Discount           0.042           0.044   0.949    0.386     -0.071      0.154

16.12 p/c/m The Minitab printout is shown below.

Regression Analysis: Overall versus Ride, Handling, Comfort
The regression equation is
Overall = 35.6 + 3.68 Ride + 2.89 Handling - 0.11 Comfort

Predictor     Coef  SE Coef      T      P
Constant     35.63    13.42   2.66  0.045
Ride         3.675    1.639   2.24  0.075
Handling     2.892    1.055   2.74  0.041
Comfort     -0.110    1.625  -0.07  0.949

S = 2.858    R-Sq = 75.6%    R-Sq(adj) = 61.0%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3  126.714  42.238  5.17  0.054
Residual Error   5   40.842   8.168
Total            8  167.556

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI            95.0% PI
1        82.937   2.493  (76.529, 89.345)  (73.188, 92.686)

Values of Predictors for New Observations
New Obs  Ride  Handling  Comfort
1        6.00      9.00     7.00

a. The regression equation is: Overall = 35.63 + 3.675*Ride + 2.892*Handling - 0.110*Comfort
b. The y-intercept indicates that a car that scores 0 on all three of the independent variables will receive an overall rating of 35.63. (This result should be considered cautiously since there were no 0 scores in the data used to estimate the regression.) The partial regression coefficient for Ride indicates that, holding the other two scores constant, an additional point in Ride will result in an overall rating that is 3.675 points higher.
The partial regression coefficient for Handling indicates that, holding the other two scores constant, an additional point in Handling will result in an overall rating that is 2.892 points higher. The partial regression coefficient for Comfort indicates that, holding the other two scores constant, an additional point in Comfort will result in an overall rating that is 0.110 points lower.
c. The estimated overall rating for a vehicle that scores 6 on Ride, 9 on Handling, and 7 on Comfort is 82.937. This can be calculated as 35.63 + 3.675(6) + 2.892(9) - 0.110(7). In the Minitab printout, refer to the "Fit" column.

The corresponding Excel multiple regression printout is shown below.

Rating  Ride  Handling  Comfort
    83     8         7        7
    86     8         8        8
    83     6         8        7
    83     8         7        9
    95     9         9        9
    84     8         8        9
    88     9         6        9
    82     7         8        7
    92     8         9        8

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.86963
R Square             0.75625
Adjusted R Square    0.61000
Standard Error       2.85803
Observations               9

ANOVA
            df        SS       MS       F  Significance F
Regression   3  126.7139  42.2380  5.1709          0.0543
Residual     5   40.8416   8.1683
Total        8  167.5556

           Coefficients  Standard Error    t Stat  P-value  Lower 95%  Upper 95%
Intercept      35.62642        13.41832   2.65506  0.04515    1.13360   70.11924
Ride            3.67543         1.63891   2.24260  0.07497   -0.53752    7.88838
Handling        2.89205         1.05540   2.74024  0.04078    0.17907    5.60502
Comfort        -0.11009         1.62469  -0.06776  0.94860   -4.28648    4.06631

16.13 p/c/m The Minitab printout is shown below.
Regression Analysis: Crispness versus OvenTime, Temp
The regression equation is
Crispness = - 127 + 7.61 OvenTime + 0.357 Temp

Predictor      Coef  SE Coef      T      P
Constant    -127.19    61.33  -2.07  0.072
OvenTime      7.611    3.873   1.97  0.085
Temp         0.3567   0.1177   3.03  0.016

S = 15.44    R-Sq = 58.6%    R-Sq(adj) = 48.2%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       2  2696.4  1348.2  5.65  0.029
Residual Error   8  1907.3   238.4
Total           10  4603.6

Predicted Values for New Observations
New Obs     Fit  SE Fit         95.0% CI           95.0% PI
1        -17.79   29.01  (-84.69, 49.11)  (-93.57, 57.99) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  OvenTime  Temp
1            5.00   200

a. The regression equation is Crispness = -127.19 + 7.611*OvenTime + 0.3567*Temp.
b. The y-intercept indicates that a crust that is not cooked will receive a crispness rating of -127.19. (Caution should be used in interpreting this value since there were no such extreme values in the data used to estimate the regression.) The partial regression coefficient for OvenTime indicates that, for a given temperature, an additional minute in the oven will add 7.611 points to the crispness rating. Likewise, the partial regression coefficient for Temp indicates that, for a given cooking time, a one-degree increase in the oven temperature will result in a 0.3567 increase in the crispness rating.
c. The estimated crispness rating for a pie that is cooked 5 minutes at 200 degrees is -17.79. See the "Fit" column of the Minitab printout or substitute OvenTime = 5 and Temp = 200 into the regression equation. This estimate should be viewed cautiously since the oven temperature is well beyond the limits of the data used to estimate the regression.

The corresponding Excel multiple regression printout is shown below.

Crispness  Time  Temp.
       68   6.0    460
       76   8.9    430
       49   8.8    360
       99   7.8    460
       90   7.3    390
       32   5.3    360
       96   8.8    420
       77   9.0    350
       94   8.0    450
       82   8.2    400
       97   6.4    450

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.7653
R Square              0.5857
Adjusted R Square     0.4821
Standard Error       15.4405
Observations              11

ANOVA
            df         SS         MS       F  Significance F
Regression   2  2696.3635  1348.1817  5.6549          0.0295
Residual     8  1907.2729   238.4091
Total       10  4603.6364

           Coefficients  Standard Error   t Stat  P-value  Lower 95%  Upper 95%
Intercept     -127.1896         61.3267  -2.0740   0.0718  -268.6093    14.2300
Time             7.6111          3.8732   1.9651   0.0850    -1.3205    16.5428
Temp.            0.3567          0.1177   3.0315   0.0163     0.0854     0.6281

16.14 p/c/m The Minitab printout is shown below.

Regression Analysis: Budget versus Attend, Acres, Species
The regression equation is
Budget = - 0.68 + 12.0 Attend + 0.0612 Acres - 0.0154 Species

Predictor      Coef   SE Coef      T      P
Constant     -0.681     6.600  -0.10  0.921
Attend       11.956     4.142   2.89  0.028
Acres       0.06115   0.03343   1.83  0.117
Species    -0.01538   0.01562  -0.98  0.363

S = 4.914    R-Sq = 77.7%    R-Sq(adj) = 66.6%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       3  506.06  168.69  6.99  0.022
Residual Error   6  144.88   24.15
Total            9  650.94

Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI         95.0% PI
1        23.18    2.97  (15.90, 30.45)  (9.12, 37.23)

Values of Predictors for New Observations
New Obs  Attend  Acres  Species
1          2.00    150      600

a. The regression equation is Budget = -0.681 + 11.956*Attend + 0.06115*Acres - 0.01538*Species.
b. The y-intercept indicates that a city zoo that has 0 attendance, occupies 0 acres, and features 0 species will have an annual budget of -0.681 million dollars. Naturally, there is no such zoo, and this result should be considered cautiously since there were no 0 scores in the data used to estimate the regression equation.
The partial regression coefficient for Attend indicates that, holding the other two independent variables constant, an additional 1 million in attendance will raise the estimated budget by $11.956 million. The partial regression coefficient for Acres indicates that, holding the other two independent variables constant, a 1-acre increase in space will increase the estimated budget by $0.06115 million. The partial regression coefficient for Species indicates that, holding the other two independent variables constant, bringing 1 additional species of animal into the park will decrease the estimated budget by $0.01538 million.
c. The estimated annual budget for a zoo that has 2.0 million annual attendance, occupies 150 acres, and has 600 animal species is $23.18 million. See the "Fit" column in the Minitab printout or substitute Attend = 2, Acres = 150, and Species = 600 into the regression equation.

The corresponding Excel multiple regression printout is shown below.

Budget  Attend  Acres  Species
  14.5     0.6    210      271
  35.0     2.0    216      400
   6.9     0.4     70      377
   9.0     1.0    125      277
   6.6     1.5     55      721
  17.2     1.3     80      400
  15.5     1.3     42      437
  21.0     2.5     91      759
  12.0     0.9    125      270
   9.6     1.1     92      260

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8817
R Square              0.7774
Adjusted R Square     0.6662
Standard Error        4.9139
Observations              10

ANOVA
            df        SS        MS       F  Significance F
Regression   3  506.0640  168.6880  6.9861          0.0220
Residual     6  144.8770   24.1462
Total        9  650.9410

           Coefficients  Standard Error    t Stat  P-value  Lower 95%  Upper 95%
Intercept      -0.68145         6.60013  -0.10325   0.9211   -16.8314    15.4685
Attend         11.95568         4.14185   2.88655   0.0278     1.8209    22.0904
Acres           0.06115         0.03343   1.82928   0.1171    -0.0206     0.1429
Species        -0.01538         0.01562  -0.98430   0.3630    -0.0536     0.0229

16.15 p/a/m
a.
The multiple regression equation is TEST04 = 11.98 + 0.2745*TEST01 + 0.37619*TEST02 + 0.32648*TEST03. The y-intercept indicates that an individual unit scoring 0 on the first three tests can expect to score 11.98 on the fourth test. (However, this is meaningless, since the test scores range from 200 to 800.) The partial regression coefficient for TEST01 indicates that, for a given set of scores on TEST02 and TEST03, a unit will gain 0.2745 points on TEST04 for an additional point on TEST01. Similarly, the partial regression coefficient for TEST02 implies that, for a given set of scores on TEST01 and TEST03, a unit will gain 0.37619 points on TEST04 for an additional point on TEST02. Likewise, the partial regression coefficient for TEST03 indicates an improvement of 0.32648 points on TEST04 for each additional point on TEST03, given a set of scores for TEST01 and TEST02.
b. If an individual unit has scored 350, 400, and 600 on the first three tests, its estimated score on the fourth test is: TEST04 = 11.98 + 0.2745(350) + 0.37619(400) + 0.32648(600) = 454.419.

16.16 p/a/m
a. First, we must determine the midpoint of the approximate 90% confidence interval:
ŷ = 11.98 + 0.2745(300) + 0.37619(500) + 0.32648(400) = 413.017
From the printout, we see that the multiple standard error of the estimate is 52.72, and we know that n = 12. With d.f. = 12 - 3 - 1 = 8, the appropriate t-value is 1.860. The approximate 90% confidence interval for the mean rating on test four for units that have been rated at 300, 500, and 400 on the first three tests is:
ŷ ± t(se/√n) = 413.017 ± 1.860(52.72/√12) = 413.017 ± 28.307 = (384.710, 441.324)
b. The approximate 90% prediction interval is:
ŷ ± t·se = 413.017 ± 1.860(52.72) = 413.017 ± 98.059 = (314.958, 511.076)

16.17 c/a/m
a. The mean of y is ŷ = 5.0 + 1.0(25) + 2.5(40) = 130.0.
b. The multiple standard error of the estimate is se = √(SSE/(n - k - 1)) = √(173.5/(20 - 2 - 1)) = 3.195.
c.
The approximate 95% confidence interval for the mean of y whenever x1 = 20 and x2 = 30 can be found in several steps. First, we must find the midpoint of the approximate confidence interval. This will be ŷ = 5.0 + 1.0(20) + 2.5(30) = 100. The degrees of freedom are 20 - 2 - 1 = 17, so the appropriate t-value is 2.110. The approximate confidence interval for the mean of y is:
ŷ ± t(se/√n) = 100 ± 2.110(3.195/√20) = 100 ± 1.507 = (98.493, 101.507)
d. The approximate 95% prediction interval for an individual y value when x1 = 20 and x2 = 30 is:
ŷ ± t·se = 100 ± 2.110(3.195) = 100 ± 6.741 = (93.259, 106.741)

16.18 p/a/m The solution can be obtained with formulas and calculator, but we will use Minitab and the printout below:

Regression Analysis: Rating versus Price, Perform, BattLife
The regression equation is
Rating = 65.0 - 0.00606 Price + 0.160 Perform + 1.25 BattLife

Predictor       Coef   SE Coef      T      P
Constant       64.98     19.54   3.33  0.029
Price      -0.006056  0.003189  -1.90  0.130
Perform       0.1601    0.1711   0.94  0.402
BattLife       1.250     2.277   0.55  0.612

S = 2.629    R-Sq = 59.3%    R-Sq(adj) = 28.8%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       3  40.347  13.449  1.95  0.264
Residual Error   4  27.653   6.913
Total            7  68.000

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI            95.0% PI
1        78.058   1.795  (73.073, 83.042)  (69.218, 86.897)

Values of Predictors for New Observations
New Obs  Price  Perform  BattLife
1         1000      100      2.50

a. The regression equation is Rating = 65.0 - 0.00606*Price + 0.160*Perform + 1.25*BattLife. For the population of computers that have a $1000 street price, a performance score of 100, and a 2.50-hour battery life, we are 95% confident that the mean rating of such computers will be within the interval from 73.073 to 83.042.
b. For an individual computer with a $1000 street price, a performance score of 100, and a 2.50-hour battery life, we are 95% confident that the rating for this particular computer will be within the interval from 69.218 to 86.897.
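The approximate intervals used in exercises 16.16 and 16.17 can be computed directly from the printout summaries. A minimal Python sketch (the function name is our own, not from the text), using the formulas ŷ ± t(se/√n) for the mean of y and ŷ ± t·se for an individual y value:

```python
from math import sqrt

def approx_intervals(y_hat, s_e, n, t_crit):
    """Approximate confidence interval for the mean of y and prediction
    interval for an individual y, given the fitted value y_hat, the
    multiple standard error of the estimate s_e, the sample size n, and
    the table t-value t_crit for n - k - 1 degrees of freedom."""
    half_ci = t_crit * s_e / sqrt(n)   # CI half-width: t * (s_e / sqrt(n))
    half_pi = t_crit * s_e             # PI half-width: t * s_e
    return ((y_hat - half_ci, y_hat + half_ci),
            (y_hat - half_pi, y_hat + half_pi))

# Exercise 16.17, parts c and d: y_hat = 100, s_e = 3.195, n = 20,
# k = 2 predictors, so d.f. = 17 and the table t-value is 2.110
ci, pi = approx_intervals(100.0, 3.195, 20, 2.110)
print(ci)  # approximately (98.493, 101.507)
print(pi)  # approximately (93.259, 106.741)
```

As the exact Minitab intervals in later exercises show, these approximations understate the interval width when the specified x values are far from their means.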
16.19 p/c/m The solution can be obtained with formulas and calculator, but we will use Minitab and the printout below:

Regression Analysis: CalcFin versus MathPro, SATQ
The regression equation is
CalcFin = - 26.6 + 0.776 MathPro + 0.0820 SATQ

Predictor     Coef  SE Coef      T      P
Constant    -26.62    17.18  -1.55  0.172
MathPro     0.7763   0.1465   5.30  0.002
SATQ       0.08202  0.02699   3.04  0.023

S = 4.027    R-Sq = 88.5%    R-Sq(adj) = 84.7%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  751.57  375.78  23.17  0.002
Residual Error   6   97.32   16.22
Total            8  848.89

Predicted Values for New Observations
New Obs    Fit  SE Fit        90.0% CI         90.0% PI
1        68.74    2.43  (64.01, 73.46)  (59.59, 77.88)

Values of Predictors for New Observations
New Obs  MathPro  SATQ
1           70.0   500

a. The regression equation is CalcFin = -26.6 + 0.776*MathPro + 0.0820*SATQ. For the population of entering freshmen who scored 70 on the math proficiency test and 500 on the quantitative portion of the SAT exam, we are 90% confident that their mean calculus final exam score will be within the interval from 64.01 to 73.46.
b. For an individual entering freshman who scored 70 on the math proficiency test and 500 on the quantitative portion of the SAT exam, we are 90% confident that his or her calculus final exam score will be within the interval from 59.59 to 77.88.

16.20 p/a/m
a. Using only the regression equation and summary information obtained in exercise 16.11, we can determine the approximate 95% confidence interval for the mean number of new visitors for clubs using 5 column-inch ads and offering an $80 discount. First, the midpoint of the interval will be ŷ = 10.687 + 2.1569(5) + 0.04157(80) = 24.797. Eight observations were used to estimate the regression, so d.f. = 8 - 2 - 1 = 5. The appropriate t-value is 2.571, the multiple standard error of the estimate is 3.375, and the approximate 95% confidence interval is:
ŷ ± t(se/√n) = 24.797 ± 2.571(3.375/√8) = 24.797 ± 3.068 = (21.729, 27.865)
b.
The corresponding approximate 95% prediction interval is:
ŷ ± t·se = 24.797 ± 2.571(3.375) = 24.797 ± 8.677 = (16.120, 33.474)
The preceding are the approximate intervals that could be calculated based only on the information shown in the printouts for exercise 16.11. As discussed in the text, the exact intervals will tend to be wider than the approximate intervals. This is because the exact intervals take into account that the specified values for x1 and x2 may differ from their respective means. The exact Minitab intervals corresponding to parts a and b of this exercise are: 95% confidence interval, (19.91, 29.69); 95% prediction interval, (14.84, 34.76).

16.21 p/a/m
a. Using only the regression equation and summary information obtained in exercise 16.12, we can determine the approximate 95% confidence interval for the mean overall rating of cars that receive ratings of 8 on ride, 7 on handling, and 9 on driver comfort. First, the midpoint is:
ŷ = 35.63 + 3.675(8) + 2.892(7) - 0.110(9) = 84.284
There were nine observations used to estimate the regression, so d.f. = 9 - 3 - 1 = 5. The appropriate t-value is 2.571, the multiple standard error of the estimate is 2.858, and the approximate 95% confidence interval is:
ŷ ± t(se/√n) = 84.284 ± 2.571(2.858/√9) = 84.284 ± 2.449 = (81.835, 86.733)
b. The corresponding approximate 95% prediction interval is:
ŷ ± t·se = 84.284 ± 2.571(2.858) = 84.284 ± 7.348 = (76.936, 91.632)
The preceding are the approximate intervals that could be calculated based only on the information shown in the printouts for exercise 16.12. As discussed in the text, the exact intervals will tend to be wider than the approximate intervals. This is because the exact intervals take into account that the specified values for x1, x2, and x3 may differ from their respective means. The exact Minitab intervals corresponding to parts a and b of this exercise are: 95% confidence interval, (79.587, 88.980); 95% prediction interval, (75.562, 93.005).

16.22 p/a/m
a.
Using only the regression equation and summary information obtained in exercise 16.13, we can determine the approximate 95% confidence interval for the mean crispness rating for pies that are cooked 5.0 minutes at 300 degrees. First, the midpoint is:
ŷ = -127.19 + 7.611(5) + 0.3567(300) = 17.875
There were eleven observations used to estimate the regression, so d.f. = 11 - 2 - 1 = 8. The appropriate t-value is 2.306, the multiple standard error of the estimate is 15.44, and the approximate 95% confidence interval is:
ŷ ± t(se/√n) = 17.875 ± 2.306(15.44/√11) = 17.875 ± 10.735 = (7.140, 28.610)
b. The corresponding approximate 95% prediction interval is:
ŷ ± t·se = 17.875 ± 2.306(15.44) = 17.875 ± 35.605 = (-17.730, 53.480)
The preceding are the approximate intervals that could be calculated based only on the information shown in the printouts for exercise 16.13. As discussed in the text, the exact intervals will tend to be wider than the approximate intervals. This is because the exact intervals take into account that the specified values for x1 and x2 may differ from their respective means. The exact Minitab intervals corresponding to parts a and b of this exercise are: 95% confidence interval, (-25.31, 61.07); 95% prediction interval, (-38.10, 73.86).

16.23 d/p/e The coefficient of multiple determination (R²) is analogous to the coefficient of determination in simple linear regression. It is the proportion of the variation in y that is explained by the multiple regression equation.

16.24 d/p/m SST is the total variation in the y values, SSR is the variation in the y values that is explained by the regression, and SSE is the variation in the y values that is not explained by the regression. The coefficient of multiple determination is equal to 1 - (SSE/SST), or SSR/SST. If SSE is small compared to SST, SSR will be large compared to SST, and the multiple regression equation will explain a large portion of the variation in y. Recall that SST = SSR + SSE.
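The decomposition described in 16.24 can be checked numerically from any of the ANOVA tables above. A small Python sketch (our own illustration, not from the text), using the sums of squares from the printout in exercise 16.11 (SSR = 143.92, SSE = 56.95):

```python
def r_squared(ssr, sse):
    """Coefficient of multiple determination: R^2 = SSR/SST = 1 - SSE/SST,
    where SST = SSR + SSE."""
    sst = ssr + sse
    # The two forms agree because SST = SSR + SSE
    assert abs((ssr / sst) - (1 - sse / sst)) < 1e-9
    return ssr / sst

# ANOVA values from the Minitab printout in exercise 16.11
r_sq = r_squared(143.92, 56.95)
print(round(r_sq, 3))  # 0.716, matching R-Sq = 71.6% on the printout
```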
16.25 d/p/e The coefficient of multiple determination for exercise 16.15 is 0.872. This means that 87.2% of the variation in scores on the fourth test can be explained by variations in scores on the first three tests.

16.26 p/c/m The coefficient of multiple determination for the regression equation obtained in exercise 16.12 is 0.756. This indicates that 75.6% of the variation in overall ratings is explained by the regression equation.

16.27 p/c/m The coefficient of multiple determination for the regression equation obtained in exercise 16.11 is 0.716. This indicates that 71.6% of the variation in the number of new visitors to the club is explained by the regression equation.

16.28 d/p/d Both of these tests will reach the same conclusion. If the confidence interval for β3 does not include zero, the hypothesis test will reject the null hypothesis. On the other hand, if the confidence interval for β3 does contain zero, the hypothesis test will not reject the null hypothesis.

16.29 p/c/m We will base much of our discussion on the Minitab printout for exercise 16.11. The results will be similar if you refer to the Excel printout.
a. The appropriate null and alternative hypotheses are:
H0: β1 = β2 = 0 and H1: βj ≠ 0, for j = 1 or 2
From the ANOVA portion of the Minitab printout, we have:

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       2  143.92  71.96  6.32  0.043
Residual Error   5   56.95  11.39
Total            7  200.87

The p-value for the ANOVA test of the overall significance of the regression equation is 0.043. Since p-value = 0.043 is less than the α = 0.05 level of significance for the test, we reject H0. At this level, there is evidence to suggest that the regression equation is significant.
b.
From the upper portion of the Minitab printout:

The regression equation is
Visitors = 10.7 + 2.16 AdSize + 0.0416 Discount

Predictor      Coef  SE Coef     T      P
Constant     10.687    3.875  2.76  0.040
AdSize       2.1569   0.6281  3.43  0.019
Discount    0.04157  0.04380  0.95  0.386

S = 3.375    R-Sq = 71.6%    R-Sq(adj) = 60.3%

Here we are asked to conduct two hypothesis tests. We will not test the y-intercept since this test is generally not of practical importance. The appropriate null and alternative hypotheses are:
Test for β1: H0: β1 = 0 and H1: β1 ≠ 0
Test for β2: H0: β2 = 0 and H1: β2 ≠ 0
The p-value for the test of β1 is 0.019. Since p-value = 0.019 is less than the α = 0.05 level of significance for the test, we reject H0. At this level, there is evidence to suggest that β1 is nonzero. The p-value for the test of β2 is 0.386. Since p-value = 0.386 is not less than the α = 0.05 level of significance for the test, we do not reject H0. At this level, there is no evidence to suggest that β2 is nonzero.
c. The ANOVA test for the overall regression indicates that the regression explains a significant proportion of the variation in the number of new visitors to the club. The tests for the individual partial regression coefficients indicate that the size of the ad contributes to the explanatory power of the model, while the discount offered does not.
d. With d.f. = 8 - 2 - 1 = 5, the appropriate t-value for the 95% confidence interval will be 2.571. The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·sb1 = 2.1569 ± 2.571(0.6281) = 2.1569 ± 1.6148 = (0.5421, 3.7717)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·sb2 = 0.04157 ± 2.571(0.04380) = 0.04157 ± 0.1126 = (-0.0710, 0.1542)
With Excel, we can obtain confidence intervals for the population regression coefficients along with the standard regression output. Excel will provide 95% confidence intervals, but we can also specify the inclusion of 90% or any other confidence levels we wish to see.
The Excel printout for exercise 16.11 included 95% confidence intervals for β1 and β2.

16.30 p/c/m We will base much of our discussion on the Minitab printout for exercise 16.12. The results will be similar if you refer to the Excel printout.
a. The appropriate null and alternative hypotheses are:
H0: β1 = β2 = β3 = 0 and H1: βj ≠ 0, for j = 1, 2, or 3
From the ANOVA portion of the Minitab printout, we have:

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3  126.714  42.238  5.17  0.054
Residual Error   5   40.842   8.168
Total            8  167.556

The p-value for the ANOVA test of the overall significance of the regression equation is 0.054. Since p-value = 0.054 is not less than the α = 0.05 level of significance for the test, we do not reject H0. At this level, there is no evidence to suggest that the regression equation is significant.
b. From the upper portion of the Minitab printout:

The regression equation is
Overall = 35.6 + 3.68 Ride + 2.89 Handling - 0.11 Comfort

Predictor     Coef  SE Coef      T      P
Constant     35.63    13.42   2.66  0.045
Ride         3.675    1.639   2.24  0.075
Handling     2.892    1.055   2.74  0.041
Comfort     -0.110    1.625  -0.07  0.949

S = 2.858    R-Sq = 75.6%    R-Sq(adj) = 61.0%

Here we are asked to conduct three hypothesis tests. We will not test the y-intercept since this test is generally not of practical importance. The appropriate null and alternative hypotheses are:
Test for β1: H0: β1 = 0 and H1: β1 ≠ 0
Test for β2: H0: β2 = 0 and H1: β2 ≠ 0
Test for β3: H0: β3 = 0 and H1: β3 ≠ 0
The p-value for the test of β1 is 0.075. Since p-value = 0.075 is not less than the α = 0.05 level of significance for the test, we do not reject H0. At this level, there is no evidence to suggest that β1 is nonzero. The p-value for the test of β2 is 0.041. Since p-value = 0.041 is less than the α = 0.05 level of significance for the test, we reject H0. At this level, there is evidence to suggest that β2 is nonzero. The p-value for the test of β3 is 0.949. Since p-value = 0.949 is not less than the α = 0.05 level of significance for the test, we do not reject H0.
At this level, there is no evidence to suggest that β3 is nonzero.
c. The ANOVA test for the overall regression indicates that the regression does not explain a significant (at the 0.05 level) proportion of the variation in the overall ratings. In only one case, that for β2 (associated with handling), does an individual hypothesis test indicate that a population regression coefficient could be nonzero.
d. With d.f. = 9 - 3 - 1 = 5, the appropriate t-value for the 95% confidence interval will be 2.571. The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·sb1 = 3.675 ± 2.571(1.639) = 3.675 ± 4.214 = (-0.54, 7.89)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·sb2 = 2.892 ± 2.571(1.055) = 2.892 ± 2.712 = (0.18, 5.60)
The 95% confidence interval for population partial regression coefficient β3 is:
b3 ± t·sb3 = -0.110 ± 2.571(1.625) = -0.110 ± 4.178 = (-4.29, 4.07)
With Excel, we can obtain confidence intervals for the population regression coefficients along with the standard regression output. Excel will provide 95% confidence intervals, but we can also specify the inclusion of 90% or any other confidence levels we wish to see. The Excel printout for exercise 16.12 already included 95% confidence intervals for β1, β2, and β3. Here is a repeat of the lower portion of that Excel printout:

ANOVA
            df        SS       MS       F  Significance F
Regression   3  126.7139  42.2380  5.1709          0.0543
Residual     5   40.8416   8.1683
Total        8  167.5556

           Coefficients  Standard Error    t Stat  P-value  Lower 95%  Upper 95%
Intercept      35.62642        13.41832   2.65506  0.04515    1.13360   70.11924
Ride            3.67543         1.63891   2.24260  0.07497   -0.53752    7.88838
Handling        2.89205         1.05540   2.74024  0.04078    0.17907    5.60502
Comfort        -0.11009         1.62469  -0.06776  0.94860   -4.28648    4.06631

16.31 p/a/m To determine the 90% confidence interval for each partial regression coefficient in exercise 16.15, we must determine the appropriate t.
There are (12 - 3 - 1) = 8 degrees of freedom, so the appropriate t is t = 1.860. The 90% confidence interval for population partial regression coefficient β1 is:
b1 ± t·sb1 = 0.2745 ± 1.860(0.1111) = 0.2745 ± 0.2066 = (0.0679, 0.4811)
This confidence interval does not contain zero, so it is likely that the variation in the scores on test 1 does contribute significantly to the explanation of the variation in the scores on test 4. The 90% confidence interval for population partial regression coefficient β2 is:
b2 ± t·sb2 = 0.37619 ± 1.860(0.09858) = 0.37619 ± 0.18336 = (0.1928, 0.5596)
This confidence interval does not contain zero, so it is likely that the variation in the scores on test 2 does contribute significantly to the explanation of the variation in the scores on test 4. The 90% confidence interval for population partial regression coefficient β3 is:
b3 ± t·sb3 = 0.32648 ± 1.860(0.08084) = 0.32648 ± 0.15036 = (0.1761, 0.4768)
This confidence interval does not contain zero, so it is likely that the variation in the scores on test 3 does contribute significantly to the explanation of the variation in the scores on test 4.

16.32 p/c/m Referring to the Minitab printout in the solution to exercise 16.18:
a. In the ANOVA test for overall significance, p-value = 0.264 is not less than the 0.10 level of significance, so we conclude that the overall regression is not significant. At this level, all of the population partial regression coefficients could be zero.
b. In testing the partial regression coefficients for price, performance, and battery life, the p-values are 0.130, 0.402, and 0.612, respectively. None of these is less than the 0.10 level of significance being used to reach a conclusion. None of the three partial regression coefficients differs significantly from zero.

16.33 p/c/m Referring to the Minitab printout in the solution to exercise 16.19:
a.
In the ANOVA test for overall significance, p-value = 0.002 is < the 0.05 level of significance, so we conclude that the overall regression is significant.
b. In testing the partial regression coefficients for math proficiency test score and SAT quantitative score, the p-values are 0.002 and 0.023, respectively. Each p-value is < the 0.05 level of significance, and each of the partial regression coefficients differs significantly from zero.
16.34 p/c/m The Minitab printout is shown below.

Regression Analysis: Est P/E Rati versus Revenue%Grow, Earn/Share %

The regression equation is
Est P/E Ratio = 51.7 - 0.103 Revenue%Growth + 0.0143 Earn/Share %Growth

96 cases used  4 cases contain missing values

Predictor      Coef   SE Coef      T      P
Constant      51.73     15.66   3.30  0.001
Revenue%    -0.1027    0.2171  -0.47  0.637
Earn/Sha    0.01431   0.07316   0.20  0.845

S = 64.30   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source           DF      SS    MS     F      P
Regression        2     986   493  0.12  0.888
Residual Error   93  384502  4134
Total            95  385488

a. The regression equation is Est P/E Ratio = 51.7 - 0.103*Revenue%Growth + 0.0143*Earn/Share %Growth. The partial regression coefficient for revenue growth percentage is -0.103. On average, with earnings/share growth percentage fixed, a one percentage point increase in revenue growth percentage will be accompanied by a decrease of 0.103 in the estimated price/earnings ratio. The partial regression coefficient for earnings/share growth percentage is 0.0143. On average, with revenue growth percentage fixed, a one percentage point increase in earnings/share growth percentage will be accompanied by an increase of 0.0143 in the estimated price/earnings ratio.
b. The p-value in the ANOVA section of the printout is 0.888. This is not less than the 0.05 level of significance. At this level, the overall regression equation is not significant.
c. The p-values for the tests of the two partial regression coefficients are 0.637 and 0.845, respectively.
Neither p-value is less than the 0.05 level of significance, and we conclude that neither partial regression coefficient differs significantly from zero.
d. The 95% confidence interval for each partial regression coefficient could be calculated using formulas and a pocket calculator, as demonstrated in the solution to exercise 16.31. We will rely on the Excel printout, shown below. The 95% confidence interval for population partial regression coefficient β1 is from -0.5338 to 0.3285. The 95% confidence interval for population partial regression coefficient β2 is from -0.1310 to 0.1596. (Note: In applying Excel, it is necessary to delete the four cases that have missing data for one or more of these variables.)

SUMMARY OUTPUT

Regression Statistics
Multiple R             0.0506
R Square               0.0026
Adjusted R Square     -0.0189
Standard Error        64.2995
Observations               96

ANOVA
              df           SS         MS       F   Significance F
Regression     2     986.0007   493.0004  0.1192   0.8877
Residual      93  384501.9576  4134.4297
Total         95  385487.9583

                     Coefficients  Standard Error   t Stat  P-value  Lower 95%  Upper 95%
Intercept                 51.7305         15.6650   3.3023   0.0014    20.6230    82.8380
Revenue%Growth            -0.1027          0.2171  -0.4729   0.6374    -0.5338     0.3285
Earn/Share %Growth         0.0143          0.0732   0.1955   0.8454    -0.1310     0.1596

16.35 p/c/m The Minitab printout is shown below.

The regression equation is
$GroupRevenue = -40855482 + 44282 RetailUnits + 152760 NumDealrs

Predictor        Coef   SE Coef      T      P
Constant    -40855482  20217627  -2.02  0.046
RetailUn        44282      1290  34.33  0.000
NumDealr       152760   1943687   0.08  0.938

S = 176197662   R-Sq = 99.3%   R-Sq(adj) = 99.3%

Analysis of Variance
Source           DF           SS           MS        F      P
Regression        2  4.46877E+20  2.23439E+20  7197.11  0.000
Residual Error   95  2.94933E+18  3.10456E+16
Total            97  4.49827E+20

a. The regression equation is $GroupRevenue = -40,855,482 + 44,282*RetailUnits + 152,760*NumDealrs
The partial regression coefficient for RetailUnits is 44,282.
On average, with the number of dealers fixed, an increase of 1 in retail units sold is accompanied by an increase of $44,282 in revenue for the dealer group. The partial regression coefficient for NumDealrs is 152,760. On average, with the number of retail units fixed, an increase of 1 in the number of dealers will be accompanied by an increase of $152,760 in revenue for the dealer group.
b. The p-value in the ANOVA section of the printout is (to three decimal places) 0.000. This is less than the 0.02 level of significance. At this level, the overall regression equation is significant.
c. The p-values for the tests of the two partial regression coefficients are 0.000 and 0.938, respectively. Using the 0.02 level of significance, the partial regression coefficient for the first independent variable (retail units) is significantly different from zero, but the partial regression coefficient for the second independent variable (number of dealers) does not differ significantly from zero.
d. The 98% confidence interval for each partial regression coefficient could be calculated using formulas and a pocket calculator, as demonstrated in the solution to exercise 16.31. We will rely on the Excel printout, shown below. The 98% confidence interval for population partial regression coefficient β1 is from 41,229.4 to 47,333.8. The 98% confidence interval for population partial regression coefficient β2 is from -4,446,472 to 4,751,992.
SUMMARY OUTPUT

Regression Statistics
Multiple R             0.9967
R Square               0.9934
Adjusted R Square      0.9933
Standard Error      176197662
Observations               98

ANOVA
              df         SS         MS          F   Significance F
Regression     2  4.469E+20  2.234E+20  7.197E+03   1.960E-104
Residual      95  2.949E+18  3.105E+16
Total         97  4.498E+20

              Coefficients  Standard Error  t Stat  P-value  Lower 98.0%  Upper 98.0%
Intercept      -40855482.0      20217627.3  -2.021    0.046  -88695270.0    6984305.9
RetailUnits        44281.6         1289.89  34.330    0.000      41229.4      47333.8
NumDealrs         152760.2      1943686.79   0.079    0.938     -4446472      4751992

16.36 d/p/m The normal probability plot is used to examine whether the residuals could have come from a normally distributed population. One of the assumptions underlying multiple regression analysis is that the residuals are normally distributed with a mean of zero.
16.37 d/p/m Residual analysis can be used to examine the residuals with respect to the assumptions underlying multiple regression analysis. We can do many things with residual analysis, including: (1) constructing a histogram of the residuals as a rough check to see if they are approximately normally distributed, (2) constructing a normal probability plot or other normality test to examine whether the residuals could have come from a normally distributed population, (3) plotting the residuals versus each of the independent variables to see if they exhibit some cycle or pattern with respect to that variable, and (4) plotting the residuals versus the order in which the observations were recorded to look for autocorrelation.
16.38 p/p/m Referring to the printout given in exercise 16.15, we can determine the following:
a. The partial regression coefficient for TEST01 is 0.2745. This implies that, holding the scores on tests 2 and 3 constant, a one-point increase in the score on test 1 will result in a 0.2745-point increase in the score on test 4. The partial regression coefficient for TEST02 is 0.37619.
This implies that, for a given level of scores on tests 1 and 3, a one-point increase in the score on test 2 will result in a 0.37619-point increase in the score on test 4. Finally, the partial regression coefficient for TEST03 is 0.32648. This implies that, holding scores on tests 1 and 2 constant, a one-point increase in the score on test 3 will result in a 0.32648-point increase in the score on test 4.
b. 87.2% of the variation in y is explained by the equation.
c. The overall regression is significant at the 0.001 level.
d. The p-value for the partial regression coefficient for TEST01 is 0.039; for TEST02, 0.005; and for TEST03, 0.004. This would indicate that TEST03 contributes the most to the explanation of the variation in scores on test 4; however, TEST02 is almost as useful. TEST01 appears to be least useful, but it is still significant at the 0.05 level.
16.39 p/c/m
a. The histogram does not reveal any radical departures from a symmetric distribution. Of course, it is difficult to determine this with only eight data points.
[Histogram of the residuals (response is Visitors)]
b. In this Minitab test for normality, the points in the normal probability plot don't appear to deviate excessively from a straight line and the approximate p-value is shown as >0.15. There is nothing here to suggest that the residuals may not have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = -5.77316E-15, StDev = 2.852, N = 8, KS = 0.184, P-Value > 0.150]
c. Plots of residuals versus the independent variables.
Plot of residuals versus ad size.
[Residuals versus AdSize (response is Visitors)]
Plot of residuals versus discount size.
[Residuals versus Discount (response is Visitors)]
The plots above do not reveal any alarming problems. Overall, there is no strong evidence to indicate that any underlying assumptions of the multiple regression model have been violated.
16.40 p/c/m
a. The histogram does not reveal any radical departures from a symmetric distribution. Of course, there are only 9 points.
[Histogram of the residuals (response is Overall)]
b. In this Minitab test for normality, the points in the normal probability plot seem to deviate somewhat from the straight line, and the approximate p-value is shown as 0.056. This would seem to raise some suspicions; however, at the 0.05 level of significance, we would conclude that the residuals could have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = -2.68427E-14, StDev = 2.259, N = 9, KS = 0.271, P-Value = 0.056]
c. Plots of residuals versus the independent variables. Considering the small n, the plots below do not reveal any alarming problems. Overall, there is no strong evidence to indicate that any underlying assumptions of the multiple regression model have been violated.
Plot of residuals versus Ride.
[Residuals versus Ride (response is Overall)]
Plot of residuals versus Handling.
[Residuals versus Handling (response is Overall)]
Plot of residuals versus Comfort.
[Residuals versus Comfort (response is Overall)]
16.41 p/c/m
a. This histogram does not appear to reveal any radical departures from a symmetric distribution, although there are relatively few data points.
[Histogram of the residuals (response is Crispness)]
b. In this Minitab test for normality, the points in the normal probability plot don't appear to deviate excessively from a straight line and the approximate p-value is shown as >0.15. There is nothing here to suggest that the residuals may not have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = 2.583792E-15, StDev = 13.81, N = 11, KS = 0.130, P-Value > 0.150]
c. Plots of residuals versus the independent variables. The plots below do not reveal any alarming problems. Overall, there is no strong evidence to indicate that any underlying assumptions of the multiple regression model have been violated.
Plot of residuals versus time in oven.
[Residuals versus OvenTime (response is Crispness)]
Plot of residuals versus oven temperature.
[Residuals versus Temp (response is Crispness)]
16.42 p/c/m
a. This histogram does not appear to reveal any radical departures from a symmetric distribution, although there are relatively few data points.
[Histogram of the residuals (response is Rating)]
b. In this Minitab test for normality, the points in the normal probability plot don't appear to deviate excessively from a straight line and the approximate p-value is shown as >0.15. There is nothing here to suggest that the residuals may not have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = -3.55271E-15, StDev = 1.988, N = 8, KS = 0.170, P-Value > 0.150]
c. Plots of residuals versus the independent variables. The plots below do not reveal any alarming problems.
Overall, there is no strong evidence to indicate that any underlying assumptions of the multiple regression model have been violated.
Plot of residuals versus price.
[Residuals versus Price (response is Rating)]
Plot of residuals versus performance.
[Residuals versus Perform (response is Rating)]
Plot of residuals versus battery life.
[Residuals versus BattLife (response is Rating)]
16.43 p/c/m
a. This histogram does not appear to reveal any radical departures from a symmetric distribution, although there are relatively few data points.
[Histogram of the residuals (response is CalcFin)]
b. In this Minitab test for normality, the points in the normal probability plot don't appear to deviate excessively from a straight line and the approximate p-value is shown as >0.15. There is nothing here to suggest that the residuals may not have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = 0, StDev = 3.488, N = 9, KS = 0.197, P-Value > 0.150]
c. Plots of residuals versus the independent variables. The plots below do not reveal any alarming problems. Overall, there is no strong evidence to indicate that any underlying assumptions of the multiple regression model have been violated.
Plot of residuals versus math proficiency test.
[Residuals versus MathPro (response is CalcFin)]
Plot of residuals versus SAT quantitative.
[Residuals versus SATQ (response is CalcFin)]
16.44 p/c/m The Minitab printout is shown below.
Regression Analysis: Time versus Years, Score

The regression equation is
Time = 104 - 0.288 Years - 0.679 Score

Predictor      Coef  SE Coef      T      P
Constant     103.85    16.69   6.22  0.000
Years       -0.2884   0.3216  -0.90  0.391
Score       -0.6792   0.2218  -3.06  0.012

S = 2.862   R-Sq = 53.9%   R-Sq(adj) = 44.7%

Analysis of Variance
Source           DF       SS      MS     F      P
Regression        2   95.757  47.879  5.84  0.021
Residual Error   10   81.935   8.194
Total            12  177.692

a. The multiple regression equation is Time = 103.85 - 0.2884*Years - 0.6792*Score. The partial regression coefficient for years on the job indicates that, for a given score on the aptitude test, the time it takes to perform the standard task decreases by 0.2884 seconds for each additional year on the job. The partial regression coefficient for the test score indicates that, given a set number of years on the job, a one-point increase in the test score will result in a 0.6792-second decrease in the amount of time required to perform the required task.
b. The appropriate number of degrees of freedom for this problem will be d.f. = 13 - 2 - 1, or 10, and the appropriate t-value for a 95% confidence interval is t = 2.228.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t s_b1 = -0.2884 ± 2.228(0.3216) = -0.2884 ± 0.7165 = (-1.0049, 0.4281)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t s_b2 = -0.6792 ± 2.228(0.2218) = -0.6792 ± 0.4942 = (-1.1734, -0.1850)
c. The coefficient of multiple determination is 0.539. This indicates that 53.9% of the variation in the time required to complete the task is explained by the regression equation. The partial regression coefficient for Years is significantly different from zero only at the 0.391 level. The partial regression coefficient for Score is significantly different from zero at the 0.012 level, and the overall regression equation is significant at the 0.021 level.
d. The residual analyses follow.
First, the histogram of the residuals is examined to see if it is symmetric about zero. Next, the normal probability plot is graphed and the p-value interpreted to examine whether the residuals could have come from a normal population. Finally, the residuals are plotted against each of the independent variables to check for cyclical or other patterns.
In the histogram, there does not appear to be any alarming deviation from a symmetric distribution.
[Histogram of the residuals (response is Time)]
In this Minitab test for normality, the points in the normal probability plot don't appear to deviate excessively from a straight line and the approximate p-value is shown as >0.15. There is nothing here to suggest that the residuals may not have come from a normally distributed population.
[Normal probability plot of RESI1: Mean = -1.31177E-14, StDev = 2.613, N = 13, KS = 0.136, P-Value > 0.150]
The plots of residuals versus the independent variables do not present any alarming patterns. Overall, the residual analysis does not suggest that any of the assumptions underlying multiple regression analysis have been violated.
Plot of residuals versus years on job.
[Residuals versus Years (response is Time)]
Plot of residuals versus test score.
[Residuals versus Score (response is Time)]
The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients. When generating this printout, we can also specify a normal probability plot and plots of the residuals against the independent variables. Their appearance would be essentially similar to those of Minitab.
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.7341
R Square              0.5389
Adjusted R Square     0.4467
Standard Error        2.8624
Observations              13

ANOVA
              df       SS      MS      F   Significance F
Regression     2   95.757  47.879  5.843   0.021
Residual      10   81.935   8.194
Total         12  177.692

            Coefficients  Standard Error   t Stat  P-value  Lower 95%  Upper 95%
Intercept       103.8529         16.6919   6.2217    0.000     66.661    141.045
Years            -0.2884          0.3216  -0.8966    0.391     -1.005      0.428
Score            -0.6792          0.2218  -3.0623    0.012     -1.173     -0.185

16.45 p/c/m The Minitab printout is shown below.

Regression Analysis: Distance versus Price, Sensitiv, Weight

The regression equation is
Distance = -0.562 + 0.000355 Price + 0.0112 Sensitiv - 0.0212 Weight

Predictor        Coef    SE Coef      T      P
Constant      -0.5617     0.8656  -0.65  0.545
Price       0.0003550  0.0005601   0.63  0.554
Sensitiv     0.011248   0.007605   1.48  0.199
Weight       -0.02116    0.02471  -0.86  0.431

S = 0.05167   R-Sq = 46.5%   R-Sq(adj) = 14.4%

Analysis of Variance
Source           DF        SS        MS     F      P
Regression        3  0.011590  0.003863  1.45  0.334
Residual Error    5  0.013349  0.002670
Total             8  0.024939

a. The estimated regression equation is: Distance = -0.5617 + 0.0003550*Price + 0.011248*Sensitiv - 0.02116*Weight. The partial regression coefficient for the price indicates that, holding the weight and sensitivity constant, a $1 increase in price will result in a 0.0003550-mile increase in the warning distance. The partial regression coefficient for the sensitivity indicates that, holding the price and weight constant, a one-unit increase in sensitivity will result in a 0.011248-mile increase in the warning distance. Finally, the partial regression coefficient for the weight indicates that, holding the price and sensitivity constant, a one-ounce increase in weight will result in a 0.02116-mile decrease in the warning distance.
b. The appropriate degrees of freedom for this problem will be d.f. = 9 - 3 - 1 = 5. The t-value for a 95% confidence interval with 5 degrees of freedom is t = 2.571.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t s_b1 = 0.000355 ± 2.571(0.0005601) = 0.000355 ± 0.001440 = (-0.0011, 0.0018)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t s_b2 = 0.011248 ± 2.571(0.007605) = 0.011248 ± 0.019552 = (-0.0083, 0.0308)
The 95% confidence interval for population partial regression coefficient β3 is:
b3 ± t s_b3 = -0.02116 ± 2.571(0.02471) = -0.02116 ± 0.06353 = (-0.0847, 0.0424)
c. The coefficient of multiple determination is 0.465. This indicates that 46.5% of the variation in the warning distance is explained by the regression equation. However, none of the partial regression coefficients is significant at the 0.10 level. (The coefficient for price is significant only at the 0.554 level; for sensitivity, at the 0.199 level; and for weight, at the 0.431 level.) The overall regression is only significant at the 0.334 level. The adjusted R-square is 0.144. Recall that this has been adjusted for the degrees of freedom. Thus, there are no significant relationships in this regression. Apparently the coefficient of multiple determination is as large as it is because of the limited size of the data set.
d. The residual analyses follow. First, the histogram of the residuals is examined to see if it is symmetric about zero. Next, the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns.
In the following histogram of residuals, there seems to be a slight deviation from a symmetric distribution, but the number of data values is relatively small.
[Histogram of the residuals (response is Distance)]
In this Minitab test for normality, the points in the normal probability plot appear to deviate excessively from a straight line and the approximate p-value is shown as 0.048. At the 0.05 level of significance, we would conclude that the residuals did not come from a normally distributed population.
[Normal probability plot of RESI1: Mean = 9.868649E-17, StDev = 0.04085, N = 9, KS = 0.276, P-Value = 0.048]
The plots of residuals versus the independent variables are shown below. Given the relatively small number of data points, none of the three plots shows any alarming patterns. In the third plot, most of the unusual pattern is due to the underlying data, with most of the weights clustered about the six-ounce level, while one of the detectors weighs only 3.8 ounces.
[Residuals versus Price (response is Distance)]
[Residuals versus Sensitiv (response is Distance)]
[Residuals versus Weight (response is Distance)]
Overall, the residual analysis suggests that the residuals may not have come from a normally distributed population. If this is true, then one of the underlying assumptions has been violated and the multiple regression analysis may not be valid.
The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
Distance  Price  Sensitivity  Weight
0.675     289    108          3.8
0.660     295    110          6.1
0.640     240    108          5.8
0.560     249    103          6.6
0.540     260    107          6.0
0.640     200    108          5.8
0.540     199    109          5.9
0.645     220    108          5.8
0.670     250    112          6.2

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.6817
R Square              0.4647
Adjusted R Square     0.1436
Standard Error        0.0517
Observations               9

ANOVA
              df      SS      MS       F   Significance F
Regression     3  0.0116  0.0039  1.4471   0.3342
Residual       5  0.0133  0.0027
Total          8  0.0249

              Coefficients  Standard Error   t Stat  P-value  Lower 95%  Upper 95%
Intercept         -0.56174          0.8656  -0.6490   0.5450    -2.7868     1.6633
Price              0.00035          0.0006   0.6338   0.5541    -0.0011     0.0018
Sensitivity        0.01125          0.0076   1.4790   0.1992    -0.0083     0.0308
Weight            -0.02116          0.0247  -0.8564   0.4309    -0.0847     0.0424

When generating the Excel printout, we can also specify a normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.
[Normal probability plot of Distance; residual plots versus Price, Sensitivity, and Weight]
16.46 d/p/e A dummy variable is a variable that takes on a value of one or zero to indicate the presence or absence of an attribute. Dummy variables can help explain some of the variation in y due to the presence or absence of a characteristic.
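The encoding itself is mechanical: the attribute is mapped to 1 when present and 0 when absent, and the resulting column enters the design matrix like any other predictor. A small Python sketch (the attribute values are invented purely for illustration):

```python
# Build a 0/1 dummy column from a categorical attribute.
# The attribute values below are invented for illustration.
settings = ["urban", "rural", "urban", "rural", "urban"]

# Dummy variable: 1 if the attribute is present, 0 otherwise
dummy = [1 if s == "urban" else 0 for s in settings]
print(dummy)  # -> [1, 0, 1, 0, 1]
```

The resulting 0/1 column can then be appended to the matrix of independent variables before fitting the regression.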
Three dummy variables that can be used to describe one town versus another are URBAN (1 if urban, 0 otherwise), MANUF (1 if durable goods manufacturing is the major industry, 0 otherwise), and POPMIL (1 if the population is 1 million or more, 0 otherwise). Other dummy variables could include the presence of a major university, a major medical center, a major research institution, and many more.
16.47 p/p/e The partial regression coefficient for x1 implies that, holding the day of the week constant, a one degree Fahrenheit increase in the temperature will result in an increase of 8 in attendance. The partial regression coefficient for x2 implies that the attendance increases by 150 people on Saturdays and Sundays (assuming a constant temperature).
16.48 p/p/m The estimate of 100 persons swimming on a zero-degree weekday is made well beyond the limits of the underlying temperature data. It is always dangerous to extrapolate beyond the bounds of the data used to estimate an equation.
16.49 d/p/m Multicollinearity is a situation in which two or more of the independent variables in a multiple regression are highly correlated with each other. When this happens, the two correlated x variables are really not saying different things about y. The standard errors for the partial regression coefficients become very large and the coefficients are statistically unreliable and difficult to interpret. Multicollinearity is a problem when we are trying to interpret the partial regression coefficients. There are several clues to the presence of multicollinearity: (1) an independent variable known to be an important predictor ends up having a partial regression coefficient that is not significant; (2) a partial regression coefficient exhibits the wrong sign; and/or (3) when an independent variable is added or deleted, the partial regression coefficients for the other variables change dramatically.
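These clues have a simple numerical footprint: two predictors that move almost in lockstep carry nearly the same information about y. A short sketch (synthetic data, assumed purely for illustration) shows how a quick pairwise-correlation check exposes the problem:

```python
import numpy as np

# Synthetic data, for illustration only: x2 is nearly a multiple of x1,
# so the two predictors carry almost the same information about y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2.0 * x1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])

# Pairwise correlation between the two independent variables
r = np.corrcoef(x1, x2)[0, 1]
print(r > 0.99)  # a correlation this close to 1 signals multicollinearity
```

Including both x1 and x2 in the same regression would inflate the standard errors of their partial regression coefficients, producing exactly the symptoms listed above.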
A more practical way to identify multicollinearity is through the examination of a correlation matrix, which shows the correlation of each variable with each of the other variables. A high correlation between two independent variables is an indication of multicollinearity.
16.50 p/c/m The Minitab printout is shown below.

Regression Analysis: Pounds versus Months, Session, Gender

The regression equation is
Pounds = 2.24 + 3.36 Months + 1.54 Session + 3.02 Gender

Predictor    Coef  SE Coef     T      P
Constant    2.243    8.876  0.25  0.807
Months      3.356    1.271  2.64  0.030
Session     1.538    6.791  0.23  0.826
Gender      3.018    6.671  0.45  0.663

S = 11.39   R-Sq = 48.5%   R-Sq(adj) = 29.1%

Analysis of Variance
Source           DF      SS     MS     F      P
Regression        3   975.0  325.0  2.51  0.133
Residual Error    8  1037.3  129.7
Total            11  2012.2

The partial regression coefficient for Months implies that, holding session and gender constant, an additional month at the weight-loss clinic results in an additional weight loss of 3.356 pounds. The partial regression coefficient for Session implies that persons attending the day sessions, holding months and gender constant, lose 1.538 more pounds than those attending the night sessions. The partial regression coefficient for Gender implies that, holding months and session constant, men lose 3.018 more pounds than women. Of course, the partial regression coefficients for Session and Gender have p-values of 0.826 and 0.663, respectively. This indicates that the true coefficients are likely not different from zero. Therefore, Months (p-value of 0.030) contributes the most to the explanatory power of this regression equation. The data and Excel multiple regression solution for this exercise are shown below.
Pounds Lost  Months  Session  Gender
31           5       1        1
49           8       1        1
12           3       1        0
26           9       0        0
34           8       0        1
11           2       0        0
 4           1       0        1
27           8       0        1
12           6       1        1
28           9       1        0
41           6       0        0
16           6       0        0

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.6961
R Square              0.4845
Adjusted R Square     0.2912
Standard Error       11.3867
Observations              12

ANOVA
              df         SS        MS       F   Significance F
Regression     3   974.9973  324.9991  2.5066   0.1329
Residual       8  1037.2527  129.6566
Total         11  2012.2500

           Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept        2.2435          8.8760  0.2528   0.8068   -18.2246    22.7115
Months           3.3561          1.2714  2.6396   0.0297     0.4241     6.2880
Session          1.5383          6.7911  0.2265   0.8265   -14.1220    17.1987
Gender           3.0176          6.6710  0.4523   0.6630   -12.3658    18.4010

16.51 p/c/m The Minitab printout is shown below.

Regression Analysis: Speed versus Occupnts, SeatBelt

The regression equation is
Speed = 67.6 - 3.21 Occupnts - 6.63 SeatBelt

Predictor    Coef  SE Coef      T      P
Constant   67.629    5.017  13.48  0.000
Occupnts   -3.214    2.191  -1.47  0.170
SeatBelt   -6.629    3.200  -2.07  0.063

S = 5.465   R-Sq = 31.5%   R-Sq(adj) = 19.1%

Analysis of Variance
Source           DF      SS     MS     F      P
Regression        2  151.20  75.60  2.53  0.125
Residual Error   11  328.51  29.86
Total            13  479.71

The partial regression coefficient for Occupnts implies that, holding seat belt usage constant, the speed decreases by 3.214 miles per hour for each additional occupant in the car. The partial regression coefficient for SeatBelt implies that, for a given number of occupants, drivers who wear seat belts travel 6.629 miles per hour slower than those who do not. The p-value for Occupnts is 0.170; this implies that the partial regression coefficient for this variable is not significantly different from zero. The p-value for SeatBelt is 0.063; this implies that the partial regression coefficient for this variable is significantly different from zero at the 0.063 level.
It appears that seat belt usage provides a much stronger explanation for the variation in speeds driven by various drivers than does the number of occupants in the car.
The Excel multiple regression solution for this exercise is shown below.

Regression Statistics
Multiple R            0.5614
R Square              0.3152
Adjusted R Square     0.1907
Standard Error        5.4649
Observations              14

ANOVA
              df        SS       MS       F   Significance F
Regression     2  151.2000  75.6000  2.5314   0.1246
Residual      11  328.5143  29.8649
Total         13  479.7143

           Coefficients  Standard Error   t Stat  P-value  Lower 95%  Upper 95%
Intercept       67.6286          5.0172  13.4795   0.0000    56.5859    78.6713
Occupnts        -3.2143          2.1908  -1.4672   0.1703    -8.0363     1.6077
SeatBelt        -6.6286          3.1999  -2.0715   0.0626   -13.6715     0.4144

CHAPTER EXERCISES
16.52 p/c/m The Minitab printout is shown below.

Regression Analysis: Tip versus Check, Diners

The regression equation is
Tip = -1.92 + 0.223 Check - 0.184 Diners

Predictor     Coef  SE Coef      T      P
Constant    -1.915    1.598  -1.20  0.284
Check      0.22275  0.04608   4.83  0.005
Diners     -0.1845   0.4133  -0.45  0.674

S = 1.524   R-Sq = 83.4%   R-Sq(adj) = 76.8%

Analysis of Variance
Source           DF      SS      MS      F      P
Regression        2  58.389  29.194  12.57  0.011
Residual Error    5  11.611   2.322
Total             7  70.000

Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI          95.0% PI
1        6.441   0.800  (4.385, 8.498)  (2.017, 10.866)

Values of Predictors for New Observations
New Obs  Check  Diners
1         40.0    3.00

a. The regression equation is: Tip = -1.915 + 0.22275*Check - 0.1845*Diners
The partial regression coefficient for the check indicates that, holding the number of diners constant, a $1 increase in the check will result in a $0.22275 increase in the tip. The partial regression coefficient for the number of diners indicates that, holding the size of the check constant, an additional diner will result in a tip that is $0.1845 smaller.
b. The estimated tip amount for three diners who have a $40 check is $6.441.
c.
The 95% prediction interval for the tip left by a dining party like the one in part b is $2.017 to $10.866.
d. The 95% confidence interval for the mean tip left by all dining parties like the one in part b is $4.385 to $8.498.
e. The appropriate value for d.f. will be d.f. = 8 - 2 - 1 = 5. The t-value for a 95% confidence interval with 5 degrees of freedom is t = 2.571.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 0.22275 ± 2.571(0.04608) = 0.22275 ± 0.11847 = (0.1043, 0.3412)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = -0.1845 ± 2.571(0.4133) = -0.1845 ± 1.0626 = (-1.2471, 0.8781)
f. The significance tests for the partial regression coefficients show that the partial regression coefficient for the size of the check is significant at the 0.005 level, while the partial regression coefficient for the number of diners is significant only at the 0.674 level. Thus, the size of the check is much more useful in predicting the size of the tip than the number of diners. The overall regression is significant at the 0.011 level. The coefficient of multiple determination indicates that 83.4% of the variation in the size of the tip is explained by the regression.
g. The residual analyses follow. First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns. In the following histogram of the residuals, there are a lot of values in the category with 1.0 as the midpoint. This is some cause for concern, even though there are relatively few observations in the data set.
[Figure: Histogram of the Residuals (response is Tip)]

In this Minitab test for normality, the points in the normal probability plot seem to deviate excessively from a straight line and the approximate p-value is shown as 0.040. At the 0.05 level of significance, we would conclude that the residuals did not come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 8, KS = 0.300, p-value = 0.040)]

The plots for residuals versus the independent variables are shown below. No alarming patterns seem to be present in the two charts that follow. However, overall, the residual analysis provides some evidence to suggest that the residuals may have come from a non-normally distributed population.

Residuals versus size of check.
[Figure: Residuals Versus Check (response is Tip)]

Residuals versus number of diners.
[Figure: Residuals Versus Diners (response is Tip)]

The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
Data (Tip, Check, Diners):
Tip  Check  Diners
7.5     40       2
0.5     15       1
2.0     30       3
3.5     25       4
9.5     50       4
2.5     20       5
3.5     35       5
1.0     10       2

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.9133
R Square           0.8341
Adjusted R Square  0.7678
Standard Error     1.5239
Observations            8

ANOVA
            df  SS       MS       F        Significance F
Regression   2  58.3886  29.1943  12.5714  0.0112
Residual     5  11.6114   2.3223
Total        7  70.0000

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept  -1.9154       1.5980          -1.1986  0.2844   -6.0231    2.1923
Check       0.2228       0.0461           4.8337  0.0047    0.1043    0.3412
Diners     -0.1845       0.4133          -0.4463  0.6741   -1.2469    0.8780

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: Check Residual Plot]
[Figure: Diners Residual Plot]

16.53 p/c/m The Minitab printout is shown below.

Regression Analysis: AllFruit versus Apples, Grapes
The regression equation is AllFruit = 99.9 + 1.24 Apples + 0.822 Grapes
Predictor     Coef   SE Coef      T      P
Constant    99.865     2.952  33.83  0.000
Apples     1.23640   0.09971  12.40  0.001
Grapes      0.8221    0.2307   3.56  0.038
S = 0.269451   R-Sq = 98.1%   R-Sq(adj) = 96.9%
Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2  11.3555  5.6778  78.20  0.003
Residual Error   3   0.2178  0.0726
Total            5  11.5733
Predicted Values for New Observations
New Obs      Fit  SE Fit              95% CI              95% PI
1        125.816   0.433  (124.439, 127.193)  (124.194, 127.438) XX
XX denotes a point that is an extreme outlier in the predictors.
Values of Predictors for New Observations
New Obs  Apples  Grapes
1          17.0    6.00

a. The regression equation is AllFruit = 99.865 + 1.2364*Apples + 0.8221*Grapes.
The partial regression coefficient for apples implies that, holding the consumption of grapes constant, a one pound increase in the consumption of apples will result in a 1.2364 pound increase in the consumption of all fresh fruits. The partial regression coefficient for grapes implies that, holding apple consumption constant, a one pound increase in the consumption of grapes will result in a 0.8221 pound increase in the consumption of all fresh fruits.
b. The estimated per capita consumption of all fresh fruits during a year when 17 pounds of apples and 6 pounds of grapes are consumed is 125.816 pounds.
c. The 95% prediction interval for per capita consumption during a year like the one in part b is 124.194 to 127.438 pounds.
d. The 95% confidence interval for mean per capita consumption during all years like the one in part b is 124.439 to 127.193 pounds.
e. For this problem, the appropriate d.f. = 6 - 2 - 1 = 3. The t-value for a 95% confidence interval with 3 degrees of freedom is 3.182.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 1.2364 ± 3.182(0.09971) = 1.2364 ± 0.3173 = (0.92, 1.55)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = 0.8221 ± 3.182(0.2307) = 0.8221 ± 0.7341 = (0.09, 1.56)
f. Both of the partial regression coefficients are significant (apples p-value, 0.001; grapes p-value, 0.038). Also, the overall regression is highly significant; p-value = 0.003. The coefficient of multiple determination is 0.981. This regression appears to do a very good job of explaining the variation in fresh fruit consumption.
g. The residual analyses follow. First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns.
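The point estimate in part b and the interval for β1 in part e can be verified with a few lines of arithmetic. A minimal Python sketch (coefficients come from the fitted equation above; the t value, 3.182 for df = 3, is read from a t table):

```python
# Point estimate for part (b): per capita fruit consumption at
# Apples = 17, Grapes = 6, using the fitted regression coefficients.
b0, b1, b2 = 99.865, 1.2364, 0.8221
y_hat = b0 + b1 * 17 + b2 * 6

# 95% confidence interval for beta_1 (part e): b1 +/- t * s_b1
t_crit, s_b1 = 3.182, 0.09971        # t table value for df = 3
lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1

print(round(y_hat, 3), round(lower, 2), round(upper, 2))
```

The output reproduces the printout's fit of 125.816 and the interval (0.92, 1.55).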
The histogram of the residuals is shown below. This histogram offers no reason to believe that the residuals may not have come from a normally distributed population.

[Figure: Histogram of the Residuals (response is AllFruit)]

In this Minitab test for normality, the points in the normal probability plot do not deviate excessively from a straight line and the approximate p-value is shown as > 0.150. Our conclusion is that the residuals could have come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 6, KS = 0.125, p-value > 0.150)]

The plots for residuals versus the independent variables are shown below. For this small data set, no alarming patterns seem to be present. Overall, the residual analysis provides no evidence to suggest that the assumptions for the multiple regression model have not been satisfied.

Residuals versus per-capita apple consumption.
[Figure: Residuals Versus Apples (response is AllFruit)]

Residuals versus per-capita grape consumption.
[Figure: Residuals Versus Grapes (response is AllFruit)]

The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.9905
R Square           0.9812
Adjusted R Square  0.9686
Standard Error     0.2695
Observations            6

ANOVA
            df  SS       MS      F        Significance F
Regression   2  11.3555  5.6778  78.2017  0.0026
Residual     3   0.2178  0.0726
Total        5  11.5733

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept  99.8648       2.9516          33.8343  0.0001   90.4715    109.2581
Apples      1.2364       0.0997          12.4001  0.0011    0.9191      1.5537
Grapes      0.8221       0.2307           3.5632  0.0377    0.0878      1.5563

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: Apples Residual Plot]
[Figure: Grapes Residual Plot]

16.54 p/c/m The Minitab printout is shown below.

Regression Analysis: Salary versus GPA, Activities
The regression equation is Salary = 24.3 + 3.84 GPA + 1.68 Activities
Predictor     Coef  SE Coef     T      P
Constant    24.309    3.192  7.62  0.000
GPA          3.842    1.234  3.11  0.017
Activiti    1.6810   0.5291  3.18  0.016
S = 1.448   R-Sq = 82.4%   R-Sq(adj) = 77.4%
Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  68.924  34.462  16.44  0.002
Residual Error   7  14.676   2.097
Total            9  83.600
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI            95.0% PI
1        43.182   1.131  (40.506, 45.858)  (38.834, 47.530)
Values of Predictors for New Observations
New Obs   GPA  Activiti
1        3.60      3.00

a. The regression equation is: Salary = 24.309 + 3.842*GPA + 1.6810*Activities.
The partial regression coefficient for the GPA indicates that, holding the number of activities constant, a one point increase in GPA will result in a starting salary that is $3842 higher. The partial regression coefficient for the number of activities indicates that, holding the GPA constant, an additional activity will result in a starting salary that is $1681 higher.
b. The estimated starting salary for Dave (3.6 grade point average and 3 activities) is $43,182.
c. The 95% prediction interval for the starting salary for Dave is between $38,834 and $47,530.
d. The 95% confidence interval for the mean starting salary for all persons like Dave (i.e., 3.6 GPA and 3 activities) is between $40,506 and $45,858.
e. For this problem, the appropriate d.f. = 10 - 2 - 1 = 7. The t-value for a 95% confidence interval with 7 degrees of freedom is 2.365.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 3.842 ± 2.365(1.234) = 3.842 ± 2.918 = (0.924, 6.760)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = 1.6810 ± 2.365(0.5291) = 1.6810 ± 1.2513 = (0.4297, 2.9323)
f. The partial regression coefficients for grade point average and activities are both significant at the 0.05 level (GPA p-value, 0.017; Activities p-value, 0.016). The overall regression is significant at the 0.002 level. The coefficient of multiple determination indicates that 82.4% of the variation in starting salaries is explained by the regression.
g. The residual analyses follow. First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns.
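The two intervals in part e follow the same b ± t·s_b pattern. A small Python check (the coefficient and standard-error values come from the printout above, and t = 2.365 for df = 7 is a t table value):

```python
# 95% confidence intervals for the two partial regression coefficients
# in the salary regression: b +/- t * s_b, with df = n - k - 1 = 7.
t_crit = 2.365                       # t table value, 95%, df = 7

coefs = {"GPA": (3.842, 1.234), "Activities": (1.6810, 0.5291)}
intervals = {}
for name, (b, s_b) in coefs.items():
    margin = t_crit * s_b
    intervals[name] = (b - margin, b + margin)
    print(name, round(b - margin, 4), round(b + margin, 4))
```

Neither interval includes zero, consistent with the 0.017 and 0.016 p-values in the printout.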
Shown below, the histogram is fairly symmetric and there is no evidence to suggest that the residuals may not have come from a normal distribution.

[Figure: Histogram of the Residuals (response is Salary)]

In this Minitab test for normality, the points in the normal probability plot do not deviate excessively from a straight line and the approximate p-value is shown as > 0.150. Our conclusion is that the residuals could have come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 10, KS = 0.125, p-value > 0.150)]

The plots for residuals versus the independent variables are shown below. For this small data set, no alarming patterns seem to be present. Overall, the residual analysis provides no evidence to suggest that the assumptions for the multiple regression model have not been satisfied.

Residuals versus grade point average.
[Figure: Residuals Versus GPA (response is Salary)]

Residuals versus number of activities.
[Figure: Residuals Versus Activities (response is Salary)]

The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
Data (Salary, GPA, Activities):
Salary  GPA  Activities
40      3.2  2
46      3.6  5
38      2.8  3
39      2.4  4
37      2.5  2
38      2.1  3
42      2.7  3
37      2.6  2
44      3.0  4
41      2.9  3

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.9080
R Square           0.8245
Adjusted R Square  0.7743
Standard Error     1.4479
Observations           10

ANOVA
            df  SS       MS       F        Significance F
Regression   2  68.9244  34.4622  16.4379  0.0023
Residual     7  14.6756   2.0965
Total        9  83.6000

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   24.3092       3.1919          7.6159  0.0001   16.7615    31.8569
GPA          3.8416       1.2342          3.1127  0.0170    0.9233     6.7600
Activities   1.6810       0.5291          3.1768  0.0156    0.4298     2.9322

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: GPA Residual Plot]
[Figure: Activities Residual Plot]

16.55 p/c/m The Minitab printout is shown below.

Regression Analysis: FrGPA versus SAT, HSRank
The regression equation is FrGPA = - 1.98 + 0.00372 SAT + 0.00658 HSRank
Predictor       Coef   SE Coef      T      P
Constant      -1.984     1.532  -1.30  0.218
SAT         0.003719  0.001562   2.38  0.033
HSRank      0.006585  0.008023   0.82  0.427
S = 0.4651   R-Sq = 45.2%   R-Sq(adj) = 36.8%
Analysis of Variance
Source          DF      SS      MS     F      P
Regression       2  2.3244  1.1622  5.37  0.020
Residual Error  13  2.8125  0.2163
Total           15  5.1370
Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI          95.0% PI
1        2.634   0.125  (2.365, 2.904)  (1.594, 3.674)
Values of Predictors for New Observations
New Obs   SAT  HSRank
1        1100    80.0

a. The regression equation is: FrGPA = -1.984 + 0.003719*SAT + 0.006585*HSRank.
The partial regression coefficient for the SAT score indicates that, holding the rank constant, a 1 point increase in the SAT score will result in a 0.003719 point increase in the freshman GPA. The coefficient for the high school rank indicates that, holding the SAT score constant, a 1 point increase in the high school rank will result in a 0.006585 point increase in freshman GPA.
b. The estimated freshman GPA for a student who scored 1100 on the SAT and had a class rank of 80% is 2.634.
c. The 95% prediction interval for the GPA for a student like the one in part b is between 1.594 and 3.674.
d. The 95% confidence interval for the mean GPA for all students like the one in part b is 2.365 to 2.904.
e. For this problem, the appropriate d.f. = 16 - 2 - 1 = 13. The t-value for a 95% interval with 13 degrees of freedom is 2.160.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 0.003719 ± 2.160(0.001562) = 0.003719 ± 0.003374 = (0.000345, 0.007093)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = 0.006585 ± 2.160(0.008023) = 0.006585 ± 0.017330 = (-0.010745, 0.023915)
f. The partial regression coefficient for the SAT score is significantly different from zero at the 0.033 level. The partial regression coefficient for the high school rank is not significantly different from zero (p-value = 0.427). The overall regression is significant at the 0.020 level.
g. The residual analyses follow. First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns. The histogram below seems to be fairly symmetric about zero.
[Figure: Histogram of the Residuals (response is FrGPA)]

In this Minitab test for normality, the points in the normal probability plot do not deviate excessively from a straight line and the approximate p-value is shown as > 0.150. Our conclusion is that the residuals could have come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 16, KS = 0.167, p-value > 0.150)]

The plots for residuals versus the independent variables are shown below. Neither of the plots reveals any alarming patterns that suggest the underlying assumptions of the multiple regression analysis may have been violated. Overall, the residual analysis does not reveal anything to suggest that the assumptions underlying the multiple regression analysis have been violated.

Residuals versus SAT score.
[Figure: Residuals Versus SAT (response is FrGPA)]

Residuals versus high school rank.
[Figure: Residuals Versus HSRank (response is FrGPA)]

The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.6727
R Square           0.4525
Adjusted R Square  0.3683
Standard Error     0.4651
Observations           16

ANOVA
            df  SS      MS      F       Significance F
Regression   2  2.3244  1.1622  5.3719  0.0199
Residual    13  2.8125  0.2163
Total       15  5.1370

           Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%
Intercept  -1.983878     1.53190         -1.29505  0.21783  -5.29334   1.32558
SAT         0.003719     0.00156          2.38050  0.03328   0.00034   0.00709
HS Rank     0.006585     0.00802          0.82074  0.42659  -0.01075   0.02392

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: SAT Residual Plot]
[Figure: HS Rank Residual Plot]

16.56 p/c/m The Minitab printout is shown below.

Regression Analysis: Price versus Acres, SqFeet, CentralAir
The regression equation is Price = 36045 + 15663 Acres + 10.9 SqFeet + 4181 CentralAir
Predictor     Coef  SE Coef     T      P
Constant     36045    14539  2.48  0.025
Acres        15663     5716  2.74  0.015
SqFeet      10.875    4.959  2.19  0.043
CentralA      4181     5652  0.74  0.470
S = 12321   R-Sq = 41.8%   R-Sq(adj) = 30.9%
Analysis of Variance
Source          DF          SS         MS     F      P
Regression       3  1745571591  581857197  3.83  0.030
Residual Error  16  2428953909  151809619
Total           19  4174525500
Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI           95.0% PI
1        73898    5236  (62799, 84998)  (45518, 102278)
Values of Predictors for New Observations
New Obs  Acres  SqFeet  CentralA
1        0.900    1800      1.00

a. The regression equation is: Price = 36045 + 15663*Acres + 10.875*SqFeet + 4181*CentralAir.
The partial regression coefficient for the lot size indicates that, all other variables held constant, an additional acre of land will add $15,663 to the selling price. The partial regression coefficient for the size of the living area indicates that, all other variables held constant, an additional square foot of living area will add $10.875 to the selling price. Finally, the partial regression coefficient for the presence of central air conditioning indicates that, all other variables held constant, the presence of central air will increase the selling price by $4181.
b. The estimated selling price for a house sitting on a 0.9 acre lot with 1800 square feet of living area and central air conditioning is $73,898.
c. The 95% prediction interval for the selling price of the house described in part b is between $45,518 and $102,278.
d. The 95% confidence interval for the mean selling price of all houses like the one in part b is between $62,799 and $84,998.
e. For this problem, the appropriate d.f. = 20 - 3 - 1 = 16. The t-value for a 95% interval with 16 degrees of freedom is 2.120.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 15,663 ± 2.120(5716) = 15,663 ± 12,117.92 = (3545.08, 27,780.92)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = 10.875 ± 2.120(4.959) = 10.875 ± 10.513 = (0.362, 21.388)
The 95% confidence interval for population partial regression coefficient β3 is:
b3 ± t·s_b3 = 4181 ± 2.120(5652) = 4181 ± 11,982.24 = (-7801.24, 16,163.24)
f. The partial regression coefficient for Acres is significantly different from zero at the 0.015 level, and the coefficient for SqFeet significantly differs from zero at the 0.043 level. However, the coefficient for CentralAir does not differ from zero significantly (p-value = 0.470). The overall regression is significant at the 0.030 level.
g. The residual analyses follow.
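Before turning to the residual plots, the three intervals in part e can be verified numerically. A brief Python sketch (coefficients and standard errors are taken from the printout; t = 2.120 for df = 16 is a t table value):

```python
# 95% confidence intervals for the three partial regression coefficients
# in the house-price regression: b +/- t * s_b, df = 20 - 3 - 1 = 16.
t_crit = 2.120                       # t table value, 95%, df = 16

coefs = {"Acres": (15663, 5716), "SqFeet": (10.875, 4.959),
         "CentralAir": (4181, 5652)}
intervals = {name: (b - t_crit * s, b + t_crit * s)
             for name, (b, s) in coefs.items()}

# The CentralAir interval straddles zero, matching its 0.470 p-value.
lo, hi = intervals["CentralAir"]
print(lo < 0 < hi)                   # True
```

The interval containing zero for CentralAir is the interval-estimate counterpart of the nonsignificant t test in part f.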
First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns. The histogram below appears to be relatively symmetric.

[Figure: Histogram of the Residuals (response is Price)]

In this Minitab test for normality, the points in the normal probability plot do not deviate excessively from a straight line and the approximate p-value is shown as > 0.150. Our conclusion is that the residuals could have come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 20, KS = 0.124, p-value > 0.150)]

The plots for residuals versus the independent variables are shown below. None of them reveals any alarming patterns that suggest the underlying assumptions of the multiple regression analysis may have been violated. Overall, the residual analysis does not reveal anything to suggest that the assumptions underlying the multiple regression analysis have been violated.

Residuals versus lot size.
[Figure: Residuals Versus Acres (response is Price)]

Residuals versus living area.
[Figure: Residuals Versus SqFeet (response is Price)]

Residuals versus central air conditioning.
[Figure: Residuals Versus CentralAir (response is Price)]

The Excel multiple regression solution for the data in this exercise is shown below. Note that Excel already provides 95% confidence intervals for the population regression coefficients.
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.6466
R Square           0.4181
Adjusted R Square  0.3091
Standard Error     12321.1
Observations            20

ANOVA
            df  SS          MS        F       Significance F
Regression   3  1745571591  5.82E+08  3.8328  0.0304
Residual    16  2428953909  1.52E+08
Total       19  4174525500

            Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   36045.0       14539.258       2.479   0.025     5223.179  66866.864
Acres       15662.9        5715.898       2.740   0.015     3545.743  27780.064
SqFeet       10.875           4.959       2.193   0.043        0.362     21.388
CentralAir   4181.1        5652.117       0.740   0.470    -7800.861  16163.040

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: Acres Residual Plot]
[Figure: SqFeet Residual Plot]
[Figure: CentralAir Residual Plot]

16.57 p/c/m The estimated selling price of a house occupying a 0.1 acre lot with 100 square feet of living area and no central air conditioning is $38,699. This selling price does not seem reasonable. The problem with this estimate arises because the regression equation has been extrapolated far beyond the limits of the underlying data used to estimate it.

16.58 p/c/m The Minitab printout is shown below.
Regression Analysis: Time versus Age, Gender
The regression equation is Time = 69.5 + 0.110 Age - 12.2 Gender
Predictor     Coef  SE Coef      T      P
Constant     69.49    11.48   6.06  0.000
Age         0.1101   0.2257   0.49  0.635
Gender     -12.186    5.312  -2.29  0.042
S = 8.720   R-Sq = 43.7%   R-Sq(adj) = 33.5%
Analysis of Variance
Source          DF       SS      MS     F      P
Regression       2   649.25  324.63  4.27  0.042
Residual Error  11   836.46   76.04
Total           13  1485.71
Predicted Values for New Observations
New Obs    Fit  SE Fit        95.0% CI          95.0% PI
1        74.45    3.40  (66.96, 81.93)  (53.85, 95.05)
Values of Predictors for New Observations
New Obs   Age    Gender
1        45.0  0.000000

a. The regression equation is: Time = 69.49 + 0.1101*Age - 12.186*Gender. The partial regression coefficient for Age indicates that, holding the gender constant, an increase of one year in age will result in an increase of 0.1101 seconds to complete the transaction. The partial regression coefficient for Gender indicates that, holding the age constant, a male takes 12.186 seconds less to complete his transaction than a female.
b. The estimated time required to complete a transaction by a female customer who is 45 years of age is 74.45 seconds.
c. The 95% prediction interval for the time required by the customer described in part b is 53.85 to 95.05 seconds.
d. The 95% confidence interval for the mean time required by all customers like the one in part b is 66.96 to 81.93 seconds.
e. For this problem, the appropriate d.f. = 14 - 2 - 1 = 11. The t-value for a 95% interval with 11 degrees of freedom is 2.201.
The 95% confidence interval for population partial regression coefficient β1 is:
b1 ± t·s_b1 = 0.1101 ± 2.201(0.2257) = 0.1101 ± 0.4968 = (-0.3867, 0.6069)
The 95% confidence interval for population partial regression coefficient β2 is:
b2 ± t·s_b2 = -12.186 ± 2.201(5.312) = -12.186 ± 11.692 = (-23.878, -0.494)
f. The partial regression coefficient for age does not differ from zero significantly (p-value = 0.635).
However, the coefficient for gender differs significantly from zero at the 0.042 level. The overall regression is significant at the 0.042 level.
g. The residual analyses follow. First the histogram of the residuals is examined to see if it is symmetric about zero. Next the normal probability plot is graphed to examine whether the residuals could have come from a normally distributed population. Finally, the residuals are plotted against each of the independent variables to check for cyclical patterns. The histogram shown below does not seem very symmetrical, but the small number of observations could lead to erroneous conclusions.

[Figure: Histogram of the Residuals (response is Time)]

In this Minitab test for normality, the points in the normal probability plot do not deviate excessively from a straight line and the approximate p-value is shown as > 0.150. Our conclusion is that the residuals could have come from a normally distributed population.

[Figure: Normal probability plot of RESI1 (N = 14, KS = 0.137, p-value > 0.150)]

The plots for residuals versus the independent variables are shown below. Although the first plot seems to show more positive residuals for persons in the 40-50 age range, neither of the plots reveals any alarming patterns that suggest the underlying assumptions of the multiple regression analysis may have been violated. Overall, the residual analysis does not reveal anything to suggest that the assumptions underlying the multiple regression analysis have been violated.

Residuals versus age of customer.
[Figure: Residuals Versus Age (response is Time)]

Residuals versus gender of customer.
[Figure: Residuals Versus Gender (response is Time)]

The Excel multiple regression solution for the data in this exercise is shown below.
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.6611
R Square           0.4370
Adjusted R Square  0.3346
Standard Error     8.7202
Observations           14

ANOVA
            df  SS        MS        F       Significance F
Regression   2  649.2501  324.6251  4.2690  0.0424
Residual    11  836.4641   76.0422
Total       13  1485.7143

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   69.4926      11.4769          6.0550  0.0001    44.2322   94.7530
Age          0.1101       0.2257          0.4880  0.6351    -0.3866    0.6068
Gender     -12.1858       5.3116         -2.2942  0.0425   -23.8765   -0.4951

Excel has generated the optional normal probability plot and plots of the residuals against the independent variables. Their appearance is essentially similar to those of Minitab.

[Figure: Normal Probability Plot]
[Figure: Age Residual Plot]
[Figure: Gender Residual Plot]

INTEGRATED CASES

THORNDIKE SPORTS EQUIPMENT

Ted uses Minitab to generate the printout shown below.

Regression Analysis: Skiers versus Weekend, SnowInch, Temperat
The regression equation is Skiers = 560 + 147 Weekend + 1.42 SnowInch - 1.60 Temperat
Predictor     Coef  SE Coef      T      P
Constant    559.87    76.78   7.29  0.000
Weekend     147.35    51.86   2.84  0.009
SnowInch     1.424    2.696   0.53  0.602
Temperat    -1.604    2.771  -0.58  0.568
S = 125.061   R-Sq = 25.4%   R-Sq(adj) = 16.8%
Analysis of Variance
Source          DF      SS     MS     F      P
Regression       3  138705  46235  2.96  0.051
Residual Error  26  406650  15640
Total           29  545354

Examining the printout, Ted sees that the coefficient for Weekend is significantly different from zero at the 0.009 level, and the overall regression is significant at the 0.051 level. Overall, only 25.4% of the variation in daily ski patronage is explained by these independent variables.
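The R-Sq and R-Sq(adj) figures Ted is reading can be recovered from the ANOVA sums of squares in his printout. A minimal Python sketch (variable names are our own):

```python
# Reproduce R-Sq and R-Sq(adj) for the ski-patronage regression.
ssr, sse, sst = 138705, 406650, 545354   # from the ANOVA table
n, k = 30, 3                             # 30 days, 3 predictors

r_sq = ssr / sst
adj_r_sq = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(round(100 * r_sq, 1), round(100 * adj_r_sq, 1))   # 25.4 16.8
```

The adjusted value is lower because it penalizes the two predictors (SnowInch and Temperat) that contribute little explanatory power.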
Perhaps some of the remaining variation could be at least partially explained by other variables (e.g., live music or entertainment, conference attendance, or the type of group staging a conference) that Ted has not included in his analysis.

SPRINGDALE SHOPPING SURVEY

Caution: These exercises include the recoding of two of the variables. If you save the revised data file, do so using a different filename.

If you are using Minitab, recode as follows:
1. Click Data. Select Code. Click Numeric to Numeric.
2. Enter C26 C28 into the Code data from columns box. Enter C26 C28 into the Into columns box. Enter 2 into the Original values box. Enter 0 into the New box. Click OK.

If you are using Excel, recode as follows:
1. Click and drag to select cells Z1:Z151. (This highlights the variable name, RESPGEND, and the 150 data values below it.) Click Edit. Click Replace.
2. Enter 2 into the Find what box. Enter 0 into the Replace with box. Click Replace All.
3. Repeat steps 1 and 2 for cells AB1:AB151, which contain the variable name, RESPMARI, and the 150 data values below it.

1a through 1e, with dependent variable 7, Attitude toward Springdale Mall.

Regression Analysis: SPRILIKE versus IMPVARIE, IMPHELP, ...

The regression equation is
SPRILIKE = 2.90 + 0.188 IMPVARIE + 0.0043 IMPHELP + 0.034 RESPGEND + 0.191 RESPMARI

Predictor      Coef  SE Coef     T      P
Constant     2.9009   0.2963  9.79  0.000
IMPVARIE    0.18839  0.05383  3.50  0.001
IMPHELP     0.00432  0.04203  0.10  0.918
RESPGEND     0.0341   0.1306  0.26  0.794
RESPMARI     0.1909   0.1232  1.55  0.123

S = 0.738875   R-Sq = 11.9%   R-Sq(adj) = 9.5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       4  10.7125  2.6781  4.91  0.001
Residual Error 145  79.1608  0.5459
Total          149  89.8733

1a. The partial regression coefficient for RESPGEND is 0.034. With the variables coded so that 1 = male and 0 = female, this implies that males tend to have an attitude toward Springdale Mall that is 0.034 points higher than the attitude displayed by females toward this shopping area.
However, the p-value for the test of this partial regression coefficient is 0.794, which is not less than α = 0.05, so the coefficient does not differ significantly from zero at the 0.05 level of significance. The partial regression coefficient for IMPVARIE (test p-value = 0.001) is the only one that differs significantly from zero at the 0.05 level of significance.

1b. The p-value for the strength of the overall relationship is 0.001. This is less than the 0.05 level specified, so the overall regression equation is significant at the 0.05 level.

1c. The percentage of the variation in y that is explained by the regression equation is 11.9% (unadjusted). In the ANOVA portion of the printout, this is the Regression sum of squares (10.7125) divided by the Total sum of squares (89.8733).

1d. Plotting the residuals versus each of the independent variables: in each plot, the residuals seem to be unrelated to the independent variable, thus supporting the validity of the model.

[Residuals Versus IMPVARIE (response is SPRILIKE)]

[Residuals Versus IMPHELP (response is SPRILIKE)]

[Residuals Versus RESPGEND (response is SPRILIKE)]

[Residuals Versus RESPMARI (response is SPRILIKE)]

1e. In this Minitab test for normality, the points in the normal probability plot appear to deviate excessively from a straight line, and the approximate p-value is shown as < 0.01. At the 0.05 level of significance, we would conclude that the residuals could not have come from a normally distributed population. For this regression analysis, it appears that the assumption of normality of residuals may have been violated.

[Probability Plot of RESI1, Normal: Mean = -3.12639E-15, StDev = 0.7289, N = 150, KS = 0.129, P-Value < 0.010]

2.
Repeating 1a through 1e, with dependent variable 8, Attitude toward Downtown.

Regression Analysis: DOWNLIKE versus IMPVARIE, IMPHELP, ...

The regression equation is
DOWNLIKE = 3.72 + 0.0251 IMPVARIE - 0.0671 IMPHELP + 0.015 RESPGEND - 0.006 RESPMARI

Predictor       Coef  SE Coef      T      P
Constant      3.7211   0.3796   9.80  0.000
IMPVARIE     0.02512  0.06896   0.36  0.716
IMPHELP     -0.06710  0.05384  -1.25  0.215
RESPGEND      0.0148   0.1673   0.09  0.929
RESPMARI     -0.0057   0.1578  -0.04  0.971

S = 0.946571   R-Sq = 1.2%   R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF        SS      MS     F      P
Regression       4    1.5205  0.3801  0.42  0.791
Residual Error 145  129.9195  0.8960
Total          149  131.4400

2a. The partial regression coefficient for RESPGEND is 0.015. With the variables coded so that 1 = male and 0 = female, this implies that males tend to have an attitude toward Downtown that is 0.015 points higher than the attitude displayed by females toward this shopping area. However, the p-value for the test of this partial regression coefficient is 0.929, which is not less than α = 0.05, so the coefficient does not differ significantly from zero at the 0.05 level of significance. In this regression, none of the partial regression coefficients is significantly different from zero at the 0.05 level of significance.

2b. The p-value for the strength of the overall relationship is 0.791. This is not less than the 0.05 level specified, so the overall regression equation is not significant at the 0.05 level.

2c. The percentage of the variation in y that is explained by the regression equation is only 1.2% (unadjusted). In the ANOVA portion of the printout, this is the Regression sum of squares (1.5205) divided by the Total sum of squares (131.4400).

2d. Plotting the residuals versus each of the independent variables: in each plot, the residuals seem to be unrelated to the independent variable, thus supporting the validity of the model.
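A point worth noting about these residual-versus-predictor plots: least-squares residuals are, by construction, exactly orthogonal to every predictor included in the model, so any straight-line trend has already been removed and the plots can only reveal curvature or unequal spread. The small sketch below (synthetic stand-in data, hypothetical values) demonstrates the orthogonality numerically:

```python
import numpy as np

# Synthetic stand-in for the survey data; the point illustrated is a
# property of least squares, not of any particular data set
rng = np.random.default_rng(2)
n = 150
impvarie = rng.integers(1, 8, n).astype(float)   # 1-7 importance rating
respgend = rng.integers(0, 2, n).astype(float)   # 0/1 gender code
y = 3.7 - 0.07 * impvarie + rng.normal(0, 0.9, n)

X = np.column_stack([np.ones(n), impvarie, respgend])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# X'e = 0 for OLS residuals: each column of X (including the intercept)
# is orthogonal to the residual vector, up to floating-point error
max_dot = float(np.abs(X.T @ resid).max())
```

Because of this orthogonality, a residual plot that "seems unrelated" to the predictor is the expected baseline; it is departures such as bowing or funneling that would signal trouble.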
[Residuals Versus IMPVARIE (response is DOWNLIKE)]

[Residuals Versus IMPHELP (response is DOWNLIKE)]

[Residuals Versus RESPGEND (response is DOWNLIKE)]

[Residuals Versus RESPMARI (response is DOWNLIKE)]

2e. In this Minitab test for normality, the points in the normal probability plot appear to deviate excessively from a straight line, and the approximate p-value is shown as < 0.01. At the 0.05 level of significance, we would conclude that the residuals could not have come from a normally distributed population. For this regression analysis, it appears that the assumption of normality of residuals may have been violated.

[Probability Plot of RESI1, Normal: Mean = -5.62513E-17, StDev = 0.9338, N = 150, KS = 0.148, P-Value < 0.010]

3. Repeating 1a through 1e, with dependent variable 9, Attitude toward West Mall.

Regression Analysis: WESTLIKE versus IMPVARIE, IMPHELP, ...

The regression equation is
WESTLIKE = 3.54 - 0.0906 IMPVARIE + 0.0341 IMPHELP - 0.201 RESPGEND + 0.270 RESPMARI

Predictor       Coef  SE Coef      T      P
Constant      3.5398   0.4162   8.51  0.000
IMPVARIE    -0.09060  0.07560  -1.20  0.233
IMPHELP      0.03413  0.05903   0.58  0.564
RESPGEND     -0.2013   0.1834  -1.10  0.274
RESPMARI      0.2704   0.1730   1.56  0.120

S = 1.03772   R-Sq = 3.5%   R-Sq(adj) = 0.9%

Analysis of Variance
Source          DF       SS     MS     F      P
Regression       4    5.729  1.432  1.33  0.262
Residual Error 145  156.144  1.077
Total          149  161.873

3a. The partial regression coefficient for RESPGEND is -0.2013. With the variables coded as 1 = male and 0 = female, males tend to have an attitude toward West Mall that is 0.2013 points lower than that displayed by females.
However, the p-value of 0.274 is not less than α = 0.05, so the coefficient does not differ significantly from zero at the 0.05 level of significance. In this regression, none of the partial regression coefficients differs significantly from zero at the 0.05 level.

3b. The p-value for the strength of the overall relationship is 0.262. This is not less than the 0.05 level specified, so the overall regression equation is not significant at the 0.05 level.

3c. The percentage of the variation in y that is explained by the regression equation is only 3.5% (unadjusted). In the ANOVA portion of the printout, this is the Regression sum of squares (5.729) divided by the Total sum of squares (161.873).

3d. Plotting the residuals versus each of the independent variables: in each plot, the residuals seem to be unrelated to the independent variable, thus supporting the validity of the model.

[Residuals Versus IMPVARIE (response is WESTLIKE)]

[Residuals Versus IMPHELP (response is WESTLIKE)]

[Residuals Versus RESPGEND (response is WESTLIKE)]

[Residuals Versus RESPMARI (response is WESTLIKE)]

3e. In this Minitab test for normality, the points in the normal probability plot appear to deviate excessively from a straight line, and the approximate p-value is shown as < 0.01. At the 0.05 level of significance, we would conclude that the residuals could not have come from a normally distributed population. For this regression analysis, it appears that the assumption of normality of residuals may have been violated.

[Probability Plot of RESI1, Normal: Mean = -1.89478E-16, StDev = 1.024, N = 150, KS = 0.097, P-Value < 0.010]

4.
The four independent variables -- IMPVARIE, IMPHELP, RESPGEND, and RESPMARI -- do a better job of predicting attitude toward Springdale Mall (R-sq = 11.9%, overall p-value = 0.001) than attitude toward either Downtown (R-sq = 1.2%, p-value = 0.791) or West Mall (R-sq = 3.5%, p-value = 0.262).

BUSINESS CASES

EASTON REALTY COMPANY (A)

1. With regard to the two parties claiming their homes were not sold for fair market price by Easton:

a. The selling price of the first home, not located in the Dallas portion of the metroplex, four years old, and with 2190 square feet, was $88,500. The selling price of the second home, not located in the Dallas portion of the metroplex, nine years old, and with 1848 square feet, was $79,500. Using Minitab and the EASTON data file, we identify the average selling price for all homes sold during the most recent three-month period, as well as the average selling price for all homes sold during each of those three months.

For all homes sold during the most recent three-month period:

Descriptive Statistics: Price
Variable    N   Mean  SE Mean  StDev  Minimum     Q1  Median      Q3  Maximum
Price     378  91367      895  17394    51800  78475   89400  102850   137100

For homes sold during each of the most recent three months:

Descriptive Statistics: Price
Variable  Month    N   Mean  SE Mean  StDev  Minimum     Q1  Median      Q3  Maximum
Price         4  131  95649     1456  16661    60400  82200   96200  107000   137100
              5  127  90972     1570  17696    58100  78200   88900  102700   134100
              6  120  87112     1541  16883    51800  75650   85700   99075   131900

The prices of the two homes in question ($88,500 and $79,500) are below the mean price for all homes sold during the most recent three-month period ($91,367). However, the homes that are the subject of the controversy were sold in the most recent month (June, or month code 6), during which the mean price was just $87,112 in a declining market -- note the declining mean selling prices from month 4 through month 6.
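Overall and month-by-month summaries like those above can be reproduced with a short script. The sketch below uses pandas on a hypothetical six-row stand-in for the EASTON file, with assumed column names Price and Month:

```python
import pandas as pd

# Hypothetical mini-version of the EASTON data (the real file has 378 rows)
df = pd.DataFrame({
    "Month": [4, 4, 5, 5, 6, 6],
    "Price": [95000, 96000, 90000, 92000, 86000, 88000],
})

# Overall mean selling price across the three-month period
overall_mean = df["Price"].mean()

# Month-by-month descriptive statistics, analogous to the Minitab output
monthly = df.groupby("Month")["Price"].agg(
    ["count", "mean", "std", "min", "median", "max"]
)
```

With the actual data, the `mean` column of `monthly` would show the declining sequence 95649, 90972, 87112 cited in the discussion.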
On this basis, it would not appear that the prices of the homes in question were very much different from the mean price for all homes sold during the most recent month, and one of them even sold for a higher price than that mean.

b. There are a number of pricing factors that could make the comparison in part (a) unfair. In considering only the selling price, we are not taking into account other factors that could affect the price of a home, such as location, age, size, and number of bedrooms, among the many variables of which real estate agents are well aware. Regarding location, we will see in the regression of part 2a that homes in Dallas sell for a rather large premium versus comparable homes sold elsewhere.

c. In making their argument, the complaining sellers are relying heavily on the average selling price ($104,250) stated in the article for all homes sold in the area during the previous twelve months in a weakening housing market. Therein lies the weakest component of their argument: they sold their houses during the twelfth month of a one-year period during which housing prices in the area had been decreasing.

2. Using multiple regression to estimate Price as a function of SqFeet, Bedrooms, Age, Dallas, and Easton, we obtain the following Minitab printout for the most recent three months of home sales.

a. Interpreting the partial regression coefficients: on average, the price tends to increase by $38.6 for each additional square foot of living space, by $358 for each additional bedroom, and by $48 for each additional year of age. Also, the price tends to be $21,282 higher if the home is located in Dallas and $132 higher if it is sold by Easton rather than another realtor. The positive $132 coefficient for the Easton variable would appear to undermine accusations by the claimants that Easton has been engaging in a practice of underpricing its residential properties relative to other real estate companies.
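As a quick check on these interpretations, the fitted equation can be evaluated directly. The sketch below uses the rounded coefficients from the three-month printout and an illustrative home of 2190 square feet, 3 bedrooms, and 4 years of age (the characteristics of the first disputed home); because the coefficients are rounded, the result is only approximate:

```python
def predicted_price(sqfeet, bedrooms, age, dallas, easton):
    # Rounded coefficients from the three-month regression printout
    return (8309 + 38.6 * sqfeet + 358 * bedrooms + 48 * age
            + 21282 * dallas + 132 * easton)

# Same home, outside Dallas versus in Dallas, sold by Easton in both cases
outside = predicted_price(2190, 3, 4, dallas=0, easton=1)
in_dallas = predicted_price(2190, 3, 4, dallas=1, easton=1)

dallas_premium = in_dallas - outside   # equals the Dallas coefficient
```

Holding every other characteristic fixed, the two predictions differ by exactly the $21,282 Dallas coefficient, which is the sense in which a partial regression coefficient is interpreted.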
Especially noteworthy is the $21,282 premium for a home in Dallas versus elsewhere in the metroplex area, because neither of the disputed homes is located in Dallas.

Regression Analysis: Price versus SqFeet, Bedrooms, Age, Dallas, Easton

The regression equation is
Price = 8309 + 38.6 SqFeet + 358 Bedrooms + 48 Age + 21282 Dallas + 132 Easton

Predictor      Coef  SE Coef      T      P
Constant       8309     2082   3.99  0.000
SqFeet       38.640    1.257  30.73  0.000
Bedrooms      357.8    664.8   0.54  0.591
Age            47.8    152.9   0.31  0.755
Dallas      21281.8    647.4  32.87  0.000
Easton          132     1060   0.12  0.901

S = 6069.24   R-Sq = 88.0%   R-Sq(adj) = 87.8%

Analysis of Variance
Source          DF           SS           MS       F      P
Regression       5  1.00353E+11  20070528157  544.87  0.000
Residual Error 372  13702868976     36835669
Total          377  1.14056E+11

b. In this case, we will take into account the fact that each of the homes in dispute was sold during the most recent month. Thus, the printout below includes only data for the most recent month (June, or month code 6). For each of the two homes that are the subject of complaints, the printout includes a point estimate as well as 95% confidence and prediction intervals for a home having comparable characteristics and sold by a realtor other than Easton. In the printout below, note that the "Dallas" predictor variable has been specified as 0 for each of the disputed homes, because neither is located in Dallas.
Regression Analysis: Price versus SqFeet, Bedrooms, Age, Dallas, Easton

The regression equation is
Price = 3046 + 36.4 SqFeet + 1474 Bedrooms + 445 Age + 21456 Dallas + 624 Easton

Predictor      Coef  SE Coef      T      P
Constant       3046     3116   0.98  0.330
SqFeet       36.388    2.085  17.45  0.000
Bedrooms       1474     1091   1.35  0.179
Age           445.0    218.5   2.04  0.044
Dallas      21455.6    991.0  21.65  0.000
Easton          624     1491   0.42  0.677

S = 5044.68   R-Sq = 91.4%   R-Sq(adj) = 91.1%

Analysis of Variance
Source          DF           SS          MS       F      P
Regression       5  31017720744  6203544149  243.77  0.000
Residual Error 114   2901162922    25448798
Total          119  33918883667

Predicted Values for New Observations
New Obs    Fit  SE Fit          95% CI          95% PI
      1  88938    1277  (86408, 91469)  (78630, 99247)
      2  78718     905  (76926, 80510)  (68565, 88871)

Values of Predictors for New Observations
New Obs  SqFeet  Bedrooms   Age    Dallas    Easton
      1    2190      3.00  4.00  0.000000  0.000000
      2    1848      3.00  9.00  0.000000  0.000000

For the home that sold for $88,500, the point estimate is $88,938 for the selling price of a comparable home sold by another realtor. Also, referring to the prediction interval, we have 95% confidence that a comparable home sold by another realtor would have brought a price within the interval from $78,630 to $99,247. The price for which Easton sold the home is very close to the point estimate and well within the prediction interval. The point estimate and prediction interval provide no evidence that would tend to support the complaint being made by this seller.

For the home that sold for $79,500, the point estimate is $78,718 for the selling price of a comparable home sold by another realtor. Also, referring to the prediction interval, we have 95% confidence that a comparable home sold by another realtor would have brought a price within the interval from $68,565 to $88,871. The price for which Easton sold the home is actually slightly more than the point estimate and is well within the prediction interval.
The point estimate and prediction interval provide no evidence that would tend to support the complaint being made by this seller.

c. In addition to the points made in part 2b above, it should be noted that the regression equation based on June data alone shows a partial regression coefficient of +$624 for the Easton variable: on average, a home sold during June by Easton sold for $624 more than a comparable home sold by another realtor. This is yet another point that refutes the arguments of the disgruntled sellers of the two homes in question. Based on the evidence presented above, it would not seem that Easton is underpricing its residential properties.

CIRCUIT SYSTEMS, INC. (C)

In Chapters 11 and 14, we visited Circuit Systems, Inc., a company concerned about the effectiveness of its new program for reducing the cost of absenteeism among hourly workers. In this chapter, we take a different approach to analyzing its data.

1. We will first use a multiple regression model to estimate the number of days of sick leave this year as a function of two variables: days of sick leave taken last year and whether the employee is a participant in the exercise program. The Minitab printout is shown below.

Regression Analysis: Sick_ThisYr versus Sick_LastYr, Exercise?

The regression equation is
Sick_ThisYr = 1.53 + 0.566 Sick_LastYr - 0.955 Exercise?

Predictor        Coef  SE Coef      T      P
Constant       1.5325   0.3529   4.34  0.000
Sick_LastYr   0.56577  0.02439  23.19  0.000
Exercise?     -0.9549   0.2643  -3.61  0.000

S = 1.86447   R-Sq = 70.5%   R-Sq(adj) = 70.3%

Analysis of Variance
Source          DF       SS      MS       F      P
Regression       2  1913.93  956.97  275.29  0.000
Residual Error 230   799.54    3.48
Total          232  2713.47

The significance of the overall regression is quite strong, with the p-value displayed as 0.000.
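The overall F ratio on the printout is simply the mean square for regression divided by the mean square for residual error, and its p-value comes from the F distribution with (2, 230) degrees of freedom. A small check using the printout values (scipy assumed available):

```python
from scipy import stats

# Values taken from the ANOVA portion of the Minitab printout above
ms_regression = 956.97
ms_residual = 3.48
df_reg, df_res = 2, 230

# F ratio; differs slightly from the printed 275.29 because the printed
# mean squares are rounded
f_ratio = ms_regression / ms_residual

# Right-tail p-value of the F distribution
p_value = stats.f.sf(f_ratio, df_reg, df_res)
```

The p-value is so small that Minitab displays it as 0.000, consistent with a very strong overall regression.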
Interpreting the partial regression coefficients in this model: on average, for a 1-day increase in the number of sick days a person took last year, the model predicts a 0.566-day increase in the number of sick days taken this year. On average, a person participating in the exercise program would tend to have 0.955 fewer sick days this year than a person not participating in the exercise program, which would indicate that the program is working in terms of reducing the number of sick days taken. Both signs are as we would have expected. On the basis of this regression analysis, the exercise program is worthy of continuation. However, keep in mind that we are considering only days of absence, not the total cost associated with absence, which includes the $200 subsidy for persons participating in the exercise program.

2. The regression model explains 70.5% of the variation in days of sick leave this year, so 29.5% of the variation in the number of sick days taken this year is not explained. Some of the as-yet unexplained variation could probably be accounted for by variables associated with the incentive package implemented by the company. Possible variables that are not in the database include the employee's level of work satisfaction, age, gender, family size, and length of commute.
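To see the practical meaning of the fitted equation, it can be evaluated for a program participant and a non-participant with the same sick-leave history. The sketch below uses the rounded coefficients from the printout above; the 5-day history is an arbitrary illustration:

```python
def predicted_sick_days(last_year, exercise):
    # Rounded coefficients from the Minitab printout:
    # Sick_ThisYr = 1.53 + 0.566 Sick_LastYr - 0.955 Exercise?
    return 1.53 + 0.566 * last_year - 0.955 * exercise

# Two employees, each with 5 sick days last year
participant = predicted_sick_days(5, exercise=1)
non_participant = predicted_sick_days(5, exercise=0)

# The gap between them is the Exercise? coefficient, holding history fixed
program_effect = non_participant - participant
```

Whatever the prior history, the model predicts 0.955 fewer sick days this year for a participant than for a comparable non-participant; this is exactly the partial-coefficient interpretation discussed above.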