Chapter 12
Multiple Regression Analysis

The basic ideas are the same as in Chapters 3 & 11.
- We have one response (dependent) variable, Y. The response (Y) is a quantitative variable.
- There is more than one predictor (independent variable): X1, X2, …, Xp, where p = number of predictors in the model.
  o The predictors can be:
    - Quantitative (as before)
    - Categorical (new)
    - Interaction terms (products of predictors)
    - Powers of predictors (e.g. X², X⁴).
In this course we will concentrate on
  o Reading computer output
  o Interpreting coefficients
  o Determining the order in which to interpret things.

Chapter 12, Fall 2007 Page 1 of 49

Some Examples

Example – 1: Suppose we want to predict temperature for different cities, based on their latitude and elevation. In this case, the response and the predictors are
  Y = temperature
  X1 = Latitude
  X2 = Elevation
Possible models are
  With p = 2: y = α + β₁x₁ + β₂x₂ (stiff surface)
  With p = 3: y = α + β₁x₁ + β₂x₂ + β₃x₁x₂ (twisted surface)

Example – 2: We want to predict patients' "well-being" from the dosage of medicine they take (mg) using a quadratic model:
  y = α + β₁x + β₂x²
Here X = dosage of the active ingredient (in mg), and p = 2.

Example – 3: Suppose we want to predict Y = the highway mileage of a car using X1 = its city mileage and X2 = its size (a categorical variable), where
  X2 = 0 if the car is compact, 1 if the car is larger.
The model we may use is
  y = α + β₁x₁ + β₂x₂ + β₃(x₁x₂)
Note that the last term, β₃(x₁x₂), is an interaction term, which allows for NON-parallel lines.

In general terms:

The model: y = α + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε

Assumptions:
1) ε ~ N(0, σ) [error terms are iid normal with mean zero and constant standard deviation].
2) As a result of this, we have Y ~ N(µY, σ) for every combination of x1, x2, …, xp.
That is, the response (Y) has a normal distribution with mean µY (which depends on the values of the independent variables, the x's) and a constant standard deviation σ (which does not depend on the values of the X's).

We use data to find the Fitted Equation or Prediction Equation
  ŷ = a + b₁x₁ + b₂x₂ + ⋯ + bₚxₚ

ANOVA F-test: Overall test of "goodness" of the model
  Ho: β₁ = β₂ = β₃ = … = βₚ = 0   NOTHING GOOD in the model
  Ha: at least one of the β's ≠ 0   SOMETHING IS GOOD.
Test Statistic: F = MSReg / MSE
P-value from the tables of the F-distribution with
  df1 = p = degrees of freedom of MSReg
  df2 = n – p – 1 = degrees of freedom of MSE

ANOVA for the Multiple Regression Model

Source              df         SS     MS                  F
Regression (Model)  p          SSReg  MSReg = SSReg/p     F = MSReg/MSE
Residual (Error)    n – p – 1  SSE    MSE = SSE/(n–p–1)
Total               n – 1      SST

Testing for Individual β's:
Computer output from Minitab:

Regression Analysis Y vs. X1, X2, …, Xp
Predictor  Coef  SE Coef  T          P
Constant   a     SE(a)    a/SE(a)    .
X1         b1    SE(b1)   b1/SE(b1)  .
X2         b2    SE(b2)   b2/SE(b2)  .
…          …     …        …          …
Xp         bp    SE(bp)   bp/SE(bp)  .

The columns give the estimates bᵢ of the βᵢ, the SEs of the estimates, the test statistics for Ho: βᵢ = 0 vs. Ha: βᵢ ≠ 0, and the (2-sided) p-values.

Look at the p-value for each:
- If the p-value for βᵢ is small, then Xᵢ is good.
- If the p-value for βᵢ is large, then the independent variable Xᵢ is NOT ADDING any information to the model AFTER all other predictors are taken into account.

Example – 1: Let Y = height of a person, X1 = length of right arm, X2 = length of left arm.
1. Suppose after collecting data we obtained an ANOVA table with a small p-value. What does that mean?
2. What is the next step?
3. Let's say you carried out individual t-tests on each of the slopes, β₁ and β₂, and found that the p-values for both are large. What does that mean?
4. Can you see a contradiction here?
5. When do we get such contradictory results?
6. So, when do we have multicollinearity?
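Before moving to a real data set, the arithmetic behind the ANOVA table above can be sketched in a few lines of Python. The SS values below are invented purely to illustrate the computations; only the formulas come from the notes.

```python
# Build the multiple-regression ANOVA quantities from SSReg, SSE, n and p.
# The numbers here are invented purely to illustrate the arithmetic.
n, p = 30, 3
ss_reg, ss_err = 90.0, 52.0

df_reg, df_err, df_tot = p, n - p - 1, n - 1
ms_reg = ss_reg / df_reg   # 30.0
ms_err = ss_err / df_err   # 2.0
F = ms_reg / ms_err        # 15.0
ss_tot = ss_reg + ss_err   # SST = SSReg + SSE

print(df_reg, df_err, df_tot, F)
```

The p-value would then come from the F-distribution with (df_reg, df_err) = (3, 26) degrees of freedom.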
Example – 2: Suppose we are interested in predicting the GPA of students in college (CGPA) using 16 different predictor variables. Data were collected from a random sample of 59 college students.
1. What is the response variable in this problem?
2. What are the values of n and p?
3. What are Ho and Ha that you can test using the ANOVA table?
4. What is your decision, based on the following ANOVA table? What is your conclusion?

Analysis of Variance
Source          DF  SS      MS      F     P
Regression      16  3.3135  0.2071  1.99  0.037
Residual Error  42  4.3601  0.1038
Total           58  7.6736

5. What is the next step?
6. When do you NOT take the next step?

Now look at the following output from Minitab:

Regression Analysis: CGPA versus Height, Gender, ...
The regression equation is
CGPA = 0.53 + 0.0194 Height + 0.047 Gender – 0.00163 Haircut – 0.042 Job
       + 0.0004 Studytime – 0.375 Smokecig + 0.0488 Dated + 0.546 HSGPA
       + 0.00315 HomeDist + 0.00069 BrowseInternet – 0.00128 WatchTV
       – 0.0117 Exercise + 0.0140 ReadNwsP + 0.039 Vegan
       – 0.0139 PoliticalDeg – 0.0801 PoliticalAff

7. Can you make any decisions based on the above? Why or why not?

8. The following is another part of the Minitab output.
Which predictor(s) is/are "good?"

Predictor        Coef        SE Coef     T      P
Constant          0.532      1.496       0.36   0.724
Height            0.01942    0.01637     1.19   0.242
Gender            0.0468     0.1429      0.33   0.745
Haircut          –0.001633   0.001697   –0.96   0.341
Job              –0.0418     0.1024     –0.41   0.685
Studytime         0.00043    0.01921     0.02   0.982
Smokecig         –0.3746     0.2249     –1.67   0.103
Dated             0.04881    0.07111     0.69   0.496
HSGPA             0.5457     0.1776      3.07   0.004
HomeDist          0.003147   0.003400    0.93   0.360
BrowseInternet    0.000689   0.001163    0.59   0.557
WatchTV          –0.0012840  0.0009710  –1.32   0.193
Exercise         –0.011657   0.005934   –1.96   0.056
ReadNewsP         0.01395    0.02272     0.61   0.543
Vegan             0.0392     0.1578      0.25   0.805
PoliticalDegree  –0.01390    0.03185    –0.44   0.665
PoliticalAff     –0.08006    0.07741    –1.03   0.307

S = 0.322198  R-Sq = 43.2%  R-Sq(adj) = 21.5%

9. The following is the last part of the output. What does it tell us?

Unusual Observations
Obs  Height  CGPA    Fit     SE Fit  Residual  St Resid
28   67.0    2.9800  3.5898  0.2442  –0.6098   –2.90R
40   65.0    3.9300  3.3458  0.2176   0.5842    2.46R
59   62.0    2.5000  3.4718  0.1352  –0.9718   –3.32R
R denotes an observation with a large standardized residual.

Although the individual t-tests indicate that the high school GPA (HSGPA) and Exercise have coefficients (βᵢ) that are significantly different from zero when tested one at a time, with p-values of 0.004 and 0.056, respectively (hence they seem to look good), we should look at all possible combinations of the 16 predictors, so as not to miss any combination that may give better results. It is almost impossible to do this by hand, but fortunately computers can do it for us. In this way we can find the "best subset" of predictors that will give the "best" prediction equation. The Minitab output on the next page gives "all" possible subsets of regression models.

Best Subsets Regression: CGPA versus Height, Gender, ...
Response is CGPA
(Each row of the Minitab output also carries a column of X's marking which of the 16 predictors are in that model; the vertically printed predictor labels did not survive extraction, so only the numeric columns are reproduced here.)

Vars  R-Sq  R-Sq(adj)  Mallows C-p  S
 1    25.5  24.2         0.1        0.31667
 1    13.0  11.5         9.3        0.34217
 2    31.6  29.2        –2.4        0.30613
 2    29.4  26.9        –0.8        0.31109
 3    33.8  30.2        –2.1        0.30389
 3    33.7  30.0        –2.0        0.30423
 4    35.7  31.0        –1.5        0.30223
 4    35.3  30.5        –1.2        0.30320
 5    37.3  31.4        –0.6        0.30132
 5    37.0  31.1        –0.4        0.30198
 6    38.3  31.2         0.6        0.30163
 6    38.3  31.2         0.6        0.30164
 7    39.6  31.3         1.7        0.30150
 7    39.3  30.9         1.9        0.30231
 8    40.4  30.8         3.1        0.30249
 8    40.4  30.8         3.1        0.30256
 9    41.5  30.8         4.2        0.30266
 9    41.0  30.2         4.6        0.30395
10    41.9  29.8         6.0        0.30478
10    41.8  29.7         6.0        0.30492
11    42.2  28.7         7.7        0.30712
11    42.2  28.7         7.7        0.30715
12    42.6  27.6         9.4        0.30945
12    42.6  27.6         9.5        0.30954
13    42.9  26.4        11.2        0.31205
13    42.8  26.3        11.3        0.31229
14    43.1  25.0        13.1        0.31502
14    43.0  24.9        13.1        0.31526
15    43.2  23.4        15.0        0.31843
15    43.1  23.2        15.1        0.31866
16    43.2  21.5        17.0        0.32220

Observe that R² never goes down when you add predictors to the model, whereas adjusted R² will go down when you add new predictors that are not adding any information to the model.

Note that when there are more than 2 predictors in the regression model, the adjusted R² does not change much from the model that has HSGPA and Exercise as the predictors.
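The adjusted R² in the table penalizes extra predictors: R²(adj) = 1 - (1 - R²)(n - 1)/(n - p - 1). A quick Python check against two rows of the table above (small differences are expected because the printed R² values are rounded):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Full 16-predictor CGPA model: R-Sq = 43.2%, n = 59 -> R-Sq(adj) near 21.5%.
print(round(adjusted_r2(0.432, 59, 16), 3))
# Best single-predictor model: R-Sq = 25.5% -> R-Sq(adj) near 24.2%.
print(round(adjusted_r2(0.255, 59, 1), 3))
```

This is why adding uninformative predictors drags the adjusted R² down even while R² creeps up.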
Another consideration in model selection is "parsimony." That is, the preferred model is one that is as simple as possible while having a high adjusted R². Thus, it seems that "the best combination" of predictors is HSGPA and Exercise.

Now we need to work through the above steps again and see what we can say for a regression model that has only HSGPA (X1) and Exercise (X2) as predictors. We obtain the following output from Minitab:

Regression Analysis: CGPA versus HSGPA, Exercise
The regression equation is
CGPA = 1.55 + 0.560 HSGPA – 0.0111 Exercise

Predictor  Coef       SE Coef    T      P
Constant    1.5489    0.5551     2.79   0.007
HSGPA       0.5599    0.1436     3.90   0.000
Exercise   –0.011138  0.004985  –2.23   0.029

S = 0.306126  R-Sq = 31.6%  R-Sq(adj) = 29.2%

Analysis of Variance
Source            DF  SS      MS      F      P
Regression        2   2.4256  1.2128  12.94  0.000
Residual (Error)  56  5.2479  0.0937
Total             58  7.6736

First, using ANOVA we test Ho: β₁ = β₂ = 0 against Ha: at least one of β₁ and β₂ is different from zero. Since the p-value < 0.0005, we reject Ho. The observed data give strong evidence that at least one of the two predictors is good in explaining the variation in CGPA.

Next we carry out two individual tests:
  Ho: β₁ = 0 vs. Ha: β₁ ≠ 0   and   Ho: β₂ = 0 vs. Ha: β₂ ≠ 0

Using the output above, we reject both of the null hypotheses, with p-value < 0.0005 for HSGPA and p-value = 0.029 for Exercise. These decisions indicate that both of the predictors are "good" ones.
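Each T value in the coefficient table is just Coef divided by SE Coef (with df = n - p - 1 = 56 here). Checking the HSGPA and Exercise rows in Python:

```python
# t = coefficient estimate / its standard error, from the Minitab table above.
t_hsgpa = 0.5599 / 0.1436        # reported as 3.90
t_exercise = -0.011138 / 0.004985  # reported as -2.23

print(round(t_hsgpa, 2), round(t_exercise, 2))
```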
Analyses of Residuals:
Before we move on, we need to look at the last part of the output, which gives us some warning messages based on an analysis of residuals:

Unusual Observations
Obs  HSGPA  CGPA    Fit     SE Fit  Residual  St Resid
3    3.00   3.6000  3.2176  0.1297   0.3824    1.38 X
9    3.50   2.8800  3.4808  0.0642  –0.6008   –2.01R
14   3.30   2.6000  2.7284  0.2647  –0.1284   –0.83 X
27   2.55   3.1400  2.9099  0.1840   0.2301    0.94 X
28   3.80   2.9800  3.6544  0.0445  –0.6744   –2.23R
59   3.60   2.5000  3.5424  0.0556  –1.0424   –3.46R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

The above output indicates that there are 3 influential observations and 3 observations that may also be influential. Let's look at some graphs of the residuals to see what is happening:

[Figure: Residual Plots for CGPA, four panels: a normal probability plot of the residuals, standardized residuals versus fitted values, a histogram of the residuals, and standardized residuals versus observation order.]

In the first panel (top-left) we see that all but one of the residuals are close to the blue line, indicating that the assumption of normality of the residuals is supported by the data. The lowest dot on the left-hand side of this graph is the outlier or influential observation.

The second panel (top-right) shows that the standardized residuals are randomly scattered around the horizontal line (residual = 0), and all except one (the outlier) are within 3 standard deviations of the mean, i.e., zero (as expected).

The third panel (bottom-left) is a histogram of the standardized residuals and supports the claim that the residuals have a normal distribution with zero mean and some constant variance.
Finally, the last graph does not show any funnel shape, so the assumption of constant variance is supported. We can still see the outlier(s).

In order to see whether including a quadratic or higher-order term in one or both of the predictors might improve the model, we look at scatter diagrams of the standardized residuals vs. the predictors. These are given below:

[Figure: Residuals versus HSGPA and residuals versus Exercise (response is CGPA), standardized residuals plotted against each predictor.]

We do not see any higher-order relation between the residuals and the predictors. So we cannot improve the model by adding any other predictor. However, there is at least one observation that needs to be checked and corrected if possible, or removed from the data set otherwise. The question is which one of the observations should we look at and delete first (if we cannot find the reason why it has such a large residual)? [We delete observations one at a time because things may change after deleting one observation.]

The easiest way is to look at the plot of residuals against the order of the observations.

[Figure: Residuals Versus the Order of the Data (response is CGPA).]

We immediately see that the last observation (#14 in the data set) has the largest standardized residual, and hence we should start with that. We see that this student practices for 60 hours per week and hence is far from the others in the data set. The student who has the nearest X2 value practices for 25 hours. Other students practice for 15 hours per week or less. Thus, this student is not typical at all. The following is the Minitab output when the observations from this student are deleted.
Regression Analysis: CGPA versus HSGPA, Exercise

Analysis of Variance
Source          DF  SS       MS       F     P
Regression      2   1.45009  0.72504  7.69  0.001
Residual Error  55  5.18265  0.09423
Total           57  6.63274

Looking at the ANOVA table we decide to reject Ho and conclude that at least one of the two predictors is "good." [Note the change in the degrees of freedom in the ANOVA. Why should they change?]

The regression equation is
CGPA = 1.54 + 0.554 HSGPA – 0.00432 Exercise

Predictor  Coef       SE Coef    T      P
Constant    1.5388    0.5568     2.76   0.008
HSGPA       0.5542    0.1441     3.85   0.000
Exercise   –0.004320  0.009596  –0.45   0.654

S = 0.306969  R-Sq = 21.9%  R-Sq(adj) = 19.0%

Tests on the β's one at a time show that the second predictor (Exercise) is not "good," since the corresponding p-value = 0.654 is larger than any reasonable α. This means we should try a new model, without Exercise. Because we are going to change the model, we do not need to do anything based on the rest of the output.

Unusual Observations
Obs  HSGPA  CGPA    Fit     SE Fit  Residual  St Resid
3    3.00   3.6000  3.1970  0.1324   0.4030    1.45 X
25   3.50   3.3100  3.3705  0.1974  –0.0605   –0.26 X
26   2.55   3.1400  2.9261  0.1856   0.2139    0.87 X
27   3.80   2.9800  3.6361  0.0497  –0.6561   –2.17R
58   3.60   2.5000  3.5252  0.0594  –1.0252   –3.40R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Here is the output for the SLR model with HSGPA as the predictor:

Regression Analysis: CGPA versus HSGPA
The regression equation is
CGPA = 1.50 + 0.560 HSGPA

Predictor  Coef    SE Coef  T     P
Constant   1.4964  0.5448   2.75  0.008
HSGPA      0.5596  0.1426   3.92  0.000

S = 0.304776  R-Sq = 21.6%  R-Sq(adj) = 20.2%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1.4310  1.4310  15.41  0.000
Residual Error  56  5.2017  0.0929
Total           57  6.6327

Both panels show that β₁ is significantly different from zero, and hence we have a "reasonably good" model. [It is not really a good model. Why?]
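The R-Sq value in the output above can be recovered from the ANOVA sums of squares as SSReg/SST (equivalently, 1 - SSE/SST). Checking the SLR table in Python:

```python
# From the SLR ANOVA table above: SSReg = 1.4310, SSE = 5.2017, SST = 6.6327.
ss_reg, ss_err, ss_tot = 1.4310, 5.2017, 6.6327

r2 = ss_reg / ss_tot          # proportion of variation in CGPA explained
print(round(r2, 3))
```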
Next we should look at the "unusual observations" panel of the output to see if we want to delete a few more observations to improve the model. We may also try to find other predictors so as to improve R², which is around 20%; i.e., only 20% of the variation in CGPA is explained by changes in HSGPA, or alternatively we can say that HSGPA has reduced the error sum of squares by only 20%.

Categorical Variables in MLR

Categorical variables (in multiple linear regression) are coded as 0 and 1. They are called dummy variables or indicator variables. When we want to compare a group of observations with a baseline group or a control group, we code the dummy variable as zero for that group.

Example: Suppose we want to predict the wages of employees using their length of service. Thus we have a quantitative response (Y = Wages) and a quantitative predictor (LOS = length of service). Of course, wages also depend on the size of the company that employs these workers, so let's add another variable and call it SIZE. So we have:
  Y = Wages = response variable
  X1 = LOS = length of service (predictor 1)
  X2 = SIZE = size of company (a categorical variable coded small or large).
We will use small companies as the baseline group; i.e., we will code the two categories of SIZE as X2 = 0 if the company is small and X2 = 1 if the company is large (not small).

Model: y = β₀ + β₁X₁ + β₂X₂

In the above model, when we substitute X2 = 0 we obtain an SLR model for small companies:
  y = β₀ + β₁X₁
Similarly, substituting X2 = 1 gives us another SLR for the large companies:
  y = β₀ + β₁X₁ + β₂(1) = (β₀ + β₂) + β₁X₁
Observe that the difference between the two is only in the intercept: for small companies the intercept is β₀, but for large companies it is (β₀ + β₂). However, both models have the same slope β₁. Thus we have two parallel lines.
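The parallel-lines structure can be made concrete with a couple of lines of Python. The coefficient values below are placeholders chosen for illustration only (loosely inspired by the wages example worked out later in these notes):

```python
# Dummy-variable model  y = b0 + b1*LOS + b2*SIZE,  SIZE = 0 (small) or 1 (large).
# Placeholder coefficient values, for illustration only.
b0, b1, b2 = 37.5, 0.08, 10.2

def mean_wage(los, size):
    return b0 + b1 * los + b2 * size

# Same slope in LOS for both groups; intercepts differ by b2: two parallel lines.
slope_small = mean_wage(101, 0) - mean_wage(100, 0)  # b1
slope_large = mean_wage(101, 1) - mean_wage(100, 1)  # also b1
gap = mean_wage(100, 1) - mean_wage(100, 0)          # b2, at any LOS
print(slope_small, slope_large, gap)
```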
Interpretation of the coefficients of the regression model:
  β₀: intercept for the baseline group
  β₁: slope for both groups
  β₂: change (or difference) in intercept for the other group compared to the baseline.

The above lines were forced to be parallel by the choice of model. To allow for non-parallel lines we add an interaction term to the model:

A Multiple Regression Model with Interaction
  y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂

Now let's see what we get when we substitute 0 or 1 for X2 in the above model:

Small companies: substitute X2 = 0
  y = β₀ + β₁X₁ + β₂(0) + β₃X₁(0) = β₀ + β₁X₁
  Intercept = β₀, Slope = β₁.

Large companies: substitute X2 = 1
  y = β₀ + β₁X₁ + β₂(1) + β₃X₁(1) = (β₀ + β₂) + (β₁ + β₃)X₁
  Intercept = β₀ + β₂, Slope = β₁ + β₃.

Interpretations:
  β₀: intercept for the baseline group
  β₁: slope of the baseline group
  β₂: change in intercept for the other group compared to baseline
  β₃: change in slope for the other group compared to baseline
The interaction term allows for non-parallel lines.

Steps:
1. Start with a model that has the interaction term.
2. Using the ANOVA table, test Ho: β₁ = β₂ = β₃ = 0 vs. Ha: at least one βᵢ ≠ 0.
3. Test Ho: β₃ = 0 vs. Ha: β₃ ≠ 0.
   a. If the null hypothesis is not rejected, then fit a simpler model with no interaction term.
   b. If the null hypothesis is rejected, then keep the interaction term in the model.

What if there are 3 levels for the categorical predictor?
Suppose we have 3 categories of SIZE (small, medium and large) in addition to the quantitative predictor LOS to predict the wages.
Then we need two dummy variables for SIZE:
  X2 = 1 if medium, 0 otherwise   and   X3 = 1 if large, 0 otherwise.

Now let's first look at an MLR model with no interaction terms:

MLR Model without Interaction
  y = β₀ + β₁X₁ + β₂X₂ + β₃X₃

Small companies: X2 = 0, X3 = 0
  y = β₀ + β₁X₁ + β₂(0) + β₃(0) = β₀ + β₁X₁

Medium companies: X2 = 1, X3 = 0
  y = β₀ + β₁X₁ + β₂(1) + β₃(0) = (β₀ + β₂) + β₁X₁

Large companies: X2 = 0, X3 = 1
  y = β₀ + β₁X₁ + β₂(0) + β₃(1) = (β₀ + β₃) + β₁X₁

Interpretation:
  β₀: intercept for the baseline group (small)
  β₁: slope for all 3 groups
  β₂: change in intercept for medium vs. small
  β₃: change in intercept for large vs. small

MLR Model with Interaction:
  y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₁X₂ + β₅X₁X₃

Small companies: X2 = 0, X3 = 0
  y = β₀ + β₁X₁ + β₂(0) + β₃(0) + β₄X₁(0) + β₅X₁(0) = β₀ + β₁X₁

Medium companies: X2 = 1, X3 = 0
  y = β₀ + β₁X₁ + β₂(1) + β₃(0) + β₄X₁(1) + β₅X₁(0) = (β₀ + β₂) + (β₁ + β₄)X₁

Large companies: X2 = 0, X3 = 1
  y = β₀ + β₁X₁ + β₂(0) + β₃(1) + β₄X₁(0) + β₅X₁(1) = (β₀ + β₃) + (β₁ + β₅)X₁

How do you interpret the β's?

Example: Wages vs. Length of Service and Size of Company
Coding of size of company: small = 0, large = 1.
A model with an interaction term:

Regression Analysis: Wages versus LOS, size, LOS*size

Analysis of Variance
Source          DF  SS      MS     F     P
Regression      3   2438.1  812.7  6.76  0.001
Residual Error  56  6728.3  120.1
Total           59  9166.4

The above ANOVA table tells us that at least one of the regression coefficients (β's) is significantly different from zero, but that alone is no help, since it does not say which one(s).
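The dummy-variable algebra worked out above is easy to verify numerically. A small Python sketch (the dictionary name and the coefficient values are invented for illustration only):

```python
# Three-level SIZE needs two dummies; "small" is the baseline (X2 = X3 = 0).
SIZE_DUMMIES = {"small": (0, 0), "medium": (1, 0), "large": (0, 1)}

# Interaction model  y = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X1*X2 + b5*X1*X3,
# with invented coefficient values:
b0, b1, b2, b3, b4, b5 = 40.0, 0.10, 5.0, 9.0, 0.02, -0.03

def line_for(size):
    """Return (intercept, slope in X1) of the line implied for a SIZE group."""
    x2, x3 = SIZE_DUMMIES[size]
    return (b0 + b2 * x2 + b3 * x3, b1 + b4 * x2 + b5 * x3)

# Each group gets its own intercept AND its own slope, as derived above.
print(line_for("small"), line_for("medium"), line_for("large"))
```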
Let's look at the next panel, which gives the estimated coefficients, test statistics and more:

The regression equation is
Wages = 35.9 + 0.104 LOS + 13.6 size – 0.0483 LOS*size

Predictor  Coef      SE Coef  T      P
Constant   35.914    3.562    10.08  0.000
LOS         0.10424  0.03632   2.87  0.006
size       13.631    4.910     2.78  0.007
LOS*size   –0.04828  0.05634  –0.86  0.395

S = 10.9612  R-Sq = 26.6%  R-Sq(adj) = 22.7%

Since the test for the coefficient of the interaction term (LOS*size) has a large p-value, we fail to reject the hypothesis of no interaction (Ho: β₃ = 0); hence we try a model with no interaction term.

The model with no interaction:

Regression Analysis: Wages versus LOS, size
The regression equation is
Wages = 37.5 + 0.0842 LOS + 10.2 size

Predictor  Coef     SE Coef  T      P
Constant   37.466   3.061    12.24  0.000
LOS         0.08417 0.02770   3.04  0.004
size       10.228   2.882     3.55  0.001

S = 10.9357  R-Sq = 25.6%  R-Sq(adj) = 23.0%

In the above output we see that both coefficients are significantly different from zero, so this is the model we want (or is it?). We decide to keep both variables in the model. However, because adjusted R² = 23% is quite small, we are not too happy with the model.

What else do we need to do? Look at the residuals to see if they give us any suggestions:

Unusual Observations
Obs  LOS  Wages  Fit    SE Fit  Residual  St Resid
15   70   97.68  53.59  1.85     44.09     4.09R
22   222  54.95  56.15  4.57    –1.21     –0.12 X
29   98   34.34  55.94  2.05   –21.60    –2.01R
42   228  67.91  56.66  4.71    11.25     1.14 X
47   204  50.17  64.87  4.26   –14.69    –1.46 X
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

R² may increase if we remove observation number 15 from the data set (after first trying to see if we can correct it). Alternatively, we could try to find other predictors and start all over.

Example: Reaction Time in a Computer Game vs. Distance to Move the Mouse and Hand Used
Coding of hand: right = 0, left = 1.
A model with interaction:

Analysis of Variance
Source          DF  SS      MS     F      P
Regression      3   136948  45649  17.82  0.000
Residual Error  36   92198   2561
Total           39  229146

What does the ANOVA table tell us? Now let's have a look at the estimates:

Regression Analysis: time versus distance, hand, dist*hand
The regression equation is
time = 99.4 + 0.028 distance + 72.2 hand + 0.234 dist*hand

Predictor  Coef    SE Coef  T     P
Constant   99.36   25.25    3.93  0.000
distance    0.0283  0.1308  0.22  0.830
hand       72.18   35.71    2.02  0.051
dist*hand   0.2336  0.1850  1.26  0.215

S = 50.6067  R-Sq = 59.8%  R-Sq(adj) = 56.4%

Since the test on the coefficient of the interaction term has a large p-value, we fail to reject Ho: β₃ = 0 and use a model with no interaction term. [Ignore the rest of the output.]

Unusual Observations
Obs  distance  time    Fit     SE Fit  Residual  St Resid
25   163       315.00  214.29  11.38   100.71    2.04R
30   271       401.00  242.65  17.19   158.35    3.33R
31    40       320.00  182.09  20.68   137.91    2.99R
R denotes an observation with a large standardized residual.

The model with no interaction:

Regression Analysis: time versus distance, hand

Analysis of Variance
Source          DF  SS      MS     F      P
Regression      2   132865  66433  25.53  0.000
Residual Error  37   96281   2602
Total           39  229146

The ANOVA table tells us that at least one of the β's is significantly different from zero. But which one(s)?

The regression equation is
time = 79.2 + 0.145 distance + 112 hand

Predictor  Coef     SE Coef  T     P
Constant    79.21   19.72    4.02  0.000
distance     0.14512 0.09324 1.56  0.128
hand       112.50   16.13    6.97  0.000

S = 51.0116  R-Sq = 58.0%  R-Sq(adj) = 55.7%

Since the p-value for the test on the coefficient of distance is large, we fail to reject Ho: β₁ = 0. This leaves HAND as the only predictor in the model.

You may try to see if removing observation 30 will change this result. [It didn't.]
Unusual Observations
Obs  distance  time    Fit     SE Fit  Residual  St Resid
25   163       315.00  215.39  11.44    99.61    2.00R
30   271       401.00  231.10  14.67   169.90    3.48R
31    40       320.00  197.55  16.80   122.45    2.54R

So, what is next?

Simple Linear Regression:

Regression Analysis: time versus hand
The regression equation is
time = 104 + 112 hand

Predictor  Coef    SE Coef  T     P
Constant   104.25  11.62    8.97  0.000
hand       112.50  16.43    6.85  0.000

S = 51.9573  R-Sq = 55.2%  R-Sq(adj) = 54.1%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   126562  126562  46.88  0.000
Residual Error  38  102583    2700
Total           39  229146

The above output indicates that both β₀ and β₁ are significantly different from zero. How do you interpret the β's?

Unusual Observations
Obs  hand  time    Fit     SE Fit  Residual  St Resid
30   1.00  401.00  216.75  11.62    184.25    3.64R
31   1.00  320.00  216.75  11.62    103.25    2.04R
32   1.00  113.00  216.75  11.62   –103.75   –2.05R
R denotes an observation with a large standardized residual.

You may try to see if removing observation #30 will change the results. [It didn't.]

We may use the methods of Chapter 13 in this case, since the predictor is a categorical variable. Observe that we get the same ANOVA table in both cases.

Testing the hypothesis of equal population means using the methods of Chapter 13:

One-way ANOVA: time versus hand
Source  DF  SS      MS      F      P
hand    1   126563  126563  46.88  0.000
Error   38  102584    2700
Total   39  229146

S = 51.96  R-Sq = 55.23%  R-Sq(adj) = 54.05%

                        Individual 95% CIs For Mean Based on Pooled StDev
Level  N   Mean    StDev  --+---------+---------+---------+-------
0      20  104.25   8.25    (-----*-----)
1      20  216.75  73.01                             (-----*-----)
                          --+---------+---------+---------+-------
                            80        120       160       200
Pooled StDev = 51.96

The above graph shows that the CIs for the means of the two populations do not overlap, indicating that there is a significant difference between the means of the two populations.
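With a single binary predictor, the regression t-test and the one-way ANOVA F-test are the same test: F = t². Checking this with the numbers from the outputs above (agreement is only up to the rounding in the printed coefficients):

```python
# hand coefficient and its SE from the SLR output above.
t = 112.50 / 16.43   # reported T = 6.85
F = 46.88            # F from the one-way ANOVA table

# t squared should reproduce the ANOVA F, up to rounding.
print(round(t, 2), round(t * t, 1))
```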
Since there are only 2 populations (define them), we can also test the hypothesis of no difference between the two population means using the methods of Chapter 9. Minitab gives the following output:

Testing the hypothesis of no difference between the two population means using the methods of Chapter 9:

Two-Sample T-Test and CI: time, hand
Two-sample T for time
hand  N   Mean    StDev  SE Mean
0     20  104.25   8.25  1.8
1     20  216.8   73.0   16

Difference = mu (0) – mu (1)
Estimate for difference: –112.500
95% CI for difference: (–146.889, –78.111)
T-Test of difference = 0 (vs not =): T-Value = –6.85  P-Value = 0.000  DF = 19

How will you interpret the above output?

In the last example you have seen different ways of getting the same results. Can you see the differences and similarities between them?

12.6 Logistic Regression (This section will be included in Exam 3.)

This model is used when the response is categorical with two categories (Yes = "Success" and No = "Failure"). There may be one or more predictors, at least one of which is quantitative.

Example: Suppose you want to predict whether a person has cancer or not (Yes/No) based on the number of cigarettes s/he smokes (quantitative), gender (Male/Female), age (quantitative) and an index of family history (quantitative). Then,
  Response: Y = Have cancer? (categorical)
  Predictors: X1 = Number of cigarettes (quantitative)
              X2 = Gender (categorical)
              X3 = Age (quantitative)
              X4 = Family history index (quantitative)

To analyze such data, we will express the probabilities in terms of "odds" or "odds ratios" and use the logarithms (to base e) of these odds. [That is the reason the model is called the logistic regression model.]
We will concentrate on
  o Transforming the data: P("Success") to "odds" to "log odds"
  o Interpreting computer output

Let's first clarify the concepts of "odds" and the "logit function" or "log odds":

Transforming the data: For each unit in the sample, we will have observation(s) on the X(s) as well as an observation on Y (a dummy variable or an indicator), where
  Y = 1 if the response is "Yes" (= "Success") and
  Y = 0 if the response is "No" (= "Failure").
Then the number of "Success"es in the sample is ΣYᵢ (summed over i = 1, …, n), and hence
  p̂ = (ΣYᵢ)/n = sample proportion.
[Note that the sample proportion is the sample mean of a binary (Bernoulli) variable!]

The "odds" (or "odds ratio") is the ratio of the probability of "Success" to the probability of "Failure", estimated by p̂/(1 – p̂).

The LOG ODDS (or the logit function) is defined as the natural logarithm of the odds, i.e.,
  LOG ODDS = logₑ( p̂/(1 – p̂) ) = ln( p̂/(1 – p̂) ).

Interpreting "Odds"

Example – 1: Suppose the "odds of having a disease" are 0.33, that is,
  ODDS = 0.33 = 1/3 = 1 : 3.
This means 1 person has the disease for every 3 who don't. Thus,
  p̂ = probability that a person has the disease
     = (number of people who have the disease) / (total number of people, haves + have-nots)
     = 1/(1 + 3) = 1/4 = 0.25 = 25%.

Example – 2: ODDS = 0.5 = 1/2 = 1 : 2
  p̂ = 1/(1 + 2) = 1/3 ≈ 0.33 = 33%.

Example – 3: ODDS = 1.5 = 15/10 = 3/2 = 3 : 2
  p̂ = 3/(3 + 2) = 3/5 = 0.60 = 60%.

Note that ODDS > 1 means p̂ > 50%.

Working Backwards:

Example – 1: Suppose p̂ = 0.90. Then ODDS = p̂/(1 – p̂) = 0.9/0.1 = 9, that is, ODDS = 9 : 1. So LogODDS = ln(9) = 2.1972.
Let's work backwards: find p̂ when LogODDS is given as 2.1972.
When LogODDS = 2.1972, we take the exponential (the inverse of the natural log) of both sides to get
  ODDS = e^LogODDS = e^2.1972 = 8.9998 ≈ 9 = 9 : 1.
So p̂ = 9/(9 + 1) = 0.9.

Example – 2: Suppose p̂ = 0.45. Then
  ODDS = p̂/(1 – p̂) = 0.45/0.55 = 45/55 = 9/11 = 0.8181, i.e., 9 : 11, and
  LogODDS = ln(0.8181) = –0.20067.
Let's work backwards: find p̂ when LogODDS = –0.20067 is given.
When LogODDS = –0.20067,
  ODDS = e^LogODDS = e^(–0.20067) = 0.8181 ≈ 0.82 = 82/100 = 82 : 100.
Hence p̂ = 82/(82 + 100) = 82/182 ≈ 0.45.

Example – 3: Suppose computer output reports LogODDS = 1.57. What is the sample proportion?
When LogODDS = 1.57, ODDS = e^1.57 = 4.81, that is, ODDS = 481/100 = 481 : 100.
So p̂ = 481/(481 + 100) = 0.828 = 82.8%.

Logistic Regression Model:
  p = e^(α + βX) / (1 + e^(α + βX))
Here p = P("Success") = P(Y = 1) and X is the predictor (a quantitative variable). Then the LogODDS (logit) function is
  Log( p/(1 – p) ) = α + βX.
Although the right-hand side of the above equation looks like SLR, a scatter diagram of X against p is an S-shaped curve. In the above model, β is NOT the slope, although its sign tells us whether there is an increasing (β > 0) or a decreasing (β < 0) relation between X and p.

Fitted equation:
  Log( p̂/(1 – p̂) ) = a + bX

Interpretation of b (we don't interpret a):
  b = change in the LogODDS of "Success" as X increases by one unit (check the sign).
Is b significantly different from zero? We can test this with Ho: β = 0 vs. Ha: β ≠ 0.
ODDS ratio = eᵇ. Thus eᵇ gives us the multiplicative change in the ODDS as X increases by one unit.
Is the ODDS ratio (eᵇ) significantly different from one? If we fail to reject Ho: β = 0 vs. Ha: β ≠ 0, then β is NOT significantly different from zero, and hence e^β is not significantly different from 1. [This means the probability of "Success" is the same for all values of X.] On the other hand, if the p-value reported by the computer is small, we reject Ho: β = 0 in favor of Ha: β ≠ 0, which implies β ≠ 0, hence e^β ≠ 1, and hence the probability of "Success" is different for different X values.
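The p̂ to odds to log-odds conversions used in the examples above are one-liners with the standard library; a quick sketch:

```python
import math

def log_odds(p):
    """Natural log of the odds p/(1-p)."""
    return math.log(p / (1 - p))

def prob_from_log_odds(L):
    """Invert the logit: p = exp(L) / (1 + exp(L))."""
    return math.exp(L) / (1 + math.exp(L))

# p = 0.90 gives odds 9:1 and log-odds ln(9) = 2.1972, which inverts back to 0.90.
print(round(log_odds(0.9), 4))
print(round(prob_from_log_odds(2.1972), 2))
print(round(prob_from_log_odds(1.57), 3))   # Example - 3 above
```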
Note that β = 0 means there is no linear relationship between X and the LogODDS, hence no relationship between p and X.

Example: How does age affect the chances of developing osteoporosis?

Data:
X = Age (predictor):   72   85   84   …
Osteoporosis?         Yes   No  Yes   …
Y:                      1    0    1   …

Logistic Regression Model:
Log( p / (1 − p) ) = α + βX
Let us interpret the Minitab output below.

Minitab Output: Logistic Regression of Osteoporosis (yes=1, no=0) on Age (in years)

Logistic Regression Table
                                                Odds     95% CI
Predictor     Coef   SE Coef      Z       P    Ratio   Lower  Upper
Constant    -4.353    2.4865  -1.75  0.0802
age          0.038    0.0072   5.28  0.0000     1.04    1.02   1.05

Fitted Equation (from Minitab):
Log( p̂ / (1 − p̂) ) = −4.353 + 0.038(Age)

Testing the hypothesis that age has no effect on developing osteoporosis, i.e., Ho: β = 0 vs. Ha: β ≠ 0 (Z-test): the computer gives p-value ≈ 0.000, so we reject Ho. Age is a “good” predictor of whether a woman will develop osteoporosis.

b = 0.038 (do not interpret directly).
Interpret e^b = e^0.038 = 1.039. [On the output, ODDS RATIO = 1.04.]
Interpretation: As age increases by one year, the odds of getting osteoporosis are 1.04 times what they were the year before.
95% CI for the ODDS ratio: (1.02, 1.05). Since the CI does not contain one, age is a “good” predictor of osteoporosis.

Predict the chance (probability) of getting osteoporosis at age 65 and also at age 75.
a) When X = 65:
LogODDS = Log( p̂ / (1 − p̂) ) = −4.353 + 0.038(Age) = −4.353 + 0.038(65) = −1.883
ODDS = e^(−1.883) = 0.152 ≈ 0.15 = 15/100
So p̂ = 15 / (15 + 100) ≈ 0.13; i.e., 13% of women aged 65 have osteoporosis.

Another way: using the model p = e^(α + βX) / (1 + e^(α + βX)), we can estimate p as
p̂ = e^(−4.353 + 0.038(65)) / (1 + e^(−4.353 + 0.038(65))) = 0.152 / 1.152 ≈ 0.13

b) When X = 75:
LogODDS = Log( p̂ / (1 − p̂) ) = −4.353 + 0.038(75) = −1.503
Then, ODDS = e^(−1.503) = 0.22, and hence
p̂ = ODDS / (1 + ODDS) = 0.22 / 1.22 ≈ 0.18; that is, 18% of women aged 75 will have osteoporosis.
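The hand computations above can be checked by evaluating the fitted osteoporosis equation directly. The coefficients come from the Minitab output above; the function name `predict_osteo` is my own:

```python
import math

# Fitted coefficients from the Minitab output: Log(odds) = -4.353 + 0.038*Age
a, b = -4.353, 0.038

def predict_osteo(age):
    """p-hat = e^(a + b*age) / (1 + e^(a + b*age))."""
    log_odds = a + b * age
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(round(predict_osteo(65), 2))  # 0.13 -> about 13% at age 65
print(round(predict_osteo(75), 2))  # 0.18 -> about 18% at age 75
```

Both routes in the notes (odds first, or the e^(α+βX)/(1+e^(α+βX)) form) give the same numbers, since they are the same formula.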
Multiple Logistic Regression Model

Example: Predicting chances of cancer (for a population at very high risk) from age and smoking status.
Response: Y (binary, categorical with 2 options)
Y = 1 if Cancer, 0 if not
Predictors:
X1 = Age (quantitative)
X2 = 1 if Smoker, 0 if not (binary)

Multiple Logistic Regression Model:
Log( p / (1 − p) ) = β0 + β1(Age) + β2(Smoking)

Here is a Minitab output. Let’s interpret it:

Minitab output: Logistic Regression of Cancer (yes=1, no=0) on Age (in years) and Smoking (yes=1, no=0)

Logistic Regression Table
                                                  Odds     95% CI
Predictor      Coef   SE Coef      Z        P    Ratio   Lower  Upper
Constant    -4.4777    2.7465  -1.63   0.1032
age          0.1123    0.0386   2.91   0.0036     1.12    1.04   1.21
smoking      1.1638    0.4537   2.57   0.0103     3.21    1.32   7.79

Log-Likelihood = -137.18596
Test that all slopes are zero: G = 18.8479, DF = 2, P-Value = 0.000

Fitted Equation:
Log( p̂ / (1 − p̂) ) = b0 + b1(Age) + b2(Smoking) = −4.4777 + 0.1123(Age) + 1.1638(Smoking)

Inferences about Age: The p-value = 0.0036 < 0.01, thus age has a significant effect on the probability of getting cancer.
Odds Ratio = 1.12 means that with each year one gets older, the ODDS of getting cancer are 1.12 times what they were during the previous year.
CI for the ODDS ratio: (1.04, 1.21) does not include 1, hence age has a significant effect on cancer.

Inferences on Smoking: The p-value = 0.0103, hence at α = 0.05 and at α = 0.10, β2 is significantly different from zero; thus smoking has a significant effect on cancer.
ODDS Ratio = 3.21; that is, the ODDS of getting cancer for smokers are 3.21 times what they are for non-smokers (of the same age).
CI for the ODDS ratio: (1.32, 7.79) does not include one, so smoking is significant. The ODDS of getting cancer for smokers may be as high as 7.79 times those for non-smokers. [Are you still smoking?]
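The odds ratios in the output are just e raised to the coefficients, which we can verify ourselves (a small check, not part of the original output; Minitab rounds from the unrounded coefficients, so the last digit may differ slightly):

```python
import math

# Coefficients for age and smoking from the cancer output above
b_age, b_smoking = 0.1123, 1.1638

# Each extra year of age multiplies the odds of cancer by e^b_age
print(round(math.exp(b_age), 2))      # 1.12, matching the reported Odds Ratio

# At the same age, smokers' odds are e^b_smoking times non-smokers' odds
print(round(math.exp(b_smoking), 2))  # about 3.2; Minitab reports 3.21
```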
Predicting the probability of Cancer for an 80-year-old non-smoker (X1 = 80, X2 = 0):
Log( p̂ / (1 − p̂) ) = −4.4777 + 0.1123(Age) + 1.1638(Smoking) = −4.4777 + 0.1123(80) + 1.1638(0) ≈ 4.5
ODDS = e^4.5 = 90.6
Probability of cancer = 90.6 / (1 + 90.6) = 0.9891 = 98.91%

Homework:
1. Find the probability of cancer for a 75-year-old smoker. (Answer ≈ 99%)
2. Find the probability of cancer for a 40-year-old non-smoker. (Answer ≈ 50%)
3. Do you think this output is useful to predict the probability for a smoker at your age? Why or why not? [You need to make some assumptions before you can answer this.]
4. Use the output below to interpret the numbers and find whether the proportion of frequent binge drinkers differs by gender.
5. Can you answer the same question using the same output by two other methods? [You should not have difficulty answering this question, since we have seen those methods in Chapters 9 and 10.]

Example: Logistic Regression of Frequent Binge Drinking (yes=1, no=0) on Gender (males=1, females=0)

Gender      YES      NO    Total
Male       1630    5550     7180
Female     1684    8232     9916
Total      3314   13782    17096

Logistic Regression Table
                                                     Odds     95% CI
Predictor       Coef     SE Coef       Z       P    Ratio   Lower  Upper
Constant    -1.58686   0.0267449  -59.33   0.000
gender      0.361639   0.0388452    9.31   0.000     1.44    1.33   1.55
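The cancer prediction worked above, and the first two homework answers stated in the notes, can be checked with a short function built from the fitted cancer equation (the function name `predict_cancer` is my own):

```python
import math

# Fitted coefficients from the cancer output:
# Log(odds) = -4.4777 + 0.1123*Age + 1.1638*Smoking
b0, b_age, b_smoking = -4.4777, 0.1123, 1.1638

def predict_cancer(age, smoker):
    """p-hat from the fitted multiple logistic equation (smoker is 0 or 1)."""
    log_odds = b0 + b_age * age + b_smoking * smoker
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# 80-year-old non-smoker: log-odds about 4.5, odds about 90.6
print(round(predict_cancer(80, 0), 4))  # 0.9891

# Homework checks (answers given in the notes): 75-year-old smoker
# is about 99%, 40-year-old non-smoker about 50%
print(round(predict_cancer(75, 1), 2))  # 0.99
print(round(predict_cancer(40, 0), 2))  # 0.5
```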