
Chapter 12: Multiple Regression and Model Building
McClave: Statistics, 11th ed.

Where We've Been
- Introduced the straight-line model relating a dependent variable y to an independent variable x
- Estimated the parameters of the straight-line model using least squares
- Assessed the model estimates
- Used the model to estimate a value of y given x

Where We're Going
- Introduce a multiple regression model to relate a variable y to two or more x variables
- Present multiple regression models with both quantitative and qualitative independent variables
- Assess how well the multiple regression model fits the sample data
- Show how analyzing the model residuals can help detect problems with the model and the necessary modifications

12.1: Multiple Regression Models

The General Multiple Regression Model
y = β0 + β1x1 + β2x2 + ... + βkxk + ε
where y is the dependent variable; x1, x2, ..., xk are the independent variables;
E(y) = β0 + β1x1 + β2x2 + ... + βkxk is the deterministic portion of the model;
and βi determines the contribution of the independent variable xi, which may be a quantitative variable of order one or higher or a qualitative variable.

Analyzing a Multiple Regression Model
Step 1: Hypothesize the deterministic portion of the model by choosing the independent variables x1, x2, ..., xk.
Step 2: Estimate the unknown parameters β0, β1, β2, ..., βk.
Step 3: Specify the probability distribution of ε and estimate the standard deviation σ of this distribution.
Step 4: Check that the assumptions about ε are satisfied; if not, make the required modifications to the model.
Step 5: Statistically evaluate the usefulness of the model.
Step 6: If the model is useful, use it for prediction, estimation, and other purposes.

Assumptions about the Random Error ε
1. The mean of ε is equal to 0.
2. The variance of ε is equal to σ².
3. The probability distribution of ε is a normal distribution.
4. The random errors are independent of one another.

12.2: The First-Order Model: Estimating and Making Inferences about the β Parameters

A First-Order Model in Five Quantitative Independent Variables
E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5
where x1, x2, ..., x5 are all quantitative variables that are not functions of other independent variables.

The β parameters are estimated by finding the values of the β̂'s that minimize SSE = Σ(y − ŷ)².

Only a truly talented mathematician (or geek) would choose to solve the necessary system of simultaneous linear equations by hand. In practice, the complicated calculations required by multiple regression models are left to computers.
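The estimation step can be sketched numerically. This is a minimal illustration, not the textbook's software output: the data and coefficients below are simulated and arbitrary.

```python
# Sketch: letting the computer solve the least squares system, as the slide suggests.
# All data below are simulated for illustration; only the mechanics follow the chapter.
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2
# Design matrix with an intercept column and two quantitative predictors
X = np.column_stack([np.ones(n), rng.uniform(100, 200, n), rng.integers(5, 16, n)])
beta_true = np.array([-1300.0, 12.0, 85.0])          # hypothetical parameters
y = X @ beta_true + rng.normal(0, 100, n)            # simulated responses

# Least squares: choose beta_hat to minimize SSE = sum((y - X @ beta)**2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = float(np.sum((y - X @ beta_hat) ** 2))
s = (sse / (n - (k + 1))) ** 0.5                     # estimate of sigma
print(np.round(beta_hat, 2), round(s, 1))
```

No other coefficient vector can produce a smaller SSE than beta_hat, which is what "minimize SSE" means in practice.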
A collector of antique clocks hypothesizes that the auction price can be modeled as
y = β0 + β1x1 + β2x2 + ε
where
y = auction price in dollars
x1 = age of the clock in years
x2 = number of bidders.

Based on the data in Table 12.1, the least squares prediction equation, the equation that minimizes SSE, is
ŷ = −1,339 + 12.74x1 + 85.95x2
SSE = 516,727
s² = SSE / [n − (k + 1)] = 516,727 / 29 = 17,818
s = 133.5 (the estimate of σ)

The estimate of β1 is interpreted as the expected change in y given a one-unit change in x1, holding x2 constant. The estimate of β2 is interpreted as the expected change in y given a one-unit change in x2, holding x1 constant.

Since it makes no sense to sell a clock of age 0 at an auction with no bidders, the intercept term has no meaningful interpretation in this example.
Test of an Individual Parameter Coefficient in the Multiple Regression Model
One-Tailed Test: H0: βi = 0, Ha: βi > 0 (or βi < 0); rejection region: t > tα (or t < −tα)
Two-Tailed Test: H0: βi = 0, Ha: βi ≠ 0; rejection region: |t| > tα/2
Test statistic: t = β̂i / s_β̂i
where tα and tα/2 are based on n − (k + 1) degrees of freedom, and
n = number of observations
k + 1 = number of β parameters in the model

Test of the Parameter Coefficient on the Number of Bidders
H0: β2 = 0
Ha: β2 > 0
Rejection region: t > tα, with t.05 = 1.699
Test statistic: t* = β̂2 / s_β̂2 = 85.953 / 8.729 = 9.85
Since t* > t.05, reject the null hypothesis.

A 100(1 − α)% Confidence Interval for a β Parameter
β̂i ± (tα/2) s_β̂i
where tα/2 is based on n − (k + 1) degrees of freedom, and
n = number of observations
k + 1 = number of β parameters in the model
Valid inferences about βi also require that the four assumptions about ε are satisfied.

A 90% Confidence Interval for β1
β̂1 ± t.05 s_β̂1 = 12.74 ± 1.699(.905) = 12.74 ± 1.54
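These calculations can be reproduced directly from the numbers reported on the slides (β̂2 = 85.953, s_β̂2 = 8.729, and t.05 = 1.699 with 29 df); a small sketch:

```python
# t-test and confidence-interval arithmetic using the values reported on the slides.
beta2_hat, se_beta2 = 85.953, 8.729      # clock model: coefficient on bidders
t_crit = 1.699                           # t_.05 with 29 df, from a t table

t_star = beta2_hat / se_beta2            # test statistic for H0: beta2 = 0
reject = t_star > t_crit                 # one-tailed test, Ha: beta2 > 0

# 90% CI for beta1: 12.74 +/- 1.699 * 0.905
half_width = t_crit * 0.905
print(round(t_star, 2), reject, round(half_width, 2))
```

This reproduces the slide's t* = 9.85 and the ±1.54 half-width for β1.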
Holding the number of bidders constant, the result above tells us that we can be 90% confident that the auction price will rise between $11.20 and $14.28 for each one-year increase in age.

12.3: Evaluating Overall Model Utility

If we reject H0: βi = 0, there is evidence of a linear relationship between y and xi.
If we do not reject H0: βi = 0, then there may be no relationship between y and xi, or a Type II error occurred, or the relationship between y and xi is more complex than a straight-line relationship.

The multiple coefficient of determination, R², measures how much of the overall variation in y is explained by the least squares prediction equation:
R² = 1 − SSE/SSyy = (SSyy − SSE)/SSyy = Explained variability / Total variability

High values of R² suggest a good model, but the usefulness of R² falls as the number of observations becomes close to the number of β parameters estimated.

The Adjusted Multiple Coefficient of Determination
Ra² = 1 − [(n − 1) / (n − (k + 1))] (SSE/SSyy) = 1 − [(n − 1) / (n − (k + 1))] (1 − R²)
Ra² adjusts for the number of observations and the number of β parameter estimates. It will always have a value no greater than R².
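The two formulas above can be sketched as small functions. SSE, n, and k below are the clock example's; the SSyy value is a hypothetical stand-in, so the printed values are illustrative only.

```python
# Minimal sketch of R^2 and adjusted R^2 from SSE and SSyy.
def r_squared(sse, ss_yy):
    return 1 - sse / ss_yy

def adj_r_squared(sse, ss_yy, n, k):
    return 1 - (n - 1) / (n - (k + 1)) * (sse / ss_yy)

sse, ss_yy, n, k = 516_727.0, 4_800_000.0, 32, 2   # SSyy is a made-up stand-in
r2 = r_squared(sse, ss_yy)
ra2 = adj_r_squared(sse, ss_yy, n, k)
print(round(r2, 3), round(ra2, 3))
```

As the slide states, the adjustment factor (n − 1)/(n − (k + 1)) ≥ 1 guarantees Ra² ≤ R².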
The Analysis-of-Variance F-Test
H0: β1 = β2 = ... = βk = 0
Ha: At least one βi ≠ 0
Test statistic:
F = [(SSyy − SSE)/k] / [SSE/(n − (k + 1))] = (R²/k) / [(1 − R²)/(n − (k + 1))] = Mean square (Model) / Mean square (Error)
where n is the sample size and k is the number of terms in the model.
Rejection region: F > Fα, with k numerator and n − (k + 1) denominator degrees of freedom.

Rejecting the null hypothesis means that something in your model helps explain variations in y, but it may be that another model provides more reliable estimates and predictions.

The Global F-Test for the Antique Clock Model
H0: β1 = β2 = 0
Ha: At least one of the two coefficients is nonzero
where
y = auction price in dollars
x1 = age of the clock in years
x2 = number of bidders
Test statistic: F = MS(Model)/MSE = 2,141,531 / 17,818 = 120.19
p-value: less than .00001
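The two forms of the F statistic above are algebraically equivalent, which a quick check illustrates. MS(Model) and MSE are the clock example's reported values; the R² used here is a rounded approximation, so the two forms agree only to rounding.

```python
# The two algebraically equivalent forms of the global F statistic from the slide.
n, k = 32, 2
ms_model, mse = 2_141_531.0, 17_818.0
f_from_ms = ms_model / mse                        # MS(Model) / MS(Error)

r2 = 0.892                                        # rounded R^2, so only approximate
f_from_r2 = (r2 / k) / ((1 - r2) / (n - (k + 1)))
print(round(f_from_ms, 2), round(f_from_r2, 1))
```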
Something in the model is useful, but the F-test cannot tell us which x-variables are individually useful.

Checking the Utility of a Multiple Regression Model
1. Use the F-test to conduct a test of the adequacy of the overall model.
2. Conduct t-tests on the "most important" β parameters.
3. Examine Ra² and 2s to evaluate how well the model fits the data.

12.4: Using the Model for Estimation and Prediction

The model of antique clock prices can be used to predict sale prices for clocks of a certain age with a particular number of bidders. What is the mean sale price for all 150-year-old clocks with 10 bidders?

The mean value of all clocks with these characteristics can be estimated by using statistical software to generate a confidence interval (see Figure 12.7). In this case, the interval indicates that we can be 95% confident that the mean price of 150-year-old clocks sold at auction with 10 bidders is between $1,154.10 and $1,709.30.
What is the predicted sale price for a single 50-year-old clock with 2 bidders?

Since 50 years of age and 2 bidders are both outside the range of values in our data set, any prediction using these values would be unreliable.

12.5: Model Building: Interaction Models

In some cases, the impact of an independent variable xi on y will depend on the value of some other independent variable xk. Interaction models include the cross-products of independent variables as well as the first-order terms.

An Interaction Model Relating E(y) to Two Quantitative Independent Variables
E(y) = β0 + β1x1 + β2x2 + β3x1x2
where β1 + β3x2 represents the change in E(y) for every one-unit change in x1, holding x2 fixed, and β2 + β3x1 represents the change in E(y) for every one-unit change in x2, holding x1 fixed.

In the antique clock auction example, assume the collector has reason to believe that the impact of age (x1) on price (y) varies with the number of bidders (x2). The model is now
y = β0 + β1x1 + β2x2 + β3x1x2 + ε.
For the interaction model of antique clock prices, the MINITAB results are reported in Figure 12.11 in the text.

The Global F-Test
H0: β1 = β2 = β3 = 0
The test statistic is F = 193.04, with p-value ≈ 0.
Reject the null hypothesis.

The t-Test on the Interaction Parameter
H0: β3 = 0
The test statistic is t = 6.11, with p-value ≈ 0 (two-tailed); the one-tailed p-value, 0/2, is also ≈ 0.
Reject the null hypothesis.

The Estimated Model
ŷ = 320.5 + 0.878x1 − 93.26x2 + 1.2978x1x2
To estimate the change in the price of a 150-year-old clock given a one-unit change in x2, we must include the interaction term:
Estimated x2 slope = β̂2 + β̂3x1 = −93.26 + 1.30(150) = 101.74

Once the interaction term has passed the t-test, it is unnecessary to test the individual independent variables.
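The slope calculation above can be sketched with the slide's estimates (the slide rounds β̂3 to 1.30 and reports 101.74; with the unrounded 1.2978 the slope is 101.41):

```python
# Slope-on-x2 calculation for the interaction model, using the slide's estimates.
# With E(y) = b0 + b1*x1 + b2*x2 + b3*x1*x2, the slope on x2 is b2 + b3*x1.
b2_hat, b3_hat = -93.26, 1.2978

def bidder_slope(age):
    """Estimated change in auction price per extra bidder, for a clock of a given age."""
    return b2_hat + b3_hat * age

print(round(bidder_slope(150), 2))
```

Note that the slope changes sign with age: for young clocks an extra bidder is estimated to lower the price, for old clocks to raise it, which is exactly what the interaction term allows.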
12.6: Model Building: Quadratic and Other Higher-Order Models

A quadratic (second-order) model includes the square of an independent variable:
y = β0 + β1x + β2x² + ε
This allows more complex relationships to be modeled. β1 is the shift parameter and β2 is the rate of curvature.

Example 12.7 considers whether home size (x) impacts electrical usage (y) in a positive but decreasing way. The MINITAB results are shown in Figure 12.13.

According to the results, the equation that minimizes SSE for the 10 observations is
ŷ = −1,216.14 + 2.3989x − .00045x²
Ra² = .9767

Since 0 is not in the range of the independent variable (a house of 0 ft²?), the estimated intercept is not meaningful. The positive estimate on β1 indicates a positive relationship, although the slope is not constant (we've estimated a curve, not a straight line). The negative value on β2 indicates that the rate of increase in power usage declines for larger homes.
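A quadratic fit of this shape can be sketched with np.polyfit. The home sizes, noise, and seeded curve below are synthetic (the true curve is seeded with the slide's coefficients), so this only mirrors the idea, not the textbook's data.

```python
# Illustrative quadratic fit shaped like the home-size example: usage rises with
# size at a decreasing rate. Data are synthetic, seeded with the slide's curve.
import numpy as np

rng = np.random.default_rng(1)
size = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], float)
usage = -1216.14 + 2.3989 * size - 0.00045 * size**2 + rng.normal(0, 5, size.size)

c2, c1, c0 = np.polyfit(size, usage, deg=2)   # coefficients, highest power first
print(round(c1, 2), c2 < 0)
```

The recovered curvature coefficient c2 comes out negative, matching the "positive but decreasing" shape.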
The Global F-Test
H0: β1 = β2 = 0
Ha: At least one of the coefficients ≠ 0
The test statistic is F = 189.71, p-value near 0. Reject H0.

t-Test of β2
H0: β2 = 0
Ha: β2 < 0
The test statistic is t = −7.62, p-value = .0001 (two-tailed). The one-tailed p-value is .0001/2 = .00005.
Reject the null hypothesis.

Complete Second-Order Model with Two Quantitative Independent Variables
E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
β0 is the y-intercept; changing β1 and β2 causes the surface to shift along the x1 and x2 axes; β3 controls the rotation of the surface; and the signs and values of β4 and β5 control the type of surface and the rates of curvature.

12.7: Model Building: Qualitative (Dummy) Variable Models

Qualitative variables can be included in regression models through the use of dummy variables. One category is chosen as the base level; each of the other categories gets its own 0-1 dummy variable.

A Qualitative Independent Variable with k Levels
E(y) = β0 + β1x1 + β2x2 + ... + βk−1xk−1
where xi is the dummy variable for level i + 1 and
xi = 1 if y is observed at level i + 1, 0 otherwise.
With base level A:
μA = β0
μB = β0 + β1, so β1 = μB − μA
μC = β0 + β2, so β2 = μC − μA
μD = β0 + β3, so β3 = μD − μA
and in general βj = μ(level j + 1) − μA.
For the golf ball example from Chapter 10, there were four levels (the brands). Testing differences in brands can be done with the model
E(y) = β0 + β1x1 + β2x2 + β3x3
where
x1 = 1 if Brand B, 0 otherwise
x2 = 1 if Brand C, 0 otherwise
x3 = 1 if Brand D, 0 otherwise.

Brand A is the base level, so β0 represents the mean distance (μA) for Brand A, and
β1 = μB − μA
β2 = μC − μA
β3 = μD − μA

Testing that the four means are equal is equivalent to testing the significance of the β's:
H0: β1 = β2 = β3 = 0
Ha: At least one of the β's ≠ 0

The test statistic is the F-statistic. Here F = 43.99, p-value ≈ .000. Hence we reject the null hypothesis that the golf balls all have the same mean driving distance.

Remember that the maximum number of dummy variables is one less than the number of levels of the qualitative variable.
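The dummy coding above can be sketched directly. The distances below are invented; only the coding and the parameter meanings follow the slides.

```python
# Sketch of 0-1 dummy coding for a four-level factor with Brand A as base level.
import numpy as np

brands = ["A", "B", "C", "D", "A", "B", "C", "D"]
levels = ["B", "C", "D"]                          # k - 1 dummies; base "A" gets none
X = np.array([[1.0] + [1.0 if b == lvl else 0.0 for lvl in levels] for b in brands])

y = np.array([250.0, 260.0, 270.0, 240.0, 252.0, 262.0, 268.0, 242.0])  # invented
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] estimates mu_A; beta[1..3] estimate mu_B - mu_A, mu_C - mu_A, mu_D - mu_A
print(np.round(beta, 1))
```

The fitted intercept equals the Brand A sample mean, and each dummy coefficient equals that brand's mean minus Brand A's, exactly as the slide's β interpretations state.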
12.8: Model Building: Models with Both Quantitative and Qualitative Variables

Suppose a first-order model is used to evaluate the impact on mean monthly sales of expenditures in three advertising media: television, radio, and newspaper.
Expenditure, x1, is a quantitative variable.
Type of medium is qualitative, captured by the dummy variables x2 and x3 (at most k − 1 dummies for k levels).

E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3
where
x1 = advertising expenditure
x2 = 1 if radio, 0 otherwise
x3 = 1 if television, 0 otherwise
Newspaper is the base level.

Here β1 captures the main effect of advertising expenditure, β2 and β3 the main effects of the type of medium, and β4 and β5 the interaction between expenditure and medium.
Newspaper medium line: E(y) = β0 + β1x1
Radio medium line: E(y) = (β0 + β2) + (β1 + β4)x1 (intercept β0 + β2, slope β1 + β4)
Television medium line: E(y) = (β0 + β3) + (β1 + β5)x1 (intercept β0 + β3, slope β1 + β5)

Suppose now a second-order model is used to evaluate the impact of expenditures in the three advertising media on sales. The relationship between expenditures, x1, and sales, y, is assumed to be curvilinear.

E(y) = β0 + β1x1 + β2x1²
where x1 = advertising expenditure. In this model, each medium is assumed to have the same impact on sales.
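The line-by-medium decomposition of the first-order interaction model can be sketched as follows. The coefficients b0..b5 are made up; only the algebra follows the model.

```python
# Line-by-medium decomposition of E(y) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + b5*x1*x3
# with 0-1 dummies x2 (radio) and x3 (television); newspaper is the base level.
b0, b1, b2, b3, b4, b5 = 10.0, 2.0, 1.5, 3.0, 0.4, -0.3   # hypothetical values

def mean_sales(x1, radio=0, tv=0):
    return b0 + b1 * x1 + b2 * radio + b3 * tv + b4 * x1 * radio + b5 * x1 * tv

# Newspaper line: intercept b0, slope b1; radio: b0+b2, b1+b4; TV: b0+b3, b1+b5
print(mean_sales(5), mean_sales(5, radio=1), mean_sales(5, tv=1))
```

Setting the dummies to 0 or 1 recovers the three straight lines from the slide, each with its own intercept and slope.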
E(y) = β0 + β1x1 + β2x1² + β3x2 + β4x3
where
x1 = advertising expenditure
x2 = 1 if radio, 0 otherwise
x3 = 1 if television, 0 otherwise
Newspaper is the base level. In this model, the intercepts differ but the shapes of the curves are the same.

E(y) = β0 + β1x1 + β2x1² + β3x2 + β4x3 + β5x1x2 + β6x1x3 + β7x1²x2 + β8x1²x3
In this model, the response curve for each media type is different; that is, advertising expenditure and media type interact, at varying rates.

12.9: Model Building: Comparing Nested Models

Two models are nested if one model contains all the terms of the second model and at least one additional term. The more complex of the two models is called the complete model and the simpler of the two is called the reduced model.

Recall the interaction model relating the auction price (y) of antique clocks to age (x1) and bidders (x2):
E(y) = β0 + β1x1 + β2x2 + β3x1x2.

If the relationship is not constant, a second-order model should be considered:
E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2².
The interaction model is the reduced model; the second-order model is the complete model.

If the complete model produces a better fit, then the β's on the quadratic terms should be significant.
H0: β4 = β5 = 0
Ha: At least one of β4 and β5 is nonzero

F-Test for Comparing Nested Models
Reduced model: E(y) = β0 + β1x1 + ... + βg xg
Complete model: E(y) = β0 + β1x1 + ... + βg xg + βg+1 xg+1 + ... + βk xk
H0: βg+1 = βg+2 = ... = βk = 0
Ha: At least one of the β parameters in H0 is nonzero
Test statistic:
F = [(SSER − SSEC)/(k − g)] / [SSEC/(n − (k + 1))] = [(SSER − SSEC)/(number of β's in H0)] / MSEC
where
SSER = sum of squared errors for the reduced model
SSEC = sum of squared errors for the complete model
MSEC = mean square error (s²) for the complete model
k − g = number of β parameters specified in H0
k + 1 = number of β parameters in the complete model
n = sample size
Rejection region: F > Fα, with k − g numerator and n − (k + 1) denominator degrees of freedom.

The growth of carnations (y) is assumed to be a function of the temperature (x1) and the amount of fertilizer (x2). The data are shown in Table 12.6 in the text.

The complete second-order model is
E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
The least squares prediction equation from Table 12.6, rounded, is
ŷ = −5,127.90 + 31.10x1 + 139.75x2 − .146x1x2 − .133x1² − 1.14x2²
To test the significance of the contribution of the interaction and second-order terms, use
H0: β3 = β4 = β5 = 0
Ha: At least one of β3, β4, or β5 ≠ 0
This requires estimating the model in reduced form, dropping the terms whose β parameters appear in the null hypothesis. Results are given in Figure 12.31.

Test statistic:
F = [(SSER − SSEC)/(k − g)] / MSEC = [(6,671.50852 − 59.17832)/3] / 2.81802 = 782.15
Rejection region: F > F.05 = 3.07
Reject the null hypothesis: the complete model seems to provide better predictions than the reduced model.

A parsimonious model is a general linear model with a small number of β parameters. In situations where two competing models have essentially the same predictive power (as determined by an F-test), choose the more parsimonious of the two.
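The nested-model F computation above is simple enough to sketch as a function, using the carnation numbers reported on the slide (the sample size n = 27 is inferred from the reported MSEC = 59.17832/21):

```python
# The nested-model (partial) F test from the slides, with the carnation numbers.
def nested_f(sse_reduced, sse_complete, num_betas_in_h0, n, k):
    numerator = (sse_reduced - sse_complete) / num_betas_in_h0
    denominator = sse_complete / (n - (k + 1))    # MSE of the complete model
    return numerator / denominator

# Complete model has k = 5 terms; H0 drops 3 of them; n = 27, so n - (k+1) = 21.
f = nested_f(6671.50852, 59.17832, 3, 27, 5)
print(round(f, 2))
```

This reproduces the slide's F = 782.15, far beyond the F.05 = 3.07 cutoff.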
If the models are not nested, the choice is more subjective, based on Ra², s, and an understanding of the theory behind the model.

12.10: Model Building: Stepwise Regression

It is often unclear which independent variables have a significant impact on y. Screening variables in an attempt to identify the most important ones is known as stepwise regression.

Step 1: For each xi, estimate E(y) = β0 + β1xi and test β1. The xi with the largest absolute t-score (call it x*) is the best one-variable predictor of y.
Step 2: Estimate E(y) = β0 + β1x* + β2xj for each of the remaining k − 1 x-variables. The x-variable with the highest absolute value of t is retained (call it x'). (Some software packages may drop x* upon re-testing.)
Step 3: Estimate E(y) = β0 + β1x* + β2x' + β3xg with the remaining k − 2 x-variables, as in Step 2. Continue until no remaining x-variables yield significant t-scores when included in the model.

1. The mean is equal to 0. 2. The variance is equal to σ². 3. The probability distribution is a normal distribution. 4.
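The steps above can be sketched as a bare-bones forward-selection loop. The data are simulated and the |t| > 2 cutoff is a crude stand-in for a t-table value; a real package would also re-test previously entered variables.

```python
# Bare-bones forward-stepwise sketch of Steps 1-3: at each step, add the remaining
# x with the largest |t| on its coefficient, stopping when nothing clears the cutoff.
import numpy as np

def t_stats(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 60
xs = rng.normal(size=(n, 4))
y = 5 + 3 * xs[:, 1] + 1.5 * xs[:, 3] + rng.normal(size=n)   # only x1, x3 matter

selected, remaining = [], [0, 1, 2, 3]
while remaining:
    scores = {}
    for j in remaining:
        X = np.column_stack([np.ones(n)] + [xs[:, i] for i in selected + [j]])
        scores[j] = abs(t_stats(X, y)[-1])                   # t on the candidate
    best = max(scores, key=scores.get)
    if scores[best] < 2.0:                                   # crude |t| cutoff
        break
    selected.append(best)
    remaining.remove(best)
print(selected)
```

On this simulated data, the loop picks up the two truly relevant predictors (columns 1 and 3) and leaves the noise variables out.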
The random errors are independent of one another.

Stepwise regression must be used with caution:
- Many t-tests are conducted, leading to high probabilities of Type I or Type II errors.
- Usually, no interaction or higher-order terms are considered, and reality may not be that simple.

12.11: Residual Analysis: Checking the Regression Assumptions

Regression analysis is based on the four assumptions about the random error ε considered earlier:
1. The mean of ε is equal to 0.
2. The variance of ε is equal to σ².
3. The probability distribution of ε is a normal distribution.
4. The random errors are independent of one another.

If these assumptions are not valid, the results of the regression estimation are called into question. Checking the validity of the assumptions involves analyzing the residuals of the regression.

A regression residual ε̂ is defined as the difference between an observed y-value and its corresponding predicted value:
ε̂ = (y − ŷ) = y − (β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk)

Properties of the Regression Residuals
1. The mean of the residuals is equal to 0: Σ(y − ŷ) = 0.
2. The standard deviation of the residuals is equal to the standard deviation s of the fitted regression model:
s = √[Σ(y − ŷ)² / (n − (k + 1))] = √[SSE / (n − (k + 1))] = √MSE

If the model is misspecified, the mean of ε will not equal 0. Residual analysis may reveal this problem. The home-size electricity usage example illustrates this.
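The two residual properties above can be verified on any least squares fit; a quick check on simulated data:

```python
# Checking the residual properties on a simulated least squares fit:
# residuals from a model with an intercept sum to (numerically) zero, and
# their standard deviation estimate is sqrt(SSE / (n - (k + 1))) = sqrt(MSE).
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)   # true sigma = 1

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s = float(np.sqrt(resid @ resid / (n - (k + 1))))
print(round(float(resid.sum()), 10), round(s, 2))
```

The zero-sum property holds exactly (up to floating-point error) whenever the model includes an intercept, which is why a nonzero residual mean signals a computation or specification problem.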
A residual larger than 3s (in absolute value) is considered an outlier. Outliers will have an undue influence on the estimates. Possible causes of an outlier:
1. Mistakenly recorded data
2. An observation that is for some reason truly different from the others
3. Random chance

Leaving in the data set an outlier that should be removed (#1 and #2 above) will produce misleading estimates and predictions. So will removing an outlier that actually belongs in the data set (#3 above).

Residual plots should be centered on 0 and within ±3s of 0. Residual histograms should be relatively bell-shaped. Residual normal probability plots should display straight lines.

Regression analysis is robust with respect to (small) nonnormal errors: slight departures from normality will not seriously harm the validity of the estimates, but as the departure from normality grows, the validity falls.

If the variance of ε changes as y changes, the constant variance assumption is violated. As an example, a first-order model is used to relate the salaries (y) of social workers to years of experience (x).
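The residual properties above and the 3s outlier screen can be checked numerically. A minimal sketch assuming NumPy; the function name, the cutoff argument, and the planted recording error are all illustrative:

```python
import numpy as np

def fit_and_flag(X, y, cutoff=3.0):
    """Fit by least squares, then flag residuals larger than cutoff*s
    in absolute value, where s = sqrt(SSE / (n - (k+1)))."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta                           # residuals: mean 0 by construction
    s = np.sqrt(np.sum(resid**2) / (n - (k + 1)))   # s of the fitted model
    return resid, s, np.abs(resid) > cutoff * s

# hypothetical data with one grossly mis-recorded observation
rng = np.random.default_rng(2)
x = rng.normal(size=(80, 1))
y = 2 + 3 * x[:, 0] + rng.normal(scale=0.3, size=80)
y[10] += 5.0                                        # planted recording error
resid, s, flags = fit_and_flag(x, y)
print(resid.mean())           # property 1: essentially 0
print(np.flatnonzero(flags))  # only the planted outlier is flagged
```

Note that the gross error inflates s itself, which is one reason an outlier exerts undue influence on the estimates.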
Fitting the first-order model gives:

E(y) = β0 + β1x
ŷ = 11,368.72 + 2,141.38x
R^2 = .787
t1 = 13.31; p-value ≈ 0

The model seems to provide good predictions, but the residual plot reveals a non-random pattern: the spread of the residuals increases as the estimated mean salary increases, violating the constant variance assumption.

Transforming the dependent variable often stabilizes the residual variance. Possible transformations of y include the natural logarithm, the square root, and sin^-1(y^(1/2)).

Steps in a Residual Analysis
1. Plot the residuals against each quantitative independent variable and look for non-random patterns.
2. Examine the residual plots for outliers.
3. Plot the residuals with a stem-and-leaf display, histogram, or normal probability plot and check for nonnormal errors.
4. Plot the residuals against the predicted y-values to check for nonconstant variances.
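The effect of the natural-log transformation listed above can be illustrated on simulated salary-style data whose spread grows with the mean. Everything here (the data-generating model, the thirds-based spread comparison) is a hypothetical sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 20, size=300)                  # years of experience
# multiplicative errors make the spread of y grow with its mean
y = 12000 * np.exp(0.08 * x + rng.normal(scale=0.1, size=300))

def fit_residuals(x, z):
    """Residuals from a straight-line least-squares fit of z on x."""
    Xd = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(Xd, z, rcond=None)
    return z - Xd @ beta

def spread_ratio(x, resid):
    """Residual spread in the upper third of x over that in the lower third."""
    lo, hi = np.quantile(x, [1 / 3, 2 / 3])
    return resid[x > hi].std() / resid[x < lo].std()

print(spread_ratio(x, fit_residuals(x, y)))          # well above 1: fanning out
print(spread_ratio(x, fit_residuals(x, np.log(y))))  # near 1 after taking ln(y)
```

In practice the check is done graphically, as in the steps above; the ratio here is just a crude numerical stand-in for eyeballing the residual plot.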
12.12: Some Pitfalls: Estimability, Multicollinearity and Extrapolation

Four problems can undermine a regression analysis:
Problem 1: Parameter Estimability
Problem 2: Multicollinearity
Problem 3: Extrapolation
Problem 4: Correlated Errors

Problem 1: Parameter Estimability. If x does not take on a sufficient number of different values, no single unique line can be estimated.

Problem 2: Multicollinearity. Multicollinearity exists when two or more of the independent variables in a regression are correlated. If xi and xj move together in some way, finding the impact on y of a one-unit change in either of them while holding the other constant will be difficult or impossible.

Multicollinearity can be detected in various ways.
A simple check is to calculate the correlation coefficient rij for each pair of independent variables in the model. Any significant rij may indicate a multicollinearity problem. If severe multicollinearity exists, the result may be:
1. Significant F-values but insignificant t-values
2. Signs on the β's opposite to those expected
3. Errors in the estimates, standard errors, etc.

Example: the Federal Trade Commission (FTC) ranks cigarettes according to their tar (x1), nicotine (x2), weight in grams (x3) and carbon monoxide (y) content. 25 data points (see Table 12.11) are used to estimate the model

E(y) = β0 + β1x1 + β2x2 + β3x3

The fitted model (see Figure 12.49) is

ŷ = 3.202 + .963x1 - 2.63x2 - .13x3

F = 78.98, p-value < .0001
t1 = 3.97, p-value = .0007
t2 = -0.67, p-value = .5072
t3 = -0.03, p-value = .9735

The negative signs on two of the variables and the insignificant t-values are suggestive of multicollinearity.
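The pairwise-correlation screen described above can be sketched on simulated data that mimics the near-linear tar-nicotine relationship. The 0.9 cutoff is a common rule of thumb rather than the text's significance test, and the data are hypothetical; assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
tar = rng.uniform(1, 30, size=n)
nicotine = 0.05 * tar + rng.normal(scale=0.05, size=n)  # nearly a function of tar
weight = rng.uniform(0.8, 1.2, size=n)

X = np.column_stack([tar, nicotine, weight])
names = ["tar", "nicotine", "weight"]

r = np.corrcoef(X, rowvar=False)   # pairwise correlation matrix of the x's
for i in range(3):
    for j in range(i + 1, 3):
        note = "  <- possible multicollinearity" if abs(r[i, j]) > 0.9 else ""
        print(f"r({names[i]}, {names[j]}) = {r[i, j]:.4f}{note}")
```

A formal version would test each rij for significance at a chosen α, as the text does for the FTC data.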
The coefficients of correlation rij provide further evidence:
r(tar, nicotine) = .9766
r(tar, weight) = .4908
r(weight, nicotine) = .5002
Each rij is significantly different from 0 at the α = .05 level.

Possible responses to problems created by multicollinearity in regression:
1. Drop one or more correlated independent variables from the model.
2. If all the x's are retained, (a) avoid making inferences about the individual β parameters from the t-tests, and (b) restrict inferences about E(y) and future y-values to values of the x's that fall within the range of the sample data.

Problem 3: Extrapolation. The data used to estimate the model provide information only on the range of values in the data set. There is no reason to assume that the dependent variable's response will be the same over a different range of values.

Problem 4: Correlated Errors. If the error terms are not independent (a frequent problem in time series), the model tests and prediction intervals are invalid. Special techniques are used to deal with time series models.
