POSTED ON: 7/23/2013 Public Domain
Slide 1: Multiple Regression
• Key Points about Multiple Regression
• Sample Homework Problem
• Solving the Problem with SPSS
• Logic for Multiple Regression

Slide 2: Key points about multiple regression
• Few, if any, phenomena in social and behavioral research can be explained with a single predictor. More realistically, social phenomena are very complex, requiring a number of predictors to model the relationship.
• Multiple regression is an extension of simple linear regression that enables us to include multiple predictors in our regression equation. The interpretation of a multiple regression is very similar to the interpretation of a simple linear regression, but there are important differences.

Slide 3: Similarities and differences - 1
• In both simple linear and multiple regression, there is an ANOVA test of the overall relationship.
• In both simple linear and multiple regression, R² represents the proportion of variance explained (error reduced) in predicting the dependent variable based on the independent variable or variables.
• In both simple linear and multiple regression, Multiple R represents the strength of the relationship and the effect size. In multiple regression it is always positive and is not equal to any of the beta coefficients.

Slide 4: Similarities and differences - 2
• In simple linear regression, the significance of the overall relationship and the significance of the single independent variable were the same. In multiple regression, there is a separate test of significance for the coefficient of each independent variable.
• There is no necessary relationship between the significance of the overall relationship and the significance of the relationships for each of the individual predictors. When the overall relationship is significant, it is possible that none, some, or all of the individual relationships will be significant.
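The overall statistics named on these slides are tied together by simple arithmetic. The sketch below is only an illustration: it derives Multiple R and the ANOVA F statistic from R², using the values reported in the sample homework problem later in this deck (R² = 0.753, three predictors, 95 valid cases). The function names are hypothetical and are not part of SPSS.

```python
import math


def multiple_r(r_squared: float) -> float:
    """Multiple R is the non-negative square root of R-squared."""
    return math.sqrt(r_squared)


def overall_f(r_squared: float, n_predictors: int, n_cases: int) -> tuple:
    """Return (F, df_regression, df_residual) for the overall ANOVA test."""
    df_regression = n_predictors
    df_residual = n_cases - n_predictors - 1
    # F compares explained variance per predictor with unexplained variance
    # per residual degree of freedom.
    f_stat = (r_squared / df_regression) / ((1 - r_squared) / df_residual)
    return f_stat, df_regression, df_residual


r2 = 0.753  # R-squared as rounded in the sample problem
f_stat, df1, df2 = overall_f(r2, n_predictors=3, n_cases=95)
print(f"Multiple R = {multiple_r(r2):.3f}")
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

Because the input R² is already rounded to three decimals, the F value computed this way is close to, but not exactly, the 92.67 cited in the problem; the degrees of freedom (3 and 91) match.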
Slide 5: Similarities and differences - 3
• Multiple regression is required to satisfy all of the assumptions of simple linear regression:
  1. The relationship is linear
  2. The residuals have the same variance
  3. The residuals are independent of each other
  4. The residuals are normally distributed
• Plus one additional assumption: the independent variables are independent of one another, i.e. they add to the variance explained in the dependent variable rather than explain the same variance explained by other independent variables.

Slide 6: Similarities and differences - 4
• In a multiple regression equation, the coefficient for each individual variable represents the change in the dependent variable that it is uniquely responsible for, i.e. after taking into account the relationships between the other independent variables and the dependent variable.
• When individual predictors are correlated with each other, part of the explanation of the dependent variable is made jointly by both, and is not credited to either individual predictor.
• In extreme cases, the relationship between independent variables is so strong that neither is credited with explaining the dependent variable, even though both might have a strong individual relationship to the dependent variable.

Slide 7: Similarities and differences - 5
• If this happens, a predictor that really has a strong relationship to the dependent variable may have a b coefficient that is not statistically significant. Interpreting the non-significant b coefficient to mean that the variable has no relationship would be an error.
• To satisfy the assumption of independence of variables, our regression must not include variables that are collinear.
• The diagnostic statistic for detecting multicollinearity is "tolerance," which SPSS includes in the table of coefficients.

Slide 8: Similarities and differences - 6
• In extreme cases of multicollinearity, SPSS cannot compute the regression equation. In this case, SPSS will exclude the variable that it identifies as producing the problem, even though we have told it to include the variable in the analysis.

Slide 9: Similarities and differences - 7
• Having more than one predictor in the regression equation leads to the question of which variable has the more important relationship to the dependent variable, i.e. which has the largest impact on the predicted scores.
• Since beta coefficients are standardized, the one with the largest absolute value (ignoring the sign) is the most important: beta is the change, in standard deviations, in the dependent variable that is produced by a one standard deviation change in the independent variable.

Slide 10: Change in response for sample size
• On the simple linear regression problems, the answer was "Incorrect application of a statistic" if the sample size available to the analysis was less than the number recommended by Tabachnick and Fidell.
• In reviewing problems, there were numerous occasions when a smaller sample yielded a statistically significant result, making the response "Incorrect application of a statistic" itself inappropriate.
• For these problems, I am changing the response to adding a caution when the answer is true. This reflects the possibility that planning a sample of the given size risked not finding a significant result, but does not negate an otherwise useful result.

Slide 11: Sample homework problem: Multiple regression - part 1
This is the general framework for the problems in the homework assignment on multiple regression.

Based on information from the data set 2001WorldFactbook.sav, is the following statement true, false, or an incorrect application of a statistic? Use .05 for alpha.

"Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).

"Population growth rate" significantly predicted "infant mortality rate", β = -0.393, t(91) = -4.04, p < .001. Higher values of "population growth rate" were inversely related to lower values of "infant mortality rate".

"Total fertility rate" significantly predicted "infant mortality rate", β = 0.965, t(91) = 8.90, p < .001. Higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".

The problem includes a statement for the overall relationship, an individual statement for each of the independent variables, and a statement on the relative importance of predictors.

Slide 12: Sample homework problem: Multiple regression - part 2 (cont'd)
"Percent of the population below poverty line" significantly predicted "infant mortality rate", β = 0.280, t(91) = 4.41, p < .001. Higher values of "percent of the population below poverty line" were directly related to higher values of "infant mortality rate".

"Total fertility rate" [fertrate] was the most important predictor of the value of "infant mortality rate" [infmort] compared to the other independent variables.

o True
o True with caution
o False
o Incorrect application of a statistic

Slide 13: Sample homework problem: Data set and alpha
Based on information from the data set 2001WorldFactbook.sav, is the following statement true, false, or an incorrect application of a statistic? Use .05 for alpha.

The first paragraph identifies:
• The data set to use, e.g. 2001WorldFactbook.sav
• The alpha level for the hypothesis test

Slide 14: Sample homework problem: The overall relationship
"Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).

The second paragraph states the finding that we want to verify with multiple regression. The finding identifies:
• The independent variables
• The dependent variable
• The strength of the relationship
Slide 15: Sample homework problem: Individual relationships
"Population growth rate" significantly predicted "infant mortality rate", β = -0.393, t(91) = -4.04, p < .001. Higher values of "population growth rate" were inversely related to lower values of "infant mortality rate".

"Total fertility rate" significantly predicted "infant mortality rate", β = 0.965, t(91) = 8.90, p < .001. Higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".

Each of the paragraphs for the individual independent variables contains:
• A statement about the significance of the relationship between the individual independent variable and the dependent variable
• A statement about the direction of the relationship between the individual independent variable and the dependent variable

Slide 16: Sample homework problem: Importance of variables
"Total fertility rate" [fertrate] was the most important predictor of the value of "infant mortality rate" [infmort] compared to the other independent variables.

The last paragraph is a statement of the relative importance of the predictors, e.g. which variable makes the largest change in the dependent variable.

o True: the answer will be True if all parts of the problem are correct.
o True with caution: the answer will be True with caution if the analysis includes an ordinal variable or we do not meet the sample size requirement.
o False: the answer will be False if any part of the problem is not correct.
o Incorrect application of a statistic: the answer will be Incorrect application of a statistic if the level of measurement or multicollinearity requirement is violated.

Slide 17: Solving the problem with SPSS: Level of measurement
Multiple regression requires that the dependent variable be interval and the independent variables be interval or dichotomous.
• "Infant mortality rate" [infmort] is interval level, satisfying the requirement for the dependent variable.
• "Population growth rate" [pgrowth] is interval level, satisfying the requirement for the independent variable.
• "Total fertility rate" [fertrate] is interval level, satisfying the requirement for the independent variable.
• "Percent of the population below poverty line" [poverty] is interval level, satisfying the requirement for the independent variable.

Slide 18: Solving the problem with SPSS: Multiple regression - 1
Before we can address the other issues involved in solving the problem, we need to generate the SPSS output.
Select Regression > Linear… from the Analyze menu.

Slide 19: Solving the problem with SPSS: Multiple regression - 2
First, move the dependent variable infmort to the Dependent list box.
Second, move the independent variables pgrowth, fertrate, and poverty to the Independents list box.
Third, click on the Statistics button to add the additional statistics.

Slide 20: Solving the problem with SPSS: Multiple regression - 3
First, in addition to the SPSS defaults, we mark the check boxes for Descriptives and Collinearity diagnostics.
Second, click on the Continue button to close the dialog box.
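The level-of-measurement requirement checked at the start of this walkthrough, together with the rule that an ordinal variable earns a caution rather than a rejection, can be written as a small rule. This is a minimal sketch: the function name and the lowercase level labels are assumptions made for illustration, and the measurement levels themselves still have to be judged by the analyst.

```python
def check_measurement_levels(dependent_level, independent_levels):
    """Return 'ok', 'caution', or 'inappropriate' for a planned multiple regression.

    Levels are given as strings: 'nominal', 'dichotomous', 'ordinal', 'interval'.
    """
    dep = dependent_level.lower()
    indeps = [level.lower() for level in independent_levels]

    # The dependent variable must be interval (ordinal is allowed with a caution).
    if dep not in ("interval", "ordinal"):
        return "inappropriate"
    # Independent variables must be interval or dichotomous (ordinal with a caution).
    if any(level not in ("interval", "ordinal", "dichotomous") for level in indeps):
        return "inappropriate"
    if dep == "ordinal" or "ordinal" in indeps:
        return "caution"
    return "ok"


# Sample problem: infmort predicted by pgrowth, fertrate, and poverty,
# all of which the slides classify as interval level.
print(check_measurement_levels("interval", ["interval", "interval", "interval"]))
```

For the sample problem every variable is interval, so the check passes without a caution.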
Slide 21: Solving the problem with SPSS: Multiple regression - 4
When we return to the Linear Regression dialog box, we click on OK to obtain the output.

Slide 22: Solving the problem with SPSS: Multicollinearity
The tolerance values for all of the independent variables are larger than 0.10: "population growth rate" [pgrowth] (0.287), "total fertility rate" [fertrate] (0.230) and "percent of the population below poverty line" [poverty] (0.673).
Multicollinearity is not a problem in this regression analysis.

Slide 23: Solving the problem with SPSS: Sample size
Using the rule of thumb from Tabachnick and Fidell that the required number of cases should be the larger of (the number of independent variables x 8) + 50 or the number of independent variables + 105, multiple regression with three independent variables requires 108 cases (the larger of 74 and 108). With 95 valid cases, the sample size requirement is not satisfied. A caution should be added to our findings.
NOTE: adding a caution to our findings, rather than concluding that it is not an appropriate use of statistics, is a more reasonable response than what we did for simple linear regression.

Slide 24: Solving the problem with SPSS: Interpreting the overall relationship - 1
The first sentence in the finding states that:
"Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).

The R² of .753 is the reduction in error achieved by using scores for "population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] to predict scores for "infant mortality rate" [infmort].

The overall relationship between the independent variables "population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] and the dependent variable "infant mortality rate" [infmort] was statistically significant, R² = 0.753, F(3, 91) = 92.67, p < .001.

Slide 25: Solving the problem with SPSS: Interpreting the overall relationship - 2
The first sentence in the finding states that:
"Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).

We reject the null hypothesis that all of the partial slopes (b coefficients) = 0 and conclude that at least one of the partial slopes (b coefficients) ≠ 0.

Slide 26: Solving the problem with SPSS: Interpreting the overall relationship - 3
The first sentence in the finding states that:
"Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).

The Multiple R of 0.868 was correctly characterized as a strong relationship, using Cohen's criteria:
• r < .1 = Trivial
• .1 ≤ r < .3 = Small or weak
• .3 ≤ r < .5 = Medium or moderate
• r ≥ .5 = Large or strong

Slide 27: Solving the problem with SPSS: Interpreting individual relationships - 1
The second sentence in the finding states that:
"Population growth rate" significantly predicted "infant mortality rate", β = -0.393, t(91) = -4.04, p < .001. Higher values of "population growth rate" were inversely related to lower values of "infant mortality rate".

The individual relationship between the independent variable "population growth rate" [pgrowth] and the dependent variable "infant mortality rate" [infmort] was statistically significant, β = -0.393, t(91) = -4.04, p < .001.

We reject the null hypothesis that the partial slope (b coefficient) for the variable "population growth rate" = 0 and conclude that the partial slope (b coefficient) for the variable "population growth rate" ≠ 0.

Slide 28: Solving the problem with SPSS: Interpreting individual relationships - 2
The second sentence in the finding states that:
"Population growth rate" significantly predicted "infant mortality rate", β = -0.393, t(91) = -4.04, p < .001. Higher values of "population growth rate" were inversely related to lower values of "infant mortality rate".

The negative sign of the B coefficient and the Beta coefficient implies that higher values of "population growth rate" were inversely related to lower values of "infant mortality rate", i.e. higher values of "population growth rate" were associated with lower values of "infant mortality rate".

Slide 29: Solving the problem with SPSS: Interpreting individual relationships - 3
The third sentence in the finding states that:
"Total fertility rate" significantly predicted "infant mortality rate", β = 0.965, t(91) = 8.90, p < .001. Higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".

The individual relationship between the independent variable "total fertility rate" [fertrate] and the dependent variable "infant mortality rate" [infmort] was statistically significant, β = 0.965, t(91) = 8.90, p < .001.

We reject the null hypothesis that the partial slope (b coefficient) for the variable "total fertility rate" = 0 and conclude that the partial slope (b coefficient) for the variable "total fertility rate" ≠ 0.
Slide 30: Solving the problem with SPSS: Interpreting individual relationships - 4
The third sentence in the finding states that:
"Total fertility rate" significantly predicted "infant mortality rate", β = 0.965, t(91) = 8.90, p < .001. Higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".

The positive sign of the B coefficient and the Beta coefficient implies that higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".

Slide 31: Solving the problem with SPSS: Interpreting individual relationships - 5
The fourth sentence in the finding states that:
"Percent of the population below poverty line" significantly predicted "infant mortality rate", β = 0.280, t(91) = 4.41, p < .001. Higher values of "percent of the population below poverty line" were directly related to higher values of "infant mortality rate".

The individual relationship between the independent variable "percent of the population below poverty line" [poverty] and the dependent variable "infant mortality rate" [infmort] was statistically significant, β = 0.280, t(91) = 4.41, p < .001.

We reject the null hypothesis that the partial slope (b coefficient) for the variable "percent of the population below poverty line" = 0 and conclude that the partial slope (b coefficient) for the variable "percent of the population below poverty line" ≠ 0.

Slide 32: Solving the problem with SPSS: Interpreting individual relationships - 6
The fourth sentence in the finding states that:
"Percent of the population below poverty line" significantly predicted "infant mortality rate", β = 0.280, t(91) = 4.41, p < .001. Higher values of "percent of the population below poverty line" were directly related to higher values of "infant mortality rate".

The positive sign of the B coefficient and the Beta coefficient implies that higher values of "percent of the population below poverty line" were directly related to higher values of "infant mortality rate".
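The direction statements verified on the preceding slides, and the importance comparison that standardized coefficients support, both follow mechanically from the sign and absolute size of each beta. The sketch below shows that reading using the three beta values cited in the problem; the helper names are hypothetical.

```python
# Standardized (beta) coefficients as reported in the sample problem.
betas = {"pgrowth": -0.393, "fertrate": 0.965, "poverty": 0.280}


def direction(beta: float) -> str:
    """Describe the direction implied by the sign of a coefficient."""
    if beta > 0:
        return "direct"
    if beta < 0:
        return "inverse"
    return "none"


def most_important(coefficients: dict) -> str:
    """Return the predictor whose beta has the largest absolute value."""
    return max(coefficients, key=lambda name: abs(coefficients[name]))


for name, beta in betas.items():
    print(f"{name}: beta = {beta:+.3f}, {direction(beta)} relationship")
print("Most important predictor:", most_important(betas))
```

Comparing absolute values matters here because a large negative beta would outrank a smaller positive one; the sign only determines the direction.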
Slide 33: Solving the problem with SPSS: Interpreting individual relationships - 7
The fifth sentence in the finding states that:
"Total fertility rate" [fertrate] was the most important predictor of the value of "infant mortality rate" [infmort] compared to the other independent variables.

"Total fertility rate" [fertrate] was the most important predictor because the absolute value of its beta coefficient (0.965) was larger than the absolute value of the beta coefficients for the other independent variables.

Slide 34: Solving the problem with SPSS: Answering the question
The findings for this problem state that:
• "Population growth rate" [pgrowth], "total fertility rate" [fertrate] and "percent of the population below poverty line" [poverty] significantly predicted "infant mortality rate" [infmort]. The relationship was strong and reduced the error in predicting "infant mortality rate" by approximately 75% (R² = 0.753, F(3, 91) = 92.67, p < .001).
• "Population growth rate" significantly predicted "infant mortality rate", β = -0.393, t(91) = -4.04, p < .001. Higher values of "population growth rate" were inversely related to lower values of "infant mortality rate".
• "Total fertility rate" significantly predicted "infant mortality rate", β = 0.965, t(91) = 8.90, p < .001. Higher values of "total fertility rate" were directly related to higher values of "infant mortality rate".
• "Percent of the population below poverty line" significantly predicted "infant mortality rate", β = 0.280, t(91) = 4.41, p < .001. Higher values of "percent of the population below poverty line" were directly related to higher values of "infant mortality rate".
• "Total fertility rate" [fertrate] was the most important predictor of the value of "infant mortality rate" [infmort] compared to the other independent variables.

All of the statements of findings are true, so the answer to the question is True with caution. The caution is added because we did not satisfy the required sample size.

Slide 35: Logic for multiple regression: Level of measurement
Measurement level of independent variable?
• Nominal: Inappropriate application of a statistic
• Interval/Ordinal/Dichotomous: go on to the dependent variable
Measurement level of dependent variable?
• Nominal/Dichotomous: Inappropriate application of a statistic
• Interval/Ordinal: go on to the next step
Strictly speaking, the test requires an interval level variable. We will allow ordinal level variables with a caution.

Slide 36: Logic for multiple regression: Multicollinearity
Compute linear regression including descriptive statistics.
Tolerance for all independent variables ≥ 0.10?
• No: Inappropriate application of a statistic
• Yes: go on to the next step

Slide 37: Logic for multiple regression: Sample size requirement
Compute linear regression including descriptive statistics.
Valid cases satisfies computed requirement?
• No: Caution added to any true findings
• Yes: go on to the next step
The sample size requirement is the larger of:
• the number of independent variables x 8 + 50
• the number of independent variables + 105
NOTE: violation of sample size requirements is a caution rather than an inappropriate application of a statistic.

Slide 38: Logic for multiple regression: Significant, non-trivial overall relationship
Probability for F-test for all coefficients less than or equal to alpha?
• No: False
• Yes: go on to the next question
Effect size (Multiple R) is not trivial by Cohen's scale, i.e. equal to or larger than 0.10?
• No: False
• Yes: go on to the next step

Slide 39: Logic for multiple regression: Strength of overall relationship
Strength of relationship correctly interpreted (Multiple R)?
• No: False
• Yes: go on to the next question
Reduction in error correctly interpreted based on Multiple R²?
• No: False
• Yes: go on to the next step

Slide 40: Logic for multiple regression: Significance and direction of individual relationships
These steps must be repeated for each independent variable.
Probability for t-test for B coefficient less than or equal to alpha?
• No: False
• Yes: go on to the next question
Direction of relationship correctly interpreted based on B or Beta coefficient?
• No: False
• Yes: go on to the next step

Slide 41: Logic for multiple regression: Importance of individual predictors
Predictor with largest absolute Beta identified as most important?
• No: False
• Yes: go on to the next question
The statistics in the SPSS output match all of the statistics cited in the problem?
• No: False
• Yes: True
Add caution if dependent or independent variable is ordinal or we do not meet sample size requirement.
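The decision logic on the last seven slides can be condensed into one function. The following is a sketch under stated assumptions, not the instructor's grading procedure: the function and parameter names are hypothetical, the judgments that require reading the SPSS output (for example, whether directions were interpreted correctly) are passed in as booleans, and the reported "p < .001" values are entered as the upper bound 0.001.

```python
def required_sample_size(n_predictors: int) -> int:
    """Tabachnick and Fidell rule of thumb cited on the slides."""
    return max(8 * n_predictors + 50, n_predictors + 105)


def cohen_label(r: float) -> str:
    """Cohen's criteria for effect size as listed on slide 26."""
    r = abs(r)
    if r < 0.1:
        return "trivial"
    if r < 0.3:
        return "weak"
    if r < 0.5:
        return "moderate"
    return "strong"


def answer_problem(*, levels_ok, any_ordinal, tolerances, n_valid,
                   p_overall, multiple_r, claimed_strength,
                   reduction_in_error_correct, p_individual,
                   directions_correct, importance_correct,
                   statistics_match, alpha=0.05):
    """Walk the checks in slide order and return one of the four answers."""
    n_predictors = len(tolerances)
    # Level of measurement and multicollinearity violations end the analysis.
    if not levels_ok or min(tolerances) < 0.10:
        return "Incorrect application of a statistic"
    # Ordinal variables or too few cases only add a caution to a true answer.
    caution = any_ordinal or n_valid < required_sample_size(n_predictors)
    checks = [
        p_overall <= alpha,                       # overall F-test significant
        abs(multiple_r) >= 0.10,                  # effect size not trivial
        cohen_label(multiple_r) == claimed_strength,
        reduction_in_error_correct,
        all(p <= alpha for p in p_individual),    # each t-test significant
        directions_correct,
        importance_correct,
        statistics_match,
    ]
    if not all(checks):
        return "False"
    return "True with caution" if caution else "True"


# Values taken from the sample problem in this deck.
result = answer_problem(
    levels_ok=True, any_ordinal=False,
    tolerances=[0.287, 0.230, 0.673], n_valid=95,
    p_overall=0.001, multiple_r=0.868, claimed_strength="strong",
    reduction_in_error_correct=True, p_individual=[0.001, 0.001, 0.001],
    directions_correct=True, importance_correct=True, statistics_match=True,
)
print(result)
```

With three predictors the rule requires 108 cases, so the 95 valid cases in the sample problem trigger the caution while every other check passes.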