VIEWS: 11,343 PAGES: 18 CATEGORY: Statistics POSTED ON: 10/14/2009 Public Domain
Regression Analysis in Marketing Research

Prediction
Two approaches to prediction:
Extrapolation: using past events to predict future events (like steering a canoe by looking behind you).
Predictive modeling: using one or more other variables to predict another variable.
Ex: Could you predict success in a course if you knew the hours spent studying, the number of other semester hours taken, and the hours spent working?
The difference between a predicted value and an actual value is called a residual.

Variables
The dependent variable is what we are trying to predict; it is typically represented by Y.
The independent variable is a variable used to predict the dependent variable; it is typically represented by x.
Note that the independent variable predicts the dependent variable. Regression alone does not show that the independent variable (x) causes changes in the dependent variable (Y).
Regression typically uses interval- or ratio-scaled variables as the independent and dependent variables.
You can also use dummy coding (1, 0) for nominally scaled measures: a “1” if a characteristic is present, a “0” if that characteristic is absent.

Bivariate Regression
Bivariate linear regression (simple regression) investigates a straight-line relationship of the type
Y = a + bx + e
where Y is the dependent variable, x is the independent variable, a and b are two constants to be estimated, and e is the error term (the residual).
Regression fits the data to a straight line, where a is the intercept and b is the slope of the line.
SPSS fits the line to minimize the sum of the squared vertical distances between the points and the regression line. This is called the least squares criterion.

Bivariate Regression in SPSS
Step 1: This is “Y” (the dependent variable).
Step 2: This is “x” (the independent variable).

Bivariate Regression in SPSS: Results
19.0% of the variation in BIAmtrak can be accounted for by Aamtrak, meaning 81% of the variation is unaccounted for.
The equation is significantly better than chance, as evidenced by the F-value.
The significant t-value suggests that Aamtrak belongs in the equation.
The significant constant indicates there is considerable variation unexplained.
The unstandardized equation would be:
Y = 3.132 + .507(Aamtrak)
Thus, if a subject had an Aamtrak score of 2, the equation would predict
Y = 3.132 + .507(2) = 4.146

Multiple Regression
Multiple regression allows for the simultaneous investigation of two or more independent variables and a single dependent variable.
Multiple regression is quite useful because it is likely that several variables are related to a dependent variable.
Regression is useful when we want to explain, predict, or control a dependent variable.
The unstandardized coefficients allow you to use the equation in a very practical way. The form of an unstandardized equation is
Y = a + b1X1 + b2X2 + … + biXi

Each coefficient in multiple regression is also known as a coefficient of partial regression: it assesses the relationship between its variable (Xi) and the dependent variable (Y) that is not accounted for by the other variables in the model.
Each variable introduced into the equation needs to account for variation in Y that has not been accounted for by any of the X variables already entered.
We typically assume that the X variables are uncorrelated with one another. If they are highly correlated, we have a problem of multicollinearity.

Multicollinearity is a problem in regression; it occurs when the independent variables are highly correlated with one another.
Multicollinearity does not affect the model's overall ability to predict, but it can affect the interpretation of individual coefficients.
Multicollinearity can be assessed with a statistic, the variance inflation factor (VIF).
As a rule of thumb, if VIF < 10, multicollinearity is not a problem. If VIF > 10, remove the variable from the independent variables and run the analysis again.
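To make the VIF rule of thumb concrete, here is a minimal Python sketch for the simplest case of exactly two predictors, where the VIF of either predictor reduces to 1 / (1 - r^2), with r the correlation between the two. The function names and the data (ad_spend, store_visits, shelf_space) are hypothetical and are not taken from the Amtrak study. With more than two predictors, r^2 would be replaced by the R2 from regressing each predictor on all of the others.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictor values, for illustration only.
ad_spend = [10, 12, 15, 18, 20, 25]
store_visits = [11, 13, 14, 19, 21, 24]  # moves almost in step with ad_spend
shelf_space = [3, 7, 2, 8, 4, 6]         # only weakly related to ad_spend

for name, other in [("store_visits", store_visits), ("shelf_space", shelf_space)]:
    vif = vif_two_predictors(ad_spend, other)
    verdict = "possible multicollinearity" if vif > 10 else "acceptable"
    print(f"ad_spend vs {name}: VIF = {vif:.2f} ({verdict})")
```

In this made-up data the first pair exceeds the cutoff of 10 and the second stays close to 1, the value VIF takes when the predictors are uncorrelated.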
Interpreting Regression Results

R2
It is the coefficient of determination: it indicates the proportion of the variation in Y explained by the variation in the independent variables (Xi), often reported as a percentage.
It indicates the goodness of fit of your model (regression equation).
It ranges from 0 to 1.0.

Std. error of the estimate
It measures the accuracy of predictions made using the regression equation.
The smaller the std. error of the estimate, the narrower the confidence interval (the more precise the prediction).

F-values
The F-value determines whether the equation is better than chance.
A p-value of .05 or lower indicates we would reject the null hypothesis that the independent variables are not related to the dependent variable.
The F-value does not measure whether your model does a good job of predicting, only that it is better than chance.

T-tests
Examine the t-values to determine whether to include additional variables in the model.
A variable's t-value should be statistically significant for that variable to be included in your analysis.

Unstandardized coefficients (abbreviated as B)
These are written in the metric of the measure, which makes them useful for prediction.

Standardized coefficients (beta)
These are written in a standardized form, typically ranging from -1 to 1.
The larger the absolute value of the standardized coefficient, the more important the predictor is to the model (i.e., the more unique variation in Y that can be accounted for by that variable).
Introducing more variables into an equation typically explains more variation (increases R2), but each variable must be a significant contributor of otherwise unexplained variation to be included in the model (see the t-test results to determine this).
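The R2 and std. error of the estimate described above can be computed by hand for a one-predictor model. The sketch below is a minimal Python illustration, not SPSS output: the function names and the five data points (attitude, intention) are hypothetical. The std. error of the estimate divides by n - 2 because two constants (a and b) are estimated.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least squares intercept (a) and slope (b) for Y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_summary(xs, ys):
    """Return a, b, R2, and the std. error of the estimate."""
    a, b = fit_line(xs, ys)
    predicted = [a + b * x for x in xs]
    mean_y = sum(ys) / len(ys)
    # Residual variation: squared gaps between actual and predicted values.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Total variation of Y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    std_error = sqrt(ss_res / (len(xs) - 2))
    return a, b, r_squared, std_error


# Hypothetical survey scores, for illustration only.
attitude = [1, 2, 3, 4, 5]
intention = [2, 4, 5, 4, 5]

a, b, r_squared, std_error = fit_summary(attitude, intention)
print(f"Y = {a:.3f} + {b:.3f}x")
print(f"R2 = {r_squared:.3f}, std. error of the estimate = {std_error:.3f}")
```

A smaller ss_res raises R2 and lowers the std. error of the estimate at the same time, which is why the two are read together when judging fit.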
Multiple Regression in SPSS
Step 1
Step 2

Multiple Regression in SPSS: Results
1. Note that the circled t-values for two of the variables are not significant. These variables do not supply any unique variation to the prediction of the dependent variable, so they should be removed from the analysis.
2. Note the standardized coefficients (beta): the greater the beta, the more important a variable is to the prediction of the dependent variable.
3. Finally, note the size of the t-value for the constant; this suggests the model still has considerable unexplained variation.
Y = 3.219 + .235(Aamtrak, good/bad) + .245(Aamtrak, like/dislike) - .0638(Aauto, good/bad)
The model indicates that the five predictors account for 21.5% of the variation in Aamtrak.
The F-value suggests that the equation is significantly better than chance.

Multiple Regression Example: Toy Manufacturer Sales
Hypothesis
How are weekly toy sales affected by changes in levels of advertising, the use of sales reps vs. agents for calling on retailers, and local school enrollments?
Toy Sales = Advertising (X1) + sales rep/agent (X2) + school enrollment (X3) + e
To do this, we need to dummy code: sales rep = 1, agent = 0.
This produces the following equation:
Y = 102.18 + 3.87X1 + 115.2X2 + 6.73X3
R2 = 0.845

So what do those coefficients mean?
If the other variables are held constant, you could state:
X1: Each $1 spent on advertising yields $3.87 in sales.
X2: The use of a sales rep instead of an agent contributes $115.20 in additional sales.
X3: Each additional student enrolled in the local schools yields $6.73 in toy sales.
These three variables explain 84.5% of the variation in toy sales.

If we spent $1000 on advertising, used a sales rep, and there are 500 children in the local schools, what would sales be?
Y = 102.18 + 3.87(1000) + 115.2(1) + 6.73(500)
Y = 102.18 + 3870 + 115.2 + 3365
Y = $7,452.38
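The toy sales equation can be wrapped in a small function to try other scenarios. This is a minimal Python sketch using the coefficients from the example above; the function name predict_toy_sales and its parameter names are hypothetical.

```python
def predict_toy_sales(advertising, uses_sales_rep, enrollment):
    """Predicted weekly toy sales from Y = 102.18 + 3.87X1 + 115.2X2 + 6.73X3."""
    # Dummy coding from the example: sales rep = 1, agent = 0.
    rep_dummy = 1 if uses_sales_rep else 0
    return 102.18 + 3.87 * advertising + 115.2 * rep_dummy + 6.73 * enrollment


# Scenario from the example: $1000 advertising, a sales rep, 500 children.
with_rep = predict_toy_sales(1000, True, 500)
# Same scenario, but calling on retailers through an agent instead.
with_agent = predict_toy_sales(1000, False, 500)

print(f"Sales rep: ${with_rep:,.2f}")
print(f"Agent:     ${with_agent:,.2f}")
print(f"Difference attributable to the sales rep: ${with_rep - with_agent:,.2f}")
```

Holding advertising and enrollment fixed, the gap between the two predictions equals the X2 coefficient, which is what "if the other variables are held constant" means in practice.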