PSYM021 Introduction to Methods & Statistics
Week Five: Statistical techniques III
Cris Burgess

Regression
  Web support
  Simple regression - a reminder
  Multiple regression - an introduction
  Reporting regression analyses
  Choosing regressors (predictor variables)
  Choosing a regression model
  Model checking - residuals

Simple Regression
  Establish equation for the best-fit line: y = bx + a
  “Best-fit” line is the same as the “regression” line
  b is the “regression coefficient” for x
  x is the “predictor” or “regressor” variable for y
  a is the constant

Multiple Regression
  Establish equation for the best-fit line: y = b1x1 + b2x2 + b3x3 + a
  Where:
    b1 = regression coefficient for variable x1
    b2 = regression coefficient for variable x2
    b3 = regression coefficient for variable x3
    a = constant

Multiple Regression: R² - “Goodness of fit”
  Model Summary

    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .721(a)   .520       .399                17.70134

    a. Predictors: (Constant), AGE, GENDER, INCOME

  For multiple regression, R² will get larger (or at least never smaller) every time another independent variable (regressor/predictor) is added to the model
  Add “work stress” to model?
  New regressor may only provide a tiny improvement in the amount of variance in the data explained by the model
  Need to establish the 'added value' of each additional regressor in predicting the DV

Multiple Regression: R²adj - “adjusted R-square”
  Takes into account the number of regressors in the model
  Calculated as: R²adj = 1 - (1 - R²)(N - 1)/(N - n - 1)
  where:
    N = number of data points
    n = number of regressors
  You don't need to memorise this equation, but...
  Note that R²adj will always be smaller than R² (for any model with at least one regressor and R² below 1)

How well does a model explain the variation in the dependent variable?
  “Effectiveness” vs “Efficiency”
  Effectiveness: maximises R²
    ie: maximises proportion of variance explained by model
  Efficiency: maximises increase in R²adj upon adding another regressor
    ie: if new regressor doesn't add much to the variance explained, it is not worth adding

How well does a model explain the variation in the dependent variable?
  Effectiveness (R² and R²adj)
    0 - 25%     very poor and likely to be unacceptable
    25 - 50%    poor, but may be acceptable
    50 - 75%    good
    75 - 90%    very good
    90% +       likely that there is something wrong with your analysis

Are the regressors, taken together, significantly associated with the dependent variable?
  ANOVA(b)

    Model           Sum of Squares   df   Mean Square   F       Sig.
    1  Regression   4065.388          3   1355.129      4.325   .028(a)
       Residual     3760.050         12    313.337
       Total        7825.438         15

    a. Predictors: (Constant), AGE, GENDER, INCOME
    b. Dependent Variable: DEPRESS

  Analysis of Variance test checks to see if the model, as a whole, has a significant relationship with the DV
  Part of the predictive 'value' of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole (i.e. all regressors/IVs together)
  Read off ANOVA table in SPSS output, and report as you did in week 3/4 assignments

What relationship does each individual regressor have with the dependent variable?
  Coefficients(a)

                    Unstandardized Coefficients   Standardized Coefficients
    Model           B           Std. Error        Beta                        t        Sig.
    1  (Constant)   68.285      15.444                                        4.421    .001
       INCOME       -9.34E-02     .029            -.682                      -3.178    .008
       GENDER        3.306       8.942             .075                        .370    .718
       AGE           -.162        .344            -.101                       -.470    .646

    a. Dependent Variable: DEPRESS

  SPSS output table entitled Coefficients
  Column headed Unstandardised coefficients - B
  Gives regression coefficient for each regressor variable (IV)
  “With all the other variables held constant”
  Units of coefficient are units of the DV per one unit of the regressor (IV)

What relationship does each individual regressor have with the dependent variable?
  Units of coefficient are units of the dependent variable per one unit of the regressor
  eg:
    dependent variable: score on video game (in points)
    regressor: time of day (in hours)
    B coefficient for time = 844.57 (points per hour)
    score = (B coefficient x time) + constant
    score = (844.57 x time) - 4239.6
  This means that for every increase of one hour in the variable time, we would predict that a person's score will increase by 844.57 points

What relationship does each individual regressor have with the dependent variable?
  dependent variable: score on video game
  regressor: gender
  Gender coded so that: 1 = male, 2 = female
  Let B coefficient for gender = 100.00
  So, score = (100.00 x gender) + constant
  Adding “1” to the variable gender means that we go from male to female
  This means that females would be expected to score 100.00 points more than males
  Remember that the B coefficient is calculated on the basis that 1 = male and 2 = female (different coding will give a different coefficient)

Which regressor has the most effect on the dependent variable?
  Units for each regression coefficient are different, so we must standardise them if we want to compare one with another
  Column headed Standardised coefficients - Beta
  Can compare the Beta weights for each regressor variable to compare effects of each on the dependent variable
  Larger Beta weight (in absolute size, ignoring the sign) indicates stronger effect of regressor on values of DV

Are the relationships of each regressor with the dependent variable statistically significant?
  Assessed using a t-test
  Check values in columns headed t and Sig.
  If the regression coefficient is negative, then the t-value will also be negative (the sign does not matter; it is the size of t that is important)

Reporting regression analyses
  How should I report a regression analysis?

Reporting Regression analyses
  Describe the characteristics of the model before you describe the significance of the relationship
  So:
    1. R², R²adj - how well does the model fit the data?
    2. F(m,n) - is the relationship significant?
    3. Regression equation - how to calculate values of DV from known values of IVs?
    4. Describe results in plain English

Reporting Regression analyses
  We want to predict IQ score using brain size (MRI), height and gender as regressors
  Units:
    IQ: IQ points
    brain size (MRI): pixels
    height: centimetres
    gender: 0 = male, 1 = female

Reporting Regression analyses (1)
  SPSS output tells us that:
    R² = 21.7%
    R²adj = 14.6%

Reporting Regression analyses (2)
  SPSS output tells us that:
    F(3,33) = 3.051, p < 0.05

Reporting Regression analyses (3)
  Regression equation: y = b1x1 + b2x2 + b3x3 + a
    IQ = (1.824 x 10⁻⁴) MRI - 0.316 height + 2.426 gender + (-6.411)
       = 0.0001824 MRI - 0.316 height + 2.426 gender + (-6.411)
       = 0.0002 MRI - 0.316 height + 2.426 gender + (-6.411)

Reporting Regression analyses (4)
  “The regression was a poor fit, describing only 21.7% of the variance in IQ (R²adj = 14.6%), but the overall relationship was statistically significant (F(3,33) = 3.05, p < 0.05).”
  “With other variables held constant, IQ scores were negatively related to height, decreasing by 0.32 IQ points for every extra centimetre in height, and positively related to brain size, increasing by 0.0002 IQ points for every extra pixel of the scan. Women tended to have higher scores than men, by 2.43 IQ points. However, the effect of brain size (MRI) was the only significant effect (t(33) = 2.75, p = 0.01).”

Break
  Five minutes - please be back promptly

Selecting Regressors
  What do we want of a regressor?
  To have 'a significant effect' on the dependent variable
  Ability to 'discriminate' between values of the dependent variable

Selecting Regressors
  How well do potential regressors predict the Dependent Variable?
  Dichotomous variable (eg: gender)
  Compare using t-test
  If significant, then possible regressor predicts differences in dependent variable
  [Figure: dependent variable for Male and Female; x-axis: possible regressor (gender)]

Selecting Regressors
  How well do potential regressors predict the Dependent Variable?
  Continuous variable (eg: height)
  Compare using correlation
  If significant, then possible regressor predicts differences in dependent variable
  [Figure: scatterplot of dependent variable against possible regressor (height)]

Selecting Regressors
  Some of the 'discriminatory value' in a regressor may be accounted for by regressors present in the model already
    gender, income, height
    age, experience, value of property
  'In the presence of all regressors'
  Adding a regressor may not add as much to the model's predictive value as you might have anticipated

What makes the best model?
  Same number of regressors
    Choose model with highest value of R²adj
    This gives 'best value' per regressor
    Will also have the highest value of R² and F
  Different number of regressors
    Highest value of R²adj (more regressors)
    Highest value of F (fewer regressors)

Efficiency vs Effectiveness
  Effective: highest R² ('most complete')
    will have more regressors
    will be effective, but not efficient
  Efficient: highest F-ratio ('most significant')
    will have fewer regressors
    will be efficient, but not particularly effective
  Compromise: largest increase in R²adj (best of both worlds)
    will contain only the 'best' regressors available
    manageable number of regressors and reasonably effective

Minitab's BREG command
  Tries every possible combination of available regressors (up to maximum of 20)
  eg: 20 regressors give over 1,000,000 different models
  Command:
    Dependent variable is in column 10
    Independent variables in columns 1 to 6
    BREG C10 C1-C6
  Will not be required to carry out this type of analysis in exam, but you need to be able to interpret output

Sample of BREG output
  MTB > BREG C13 C1-C12

  Best Subsets Regression
  Response is prodebt
  304 cases used 160 cases contain missing values.

    Vars   R-Sq   Adj. R-Sq   C-p   s
     7     19.3   17.4        7.3   0.65539
     7     19.1   17.2        7.8   0.65602
     8     19.9   17.7        6.9   0.65388
     8     19.5   17.4        8.2   0.65536
     9     20.2   17.8        7.8   0.65375
     9     20.1   17.6        8.3   0.65434
    10     20.4   17.6        9.3   0.65427

  [Columns of X marks showing which of the 12 regressors each model includes are not reproduced]

BREG output
  Best two models for each possible number of regressors are displayed in output
  Compare R²adj values directly
  Select best model(s)
  Run normal regression in SPSS for each selected model
  Compare F-ratio values

Best Subset Regression model
  Identify best subset of regressors from BREG output
  Must run ordinary regression procedure
    calculates F-ratio
    calculates individual coefficients and significance
  Highest R²adj values result in significant F-ratios
    if F-ratio not significant, check data and procedure
  BUT: Advisable to try two or three models, as the number of respondents contributing to each analysis may not be the same between Minitab and SPSS

Equivalent SPSS procedures
  Choose procedure by selecting appropriate option in drop-down menu
  “Enter” procedure:
    Adds all regressors to model simultaneously
    Calculates F-ratio and R²adj for all regressors
  “Stepwise” procedure:
    Adds regressors one at a time
    Calculates F-ratio and R²adj for each set of regressors
    Considers taking regressors out at each stage

Missing values
  Frequently have values missing from data set
    missed out questions
    couldn't understand question
    couldn't collect data for some reason
  Must specify missing values in SPSS in 'Define Variable' window
  Differences in R²adj or F-ratio values are most likely to be due to missing values
    Leads to different “n” in each analysis

Model checking
  Residuals (general)
  Unusual observations - “outliers”

Model checking - Residuals
  Predicted value for “y” (dependent variable): y = b1x1 + b2x2 + … + a
  Actual (observed) value for “y”
  Residual = actual (observed) value minus predicted (calculated) value

Model checking - Residuals
  [Figure: two scatterplots of Symptom Index against dose in mg. Drug A: good fit, low residuals. Drug B: moderate fit, larger residuals]

Model checking - Residuals
  Residuals should be:
    Normally distributed
      some big, some small, most average-sized
    Independent of one another
      no constant covariation with one another
    Almost identical in terms of variance
      regardless of the values of the IVs or DVs
  These things are easy to check with SPSS 'plots' option

Model checking - Unusual observations
  Outliers
  Linear regression would work quite well for this data, except for the presence of three outlier points
  [Figure: scatterplot of EXAM against ANXIETY]

Dealing with outliers
  Run regression analysis
  Plot data on a scattergram
  Remove outliers by deleting the rows in SPSS
  Run regression analysis again
  Note any qualitative differences:
    if there are qualitative differences, then check data. If no errors, report both analyses
    if only quantitative differences, then leave outliers in analysis, noting their presence

Justification: Removing outliers
  Plotting data may indicate that some participants belong to a separate sub-sample
  Eg: people with an exam phobia?
  [Figure: scatterplot of EXAM against ANXIETY]

Residuals
  DV vs IV
  Differences between actual and predicted values (ie: residual values) should show a normal distribution
    Some large positive
    Some large negative
    But mostly small (positive or negative), or zero
  ie: Normally distributed
  [Figure: scatterplot of EXAM against ANXIETY]

Residuals
  DV vs IV
  If our best-fit line does not fit too well, this will be revealed in the distribution of the residuals
  [Figure: scatterplot of EXAM against ANXIETY]

Questions?
  Final assignment due Friday midday
  Next week: Alex Haslam's “Uncertainty Management”

Thank you and goodnight!
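
Worked check of the lecture's formulas
  The sketch below is an illustration rather than part of the SPSS or Minitab workflow described in the lecture. It recomputes, in plain Python, the adjusted R-square, the F-ratio and a predicted score from the figures quoted on the slides. The function names are hypothetical helpers. The sample sizes are inferred from the degrees of freedom (Total df = 15 gives N = 16 for the DEPRESS model; F(3,33) gives N = 33 + 3 + 1 = 37 for the IQ model). The times of day and the observed score used at the end are made-up values for illustration only.

```python
def adjusted_r2(r2, n_points, n_regressors):
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1), as given in the lecture."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)


def f_ratio(ms_regression, ms_residual):
    """F is the regression mean square divided by the residual mean square."""
    return ms_regression / ms_residual


def predict(coefficients, values, constant):
    """Predicted y = b1*x1 + b2*x2 + ... + a."""
    return sum(b * values[name] for name, b in coefficients.items()) + constant


# DEPRESS model: R Square = .520, 3 regressors, N = 16 (inferred from Total df = 15)
depress_adj = adjusted_r2(0.520, 16, 3)

# ANOVA table: Mean Square 1355.129 (Regression) and 313.337 (Residual)
depress_f = f_ratio(1355.129, 313.337)

# IQ model: R2 = 21.7%, 3 regressors, N = 37 (inferred from F(3,33))
iq_adj = adjusted_r2(0.217, 37, 3)

# Video game example: score = (844.57 x time) - 4239.6
# The times 10 and 11 are hypothetical, chosen one hour apart.
game_b = {"time": 844.57}
score_at_10 = predict(game_b, {"time": 10}, -4239.6)
score_at_11 = predict(game_b, {"time": 11}, -4239.6)
one_hour_gain = score_at_11 - score_at_10

# Residual = observed minus predicted; the observed score is hypothetical.
observed_score = 4500.0
residual_at_10 = observed_score - score_at_10

print("DEPRESS adjusted R2:", round(depress_adj, 3))
print("DEPRESS F-ratio:", round(depress_f, 3))
print("IQ adjusted R2:", round(iq_adj, 3))
print("Predicted gain per extra hour:", round(one_hour_gain, 2))
print("Residual at time = 10:", round(residual_at_10, 1))
```

  The recomputed values should land close to the SPSS figures quoted in the lecture (.399, 4.325 and 14.6%); small differences arise because the slides show a rounded R Square.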