An Introduction to Statistics and SPSS - PowerPoint

					           PSYM021
Introduction to Methods & Statistics


Week Five: Statistical techniques III

            Cris Burgess
                      Regression

   Web support

   Simple regression – a reminder

   Multiple regression – an introduction

   Reporting regression analyses

   Choosing regressors (predictor variables)

   Choosing a regression model

   Model checking - residuals
                  Simple Regression

   Establish equation for the best-fit line:
                           y = bx + a


   “Best-fit” line same as “Regression” line
   b is the “regression coefficient” for x
   x is the “predictor” or “regressor” variable for y
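
A minimal sketch of how the best-fit line can be computed, using the standard least-squares formulas for b and a. The function name fit_line and the four data points are made up for illustration; they are not from the lecture.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # regression coefficient for x
    a = mean_y - b * mean_x  # constant (intercept)
    return b, a

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b, a = fit_line(xs, ys)
print(f"y = {b:.2f}x + {a:.2f}")
```

With real data the points will not lie exactly on the line, and b and a are the values that minimise the squared vertical distances from the points to the line.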
            Multiple Regression

   Establish equation for the best-fit line:
              y = b1x1 + b2x2 + b3x3 + a


    Where:
    b1 = regression coefficient for variable x1
    b2 = regression coefficient for variable x2
    b3 = regression coefficient for variable x3
    a = constant
                      Multiple Regression
                        R2 - “Goodness of fit”
                              Model Summary

           Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
           1       .721 a   .520       .399                17.70134
             a. Predictors: (Constant), AGE, GENDER, INCOME


   For multiple regression, R2 never decreases (and nearly always gets
    larger) when another independent variable (regressor/predictor) is
    added to the model
       Add “work stress” to model ?
   New regressor may only provide a tiny improvement in amount
    of variance in the data explained by the model
   Need to establish the ‘added value’ of each additional regressor
    in predicting the DV
                  Multiple Regression
                R2adj - “adjusted R-square”

   Takes into account the number of regressors in the model
   Calculated as:
                R2adj = 1 - (1-R2)(N-1)/(N-n-1)
    where:
                N = number of data points
                n = number of regressors
   You don’t need to memorise this equation, but…
   Note that R2adj will always be smaller than R2
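
As a rough check, the formula can be applied to the Model Summary table shown earlier (R Square = .520, three regressors). The value N = 16 is an assumption read off the ANOVA table for the same model, where the total df is 15. The function name adjusted_r2 is illustrative only.

```python
def adjusted_r2(r2, n_points, n_regressors):
    """R2adj = 1 - (1 - R2)(N - 1)/(N - n - 1)."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)

# R2 = .520 with N = 16 data points and n = 3 regressors.
r2_adj = adjusted_r2(0.520, 16, 3)
print(round(r2_adj, 3))
```

The result is close to the .399 reported by SPSS; the small difference arises because R Square is printed to only three decimal places.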
    How well does a model explain the variation in the
                  dependent variable?

   “Effectiveness” vs “Efficiency”
   Effectiveness:
    maximises R2
    ie: maximises proportion of variance explained by model
   Efficiency:
    maximises increase in R2adj upon adding another regressor
    ie: if new regressor doesn’t add much to the variance explained,
        it is not worth adding
    How well does a model explain the variation in the
                  dependent variable?

    Effectiveness (R2 and R2adj)
     0 - 25%           very poor and likely to be unacceptable
     25 - 50%          poor, but may be acceptable
     50 - 75%          good
     75 - 90%          very good
     90% +             likely that there is something wrong with
                       your analysis
        Are the regressors, taken together, significantly
           associated with the dependent variable?
                                            ANOVAb

         Model              Sum of Squares   df   Mean Square   F       Sig.
         1   Regression     4065.388          3   1355.129      4.325   .028 a
             Residual       3760.050         12    313.337
             Total          7825.438         15
           a. Predictors: (Constant), AGE, GENDER, INCOME
           b. Dependent Variable: DEPRESS



   Analysis of Variance test checks to see if model, as a whole, has a
    significant relationship with the DV
   Part of the predictive ‘value’ of each regressor may be shared by
    one or more of the other regressors in the model, so the model must
    be considered as a whole (i.e. all regressors/IVs together)
   Read off ANOVA table in SPSS output, and report as you did in
    week 3/4 assignments
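
The entries in the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the sums of squares and df shown above. The variable names are illustrative; the numbers are taken from the table.

```python
# Values from the ANOVA table (dependent variable: DEPRESS).
ss_regression, df_regression = 4065.388, 3
ss_residual, df_residual = 3760.050, 12

# Mean Square = Sum of Squares / df.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_ratio = ms_regression / ms_residual

# R2 is the share of the total sum of squares explained by the model.
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"F({df_regression},{df_residual}) = {f_ratio:.3f}")
print(f"R2 = {r_squared:.3f}")
```

The computed F and R2 agree, after rounding, with the 4.325 in the ANOVA table and the .520 in the Model Summary.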
     What relationship does each individual regressor
            have with the dependent variable?
                                              Coefficientsa

                                  Unstandardized         Standardized
                                    Coefficients          Coefficients
         Model                    B         Std. Error       Beta          t        Sig.
         1       (Constant)      68.285       15.444                      4.421       .001
                 INCOME       -9.34E-02          .029            -.682   -3.178       .008
                 GENDER           3.306         8.942             .075       .370     .718
                 AGE               -.162         .344            -.101     -.470      .646
           a. Dependent Variable: DEPRESS



   SPSS output table entitled Coefficients
   Column headed Unstandardised coefficients - B
   Gives regression coefficient for each regressor variable (IV)
   “With all the other variables held constant”
   Units of coefficient are units of the DV per one unit of the
    regressor (IV)
    What relationship does each individual regressor
           have with the dependent variable?

   Units of coefficient are units of the DV per one unit of the regressor
    eg: dependent variable: score on video game (in points)
        regressor: time of day (in hours)
        B coefficient for time = 844.57 (points per hour)
                score = (B coefficient x time) + constant
                score = (844.57 x time) – 4239.6
   This means that for every increase of one hour in the variable
    time, we would predict that a person’s score will increase by
    844.57 points
    What relationship does each individual regressor
           have with the dependent variable?
        dependent variable: score on video game
        regressor: gender
   Gender coded so that: 1 = male, 2 = female
    Let B coefficient for gender = 100.00
    So,        score = 100.00 gender + constant
   Adding “1” to the variable gender means that we go from
    male to female
   This means that females would be expected to score 100.00
    points more than males
   Remember that the B coefficient is calculated on the basis
    that 1=male and 2=female (different coding will give a
    different coefficient)
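
The effect of the coding scheme can be seen in a small sketch. The scores below are invented so that every female scores exactly 100 points more than every male; fit_line is an illustrative helper using the standard least-squares formulas, not an SPSS procedure.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

# Hypothetical data: males score 600, females score 700.
sexes = ["m", "m", "m", "f", "f", "f"]
scores = [600, 600, 600, 700, 700, 700]

codings = {
    "1=male, 2=female": {"m": 1, "f": 2},
    "0=male, 1=female": {"m": 0, "f": 1},
    "1=female, 2=male": {"m": 2, "f": 1},
}

results = {}
for label, code in codings.items():
    gender = [code[s] for s in sexes]
    results[label] = fit_line(gender, scores)
    b, a = results[label]
    print(f"{label}: score = {b:.2f} gender + {a:.2f}")
```

In this example, shifting the codes (0/1 instead of 1/2) changes only the constant, whereas reversing them flips the sign of B. Codes that are not one unit apart would also change the size of B.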
Which regressor has the most effect on the dependent
                     variable?

   Units for each regression coefficient are different, so we
    must standardise them if we want to compare one with
    another
   Column headed Standardised coefficients - Beta
   Can compare the Beta weights for each regressor variable
    to compare effects of each on the dependent variable
   Larger Beta weight indicates stronger effect of regressor
    on values of DV
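
A sketch of why standardising helps, assuming the usual definition Beta = B x (standard deviation of regressor) / (standard deviation of DV). The data are hypothetical, and a single regressor is used to keep the arithmetic short. Measuring time in minutes rather than hours changes B but leaves Beta unchanged.

```python
from statistics import mean, stdev

def slope(xs, ys):
    """Unstandardised least-squares coefficient B for one regressor."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: time (hours) and video game score (points).
hours = [1, 2, 3, 4, 5, 6]
scores = [10, 25, 27, 41, 48, 62]
minutes = [h * 60 for h in hours]  # same regressor, different units

b_hours = slope(hours, scores)
b_minutes = slope(minutes, scores)

# Beta rescales B by the spread of the regressor and of the DV.
beta_hours = b_hours * stdev(hours) / stdev(scores)
beta_minutes = b_minutes * stdev(minutes) / stdev(scores)

print(f"B (per hour)   = {b_hours:.4f}, Beta = {beta_hours:.4f}")
print(f"B (per minute) = {b_minutes:.4f}, Beta = {beta_minutes:.4f}")
```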
    Are the relationships of each regressor with the
     dependent variable statistically significant?


   Assessed using a t-test
   Check values in columns headed t and Sig.
   If regression coefficient is negative, then t-value will also
    be negative (it does not matter about the sign, it is the size
    of t that is important)
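
Each t-value is the coefficient divided by its standard error. The sketch below recomputes them from the B and Std. Error columns of the Coefficients table shown earlier; the dictionary layout is illustrative only.

```python
# (B, Std. Error) pairs copied from the Coefficients table.
coefficients = {
    "(Constant)": (68.285, 15.444),
    "INCOME": (-9.34e-02, 0.029),
    "GENDER": (3.306, 8.942),
    "AGE": (-0.162, 0.344),
}

# t = B / Std. Error for every row.
t_values = {name: b / se for name, (b, se) in coefficients.items()}
for name, t in t_values.items():
    print(f"{name:12s} t = {t:7.3f}")

# Ignore the sign and the constant when asking which regressor has the
# largest t-value.
regressors = [name for name in t_values if name != "(Constant)"]
strongest = max(regressors, key=lambda name: abs(t_values[name]))
print("Largest |t| among regressors:", strongest)
```

The recomputed values match the table closely for GENDER, AGE and the constant. INCOME comes out slightly different from the printed -3.178, because its B and Std. Error are displayed to only a few significant figures.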
          Reporting regression analyses

   How should I report a regression analysis?
              Reporting Regression analyses

   Describe the characteristics of the model before you describe
    the significance of the relationship
   So:
    1. R2, R2adj - how well does the model fit the data?
    2. Fm,n      - is the relationship significant?
    3. Regression equation     - how to calculate values of
                               DV from known values of IVs?
    4. Describe results in plain English
             Reporting Regression analyses

We want to predict IQ score
using brain size (MRI), height and gender as regressors


   Units:
       IQ: IQ points
       brain size (MRI): pixels
       height: centimetres
       gender: 0 = male, 1 = female
    Reporting Regression analyses (1)




    SPSS output tells us that:
         R2 = 21.7%     R2adj = 14.6%
    Reporting Regression analyses (2)




   SPSS output tells us that:
                F 3,33 = 3.051, p < 0.05
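
The F-ratio and R2 are two views of the same fit. A sketch using the standard identity F = (R2 / n) / ((1 - R2) / df residual), which is not stated on the slides, approximately recovers the reported F from R2 = 21.7% with 3 regressors and 33 residual df. The function name is illustrative.

```python
def f_from_r2(r2, n_regressors, df_residual):
    """Overall F-ratio implied by R2 for a model with a constant."""
    explained_per_df = r2 / n_regressors
    unexplained_per_df = (1 - r2) / df_residual
    return explained_per_df / unexplained_per_df

f_iq = f_from_r2(0.217, 3, 33)
print(f"F(3,33) = {f_iq:.2f}")
```

The result agrees with the SPSS value of 3.051 to within the rounding of R2.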
           Reporting Regression analyses (3)




Regression equation:
               y = b1x1 + b2x2 + b3x3 + a
IQ = (1.824 x 10^-4) MRI – 0.316 height + 2.426 gender + (-6.411)
    = 0.0001824 MRI – 0.316 height + 2.426 gender + (-6.411)
    = 0.0002 MRI – 0.316 height + 2.426 gender + (-6.411)
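
A sketch of using the regression equation to calculate a predicted IQ. The person described (900,000 pixels, 170 cm, female) is hypothetical, and predict_iq is an illustrative helper. It also shows that the rounded MRI coefficient, which is fine for describing the effect, can shift a prediction noticeably because MRI values are so large.

```python
def predict_iq(mri_pixels, height_cm, gender, mri_coefficient):
    """Predicted IQ from the fitted equation (gender: 0 = male, 1 = female)."""
    return (mri_coefficient * mri_pixels
            - 0.316 * height_cm
            + 2.426 * gender
            - 6.411)

# Hypothetical person, chosen only to illustrate the calculation.
mri_pixels, height_cm, gender = 900_000, 170, 1

iq_full = predict_iq(mri_pixels, height_cm, gender, 1.824e-4)
iq_rounded = predict_iq(mri_pixels, height_cm, gender, 0.0002)

print(f"Using 0.0001824: {iq_full:.1f}")
print(f"Using 0.0002:    {iq_rounded:.1f}")
```

For calculating predicted values it is therefore safer to keep the unrounded coefficient.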
           Reporting Regression analyses (4)

   “The regression was a poor fit, describing only 21.7% of the
    variance in IQ (R2adj= 14.6%), but the overall relationship was
    statistically significant (F3,33= 3.05, p<0.05).”
   “With other variables held constant, IQ scores were negatively
    related to height, decreasing by 0.32 IQ points for every extra
    centimetre in height, and positively related to brain size,
    increasing by 0.0002 IQ points for every extra pixel of the
    scan. Women tended to have higher scores than men, by 2.43
    IQ points. However, the effect of brain size (MRI) was the only
    significant effect (t33=2.75, p=0.01)”
                       Break
   Five minutes – please be back promptly
                      Selecting Regressors


   What do we want of a regressor?
       To have ‘a significant effect’ on the dependent variable
       Ability to ‘discriminate’ between values of the dependent
        variable
                                            Selecting Regressors
     How well do potential regressors predict the Dependent Variable?

   Dichotomous variable (eg: gender)
   Compare using t-test
   If significant, then possible regressor predicts differences in
    dependent variable
   [Figure: bar chart of the dependent variable for Male and Female;
    x-axis: possible regressor (gender)]
                                             Selecting Regressors
     How well do potential regressors predict the Dependent Variable?

   Continuous variable (eg: Height)
   Compare using correlation
   If significant, then possible regressor predicts differences in
    dependent variable
   [Figure: scatterplot of the dependent variable against the possible
    regressor (height)]
                     Selecting Regressors

   Some of ‘discriminatory value’ in regressor may be accounted
    for by regressors present in model already
       gender, income, height
       age, experience, value of property
   ‘In the presence of all regressors’
       Adding regressor may not add as much to model’s
        predictive value as you might have anticipated
          What makes the best model?

   Same number of regressors
       Choose model with highest value of R2adj
       This gives ‘best value’ per regressor
       Will also have the highest value of R2 and F
   Different number of regressors
       Highest value of R2adj (more regressors)
       Highest value of F (fewer regressors)
                  Efficiency vs Effectiveness

   Effective: highest R2 (‘most complete’)
       will have more regressors
       will be effective, but not efficient
   Efficient: highest F-ratio (‘most significant’)
       will have fewer regressors
       will be efficient, but not particularly effective
   Compromise: largest increase in R2adj (best of both worlds)
       will contain only the ‘best’ regressors available
       manageable number of regressors and reasonably effective
               Minitab’s BREG command

   Tries every possible combination of available regressors
    (up to maximum of 20)
       eg: 20 regressors give over 1,000,000 different models
   Command:
       Dependent variable is in column 10
       Independent variables in columns 1 to 6
       BREG C10 C1-C6
   Will not be required to carry out this type of analysis in
    exam, but you need to be able to interpret output
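
The figure of over 1,000,000 models for 20 regressors can be checked by counting every non-empty subset of the regressors. This is only a counting sketch in Python, not a reproduction of what BREG does internally; the three regressor names in the small example are hypothetical.

```python
from itertools import combinations
from math import comb

def count_models(n_regressors):
    """Number of models containing at least one of n available regressors."""
    return sum(comb(n_regressors, k) for k in range(1, n_regressors + 1))

# Small case written out in full: three regressors give seven models.
names = ["age", "gender", "income"]
all_models = [subset
              for k in range(1, len(names) + 1)
              for subset in combinations(names, k)]
for model in all_models:
    print(model)

print("Models for 20 regressors:", count_models(20))
```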
               Sample of BREG output
MTB > BREG C13 C1-C12
Best Subsets Regression
Response is prodebt
304 cases used 160 cases contain missing values.

Regressors (columns C1-C12, numbered left to right):
   1 incomegp    2 house      3 children   4 singpar
   5 agegp       6 bankacc    7 bsocacc    8 manage
   9 ccarduse   10 cigbuy    11 xmasbuy   12 locintrn

              Adj.
Vars   R-Sq   R-Sq   C-p         s   1   2   3   4   5   6   7   8   9  10  11  12
   7   19.3   17.4   7.3   0.65539   X               X           X   X   X   X   X
   7   19.1   17.2   7.8   0.65602   X               X       X   X   X       X   X
   8   19.9   17.7   6.9   0.65388   X               X       X   X   X   X   X   X
   8   19.5   17.4   8.2   0.65536   X       X       X           X   X   X   X   X
   9   20.2   17.8   7.8   0.65375   X       X       X       X   X   X   X   X   X
   9   20.1   17.6   8.3   0.65434   X   X           X       X   X   X   X   X   X
  10   20.4   17.6   9.3   0.65427   X   X   X       X       X   X   X   X   X   X

                       BREG output

   Best two models for each possible number of regressors
    are displayed in output
   Compare R2adj values directly
   Select best model(s)
   Run normal regression in SPSS for each selected model
   Compare F-ratio values
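
The comparison of R2adj values can be sketched by copying the rows of the sample BREG output into a list and ranking them. Shortlisting the top three models is an arbitrary choice made here for illustration.

```python
# (number of regressors, R-Sq, adjusted R-Sq, C-p, s) from the sample output.
breg_rows = [
    (7, 19.3, 17.4, 7.3, 0.65539),
    (7, 19.1, 17.2, 7.8, 0.65602),
    (8, 19.9, 17.7, 6.9, 0.65388),
    (8, 19.5, 17.4, 8.2, 0.65536),
    (9, 20.2, 17.8, 7.8, 0.65375),
    (9, 20.1, 17.6, 8.3, 0.65434),
    (10, 20.4, 17.6, 9.3, 0.65427),
]

# Rank by adjusted R-Sq, highest first (ties keep their original order).
ranked = sorted(breg_rows, key=lambda row: row[2], reverse=True)
shortlist = ranked[:3]

for n_vars, r_sq, adj_r_sq, c_p, s in shortlist:
    print(f"{n_vars} regressors: R-Sq = {r_sq}, adjusted R-Sq = {adj_r_sq}")
```

Note that the 10-regressor model has the highest R-Sq but not the highest adjusted R-Sq, which is the efficiency versus effectiveness trade-off in action.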
            Best Subset Regression model

   Identify best subset of regressors from BREG output
   Must run ordinary regression procedure
       calculates F-ratio
       calculates individual coefficients and significance
   Models with the highest R2adj values should also give significant
    F-ratios
       if F-ratio not significant, check data and procedure
   BUT: Advisable to try two or three models, as the
    number of respondents contributing to each analysis
    may not be the same between Minitab and SPSS
                Equivalent SPSS procedures
   Choose procedure by selecting appropriate tab in drop-down
    menu
   “Enter” procedure:
       Adds all regressors to model simultaneously
       Calculates F-ratio and R2adj for all regressors
   “Stepwise” procedure:
       Adds regressors one at a time
       Calculates F-ratio and R2adj for each set of regressors
       considers taking regressors out at each stage
                         Missing values

   Frequently have values missing from data set
       missed out questions
       couldn’t understand question
       couldn’t collect data for some reason
   Must specify missing values in SPSS in ‘Define Variable’
    window
   Differences in R2adj or F-ratio values are most likely to be due
    to missing values
   Leads to different “n” in each analysis
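
A sketch of how missing values change the number of cases in each analysis, assuming that any respondent with a missing value on a variable in the model is dropped from that model (listwise deletion). The data and variable names are invented.

```python
# Hypothetical respondents; None marks a missing value.
rows = [
    {"depress": 40, "income": 300, "age": 25, "stress": 7},
    {"depress": 55, "income": None, "age": 31, "stress": 9},
    {"depress": 38, "income": 450, "age": 47, "stress": None},
    {"depress": 61, "income": 210, "age": 52, "stress": 8},
    {"depress": 47, "income": 380, "age": None, "stress": 5},
]

def complete_cases(rows, variables):
    """Keep only rows with a value for every variable in the model."""
    return [row for row in rows
            if all(row[v] is not None for v in variables)]

n_small = len(complete_cases(rows, ["depress", "income", "age"]))
n_large = len(complete_cases(rows, ["depress", "income", "age", "stress"]))

print("Cases used with income and age:", n_small)
print("Cases used after adding stress:", n_large)
```

Because the two models are fitted to different sets of respondents, their R2adj and F-ratio values are not strictly comparable.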
                   Model checking

   Residuals (general)

   Unusual observations – “outliers”
               Model checking - Residuals


   Predicted value for “y” (dependent variable)
                y = b1x1 + b2x2 + … + a

   Actual (observed) value for “y”

Residual = actual (observed) value minus predicted (calculated) value
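
A sketch of the residual calculation for a model with two regressors. The coefficients, constant and cases are all hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def predict(xs, coefficients, constant):
    """Predicted y = b1*x1 + b2*x2 + ... + a."""
    return sum(b * x for b, x in zip(coefficients, xs)) + constant

# Hypothetical fitted model: y = 2*x1 - 1*x2 + 10.
coefficients = [2.0, -1.0]
constant = 10.0

# Each case: (values of x1 and x2, observed y).
cases = [
    ((3, 4), 15.0),
    ((1, 2), 8.0),
    ((5, 0), 20.0),
]

# Residual = observed value minus predicted value.
residuals = [observed - predict(xs, coefficients, constant)
             for xs, observed in cases]
print(residuals)
```

A positive residual means the model under-predicted that case, a negative one means it over-predicted, and zero means the prediction was exact.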
                                           Model checking - Residuals

   [Figure: two scatterplots of Symptom Index against dose in mg
    Drug A (dose in mg): good fit, low residuals
    Drug B (dose in mg): moderate fit, larger residuals]
                Model checking - Residuals

Residuals should be:
   Normally distributed
        some big, some small, most average-sized
   Independent of one another
        no constant covariation with one another
   almost identical in terms of variance
        regardless of the values of the IVs or DVs


    These things are easy to check with SPSS ‘plots’ option
          Model checking - Unusual observations


   Outliers
    Linear regression would work quite well for this data, except for
    the presence of three outlier points
   [Figure: scatterplot of EXAM against ANXIETY]
                    Dealing with outliers
   Run regression analysis
   Plot data on a scattergram
   Remove outliers by deleting the rows in SPSS
   Run regression analysis again
   Note any qualitative differences:
        if there are qualitative differences, then check data. If no
        errors, report both analyses
        if only quantitative differences, then leave outliers in
        analysis, noting their presence
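
The steps above can be sketched in Python, with the regression run once on all the data and once without the suspected outliers. The anxiety and exam scores are invented, the three outlier rows are assumed to have been spotted by eye on a scattergram, and a change in the sign of the slope is used here as one simple example of a qualitative difference.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

# Hypothetical data; the last three rows are the suspected outliers.
anxiety = [2, 4, 6, 8, 10, 12, 14, 18, 19, 20]
exam = [64, 58, 52, 46, 40, 34, 28, 75, 78, 72]
outlier_rows = {7, 8, 9}

# Regression on all data.
slope_all, constant_all = fit_line(anxiety, exam)

# Regression again with the outlier rows removed.
kept = [(x, y) for i, (x, y) in enumerate(zip(anxiety, exam))
        if i not in outlier_rows]
slope_kept, constant_kept = fit_line([x for x, y in kept],
                                     [y for x, y in kept])

# Does the direction of the relationship change?
qualitative_difference = (slope_all > 0) != (slope_kept > 0)

print(f"All data:         slope = {slope_all:.2f}")
print(f"Outliers removed: slope = {slope_kept:.2f}")
print("Qualitative difference:", qualitative_difference)
```

In this invented example the direction of the relationship reverses once the outliers are removed, so the data would need checking and, if no errors were found, both analyses would be reported.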
                          Justification


   Removing outliers
    Plotting data may indicate that some participants belong to a
    separate sub-sample.
    Eg: people with an exam phobia?
   [Figure: scatterplot of EXAM against ANXIETY]
                           Residuals
   DV vs IV
    Differences between actual and predicted values (ie: residual
    values) should show a normal distribution
    Some large positive
    Some large negative
    But mostly small (positive or negative), or zero
    ie: Normally distributed
   [Figure: scatterplot of EXAM against ANXIETY]
                           Residuals

   DV vs IV
    If our best-fit line does not fit too well, this will be revealed
    in the distribution of the residuals
   [Figure: scatterplot of EXAM against ANXIETY]
                     Questions ?

   Final assignment due in Friday midday

   Next week: Alex Haslam’s “Uncertainty Management”

   Thank you and goodnight !

				