
Theory of Regression




                       1
               The Course
• 16 (or so) lessons
  – Some flexibility
     • Depends how we feel
     • What we get through




                             2
Part I: Theory of Regression
1. Models in statistics
2. Models with more than one parameter:
   regression
3. Why regression?
4. Samples to populations
5. Introducing multiple regression
6. More on multiple regression




                                          3
Part 2: Application of regression
7.    Categorical predictor variables
8.    Assumptions in regression analysis
9.    Issues in regression analysis
10.   Non-linear regression
11.   Moderators (interactions) in regression
12.   Mediation and path analysis
Part 3: Advanced Types of Regression
13.   Logistic Regression
14.   Poisson Regression
15.   Introducing SEM
16.   Introducing longitudinal multilevel models
                                                   4
               House Rules
• Jeremy must remember
  – Not to talk too fast
• If you don't understand
  – Ask
  – Any time
• If you think I'm wrong
  – Ask. (I'm not always right)

                                  5
   Learning New Techniques
• Best kind of data to learn a new technique
  – Data that you know well, and understand
• Your own data
  – In computer labs (esp later on)
  – Use your own data if you like
• My data
  – I'll provide you with
  – Simple examples, small sample sizes
     • Conceptually simple (even silly)
                                               6
          Computer Programs
• SPSS
    – Mostly
• Excel
    – For calculations
•   GPower
•   Stata (if you like)
•   R (because it's flexible and free)
•   Mplus (SEM, ML?)
•   AMOS (if you like)
                                         7
Lesson 1: Models in statistics

  Models, parsimony, error, mean,
          OLS estimators



                                    10
What is a Model?




                   11
          What is a model?
• Representation
  – Of reality
  – Not reality
• Model aeroplane represents a real
  aeroplane
  – If model aeroplane = real aeroplane, it
    isn't a model

                                              12
• Statistics is about modelling
  – Representing and simplifying
• Sifting
  – What is important from what is not
    important
• Parsimony
  – In statistical models we seek parsimony
  – Parsimony  simplicity



                                              13
        Parsimony in Science
• A model should be:
   – 1: able to explain a lot
   – 2: use as few concepts as possible
• More it explains
   – The more you get
• Fewer concepts
   – The lower the price
• Is it worth paying a higher price for a better
  model?

                                                   14
            A Simple Model

• Height of five individuals
  – 1.40m
  – 1.55m
  – 1.80m
  – 1.62m
  – 1.63m
• These are our DATA
                               15
          A Little Notation

$Y$ — the (vector of) data that we are modelling

$Y_i$ — the $i$th observation in our data

$Y = (4, 5, 6, 7, 8)$
$Y_2 = 5$
                                            16
Greek letters represent the true
value in the population.

$\beta$ — (Beta) Parameters in our model (population value)

$\beta_0$ — the value of the first parameter of our model in the population

$\beta_j$ — the value of the $j$th parameter of our model, in the population

$\epsilon$ — (Epsilon) The error in the population model
                                                 17
Normal letters represent the values in our
sample. These are sample statistics, which are
used to estimate population parameters.

$b$ — a parameter in our model (sample statistic)

$e$ — the error in our sample

$Y$ — the data in our sample which we are trying to model




                                                 18
Symbols on top change the meaning.

$Y$ — the data in our sample which we are trying to model (repeated)

$\hat{Y}_i$ — the estimated value of $Y$, for the $i$th case

$\bar{Y}$ — the mean of $Y$
                                                 19
So $b_1 = \hat{\beta}_1$

I will use $b_1$ (because it is easier to type)
                                                 20
• Not always that simple
  – some texts and computer programs use

  b = the parameter estimate (as we have
    used)
  β (beta) = the standardised parameter
    estimate
  SPSS does this.




                                           21
A capital letter is the set (vector) of
parameters/statistics


B       Set of all parameters (b0, b1, b2, b3 … bp)

Rules are not used very consistently (even by
me).
Don't assume you know what someone means,
without checking.




                                                      22
• We want a model
  – To represent those data
• Model 1:
  – 1.40m, 1.55m, 1.80m, 1.62m, 1.63m
  – Not a model
     • A copy
  – VERY unparsimonious
• Data: 5 statistics
• Model: 5 statistics
  – No improvement
                                        23
• Model 2:
  – The mean (arithmetic mean)
  – A one parameter model

$$\hat{Y}_i = b_0 = \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
                                     24
• Which, because we are lazy, can be
  written as


$$\bar{Y} = \frac{\sum Y}{n}$$

                                       25
The Mean as a Model




                      26
      The (Arithmetic) Mean
• We all know the mean
  – The 'average'
  – Learned about it at school
  – Forget (didn't know) about how clever the mean is
• The mean is:
  – An Ordinary Least Squares (OLS) estimator
  – Best Linear Unbiased Estimator (BLUE)



                                                   27
     Mean as OLS Estimator
• Going back a step or two
• MODEL was a representation of DATA
  – We said we want a model that explains a lot
  – How much does a model explain?
              DATA = MODEL + ERROR
              ERROR = DATA - MODEL
  – We want a model with as little ERROR as possible



                                                   28
• What is error?

       Data (Y)     Model (b0 = mean)   Error (e)
        1.40              1.60           -0.20
        1.55              1.60           -0.05
        1.80              1.60            0.20
        1.62              1.60            0.02
        1.63              1.60            0.03


                                            29
• How can we calculate the 'amount' of
  error?
• Sum of errors

$$\text{ERROR} = \sum e_i = \sum (Y_i - \hat{Y}) = \sum (Y_i - b_0)$$
$$= -0.20 - 0.05 + 0.20 + 0.02 + 0.03 = 0$$
                                         30
– 0 implies no ERROR
  • Not the case
– Knowledge about ERROR is useful
  • As we shall see later




                                    31
• Sum of absolute errors
  – Ignore signs

$$\text{ERROR} = \sum |e_i| = \sum |Y_i - \hat{Y}| = \sum |Y_i - b_0|$$
$$= 0.20 + 0.05 + 0.20 + 0.02 + 0.03 = 0.50$$
                                      32
• Are small and large errors equivalent?
   – One error of 4
   – Four errors of 1
      – The same?
– What happens with different data?
• Y = (2, 2, 5)
   – b0 = 2
   – Not very representative
• Y = (2, 2, 4, 4)
   – b0 = any value from 2 to 4
   – Indeterminate
      • There are an infinite number of solutions which would satisfy
        our criteria for minimum error
                                                               33
• Sum of squared errors (SSE)

$$\text{ERROR} = \sum e_i^2 = \sum (Y_i - \hat{Y})^2 = \sum (Y_i - b_0)^2$$
$$= (-0.20)^2 + (-0.05)^2 + 0.20^2 + 0.02^2 + 0.03^2 = 0.08$$

                                       34
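A quick check of all three error definitions in R (the course's free tool), using the five heights and the mean as the one-parameter model:

    y  <- c(1.40, 1.55, 1.80, 1.62, 1.63)   # the height DATA
    b0 <- mean(y)                           # one-parameter MODEL: the mean (1.60)
    e  <- y - b0                            # ERROR = DATA - MODEL

    sum(e)       # sum of errors: 0
    sum(abs(e))  # sum of absolute errors: 0.50
    sum(e^2)     # sum of squared errors (SSE): ~0.08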
• Determinate
  – Always gives one answer
• If we minimise SSE
  – Get the mean
• Shown in graph
  – SSE plotted against b0
  – Min value of SSE occurs when
  – b0 = mean



                                   35
[Figure: SSE plotted against b0. The curve reaches its minimum, SSE = 0.08, when b0 = 1.60, the mean.]
                                                                      36
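The same point can be made numerically: hand SSE to an optimiser and it returns the mean. A minimal sketch:

    y   <- c(1.40, 1.55, 1.80, 1.62, 1.63)
    sse <- function(b0) sum((y - b0)^2)

    optimize(sse, interval = c(0, 3))$minimum  # ~1.60: the b0 minimising SSE
    mean(y)                                    # 1.60 - the same value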
The Mean as an OLS Estimate




                          37
     Mean as OLS Estimate
• The mean is an Ordinary Least Squares
  (OLS) estimate
  – As are lots of other things
• This is exciting because
  – OLS estimators are BLUE
  – Best Linear Unbiased Estimators
  – Proven with Gauss-Markov Theorem
      • Which we won't worry about

                                       38
           BLUE Estimators
• Best
  – Minimum variance (of all possible unbiased
    estimators)
  – Narrower distribution than other estimators
     • e.g. median, mode
• Linear
  – Linear predictions
  – For the mean: $\hat{Y} = \bar{Y}$
  – A linear (straight, flat) line
                                              39
• Unbiased
  – Centred around true (population) values
  – Expected value = population value
  – Minimum is biased.
     • Minimum in samples > minimum in population
• Estimators
  – Errrmm… they are estimators
• Also consistent
  – Sample approaches infinity, get closer to
    population values
  – Variance shrinks

                                                    40
     SSE and the Standard
          Deviation
• Tying up a loose end

$$SSE = \sum (Y_i - \hat{Y})^2$$

$$s = \sqrt{\frac{\sum (Y_i - \hat{Y})^2}{n}}$$

$$\hat{\sigma} = \sqrt{\frac{\sum (Y_i - \hat{Y})^2}{n - 1}}$$
                            41
• SSE closely related to SD
• Sample standard deviation – s
  – Biased estimator of population SD
• Population standard deviation – σ
  – Need to know the mean to calculate SD
     • Reduces N by 1
     • Hence divide by N – 1, not N
  – Like losing one df



                                            42
                    Proof
• That the mean minimises SSE
  – Not that difficult
  – As statistical proofs go
• Available in
  – Maxwell and Delaney – Designing
    experiments and analysing data
  – Judd and McClelland – Data Analysis (out
    of print?)
                                               43
              What's a df?
• The number of parameters free to vary
  – When one is fixed
• Term comes from engineering
  – Movement available to structures




                                          44
    0 df              1 df
No variation   Fix 1 corner, the
 available      shape is fixed




                                   45
         Back to the Data
• Mean has 5 (N) df
  – 1st moment
• σ has N – 1 df
  – Mean has been fixed
  – 2nd moment
  – Can think of as amount cases vary away
    from the mean

                                             46
       While we are at it …
• Skewness has N – 2 df
  – 3rd moment
• Kurtosis has N – 3 df
  – 4th moment
  – Amount cases vary from σ




                               47
          Parsimony and df
• Number of df remaining
  – Measure of parsimony
• Model which contained all the data
  – Has 0 df
  – Not a parsimonious model
• Normal distribution
  – Can be described in terms of mean and σ
     • 2 parameters
  – (z with 0 parameters)
                                              48
       Summary of Lesson 1
• Statistics is about modelling DATA
  – Models have parameters
  – Fewer parameters, more parsimony, better
• Models need to minimise ERROR
  – Best model, least ERROR
  – Depends on how we define ERROR
  – If we define error as sum of squared deviations
    from predicted value
  – Mean is best MODEL
                                                      49
 Lesson 2: Models with one
more parameter - regression




                              52
     In Lesson 1 we said …
• Use a model to predict and describe
  data
  – Mean is a simple, one parameter model




                                            53
  More Models

Slopes and Intercepts




                        54
             More Models
• The mean is OK
  – As far as it goes
  – It just doesn't go very far
  – Very simple prediction, uses very little
    information
• We often have more information than
  that
  – We want to use more information than that


                                               55
              House Prices
• In the UK, two of the largest lenders
  (Halifax and Nationwide) compile house
  price indices
  – Predict the price of a house
  – Examine effect of different circumstances
• Look at change in prices
  – Guides legislation
     • E.g. interest rates, town planning


                                                56
Predicting House Prices
    Beds   £ (000s)
      1      77
      2      74
      1      88
      3      62
      5      90
      5      136
      2      35
      5      134
      4      138
      1      55
                          57
      One Parameter Model
• The mean

$$\bar{Y} = 88.9 \qquad \hat{Y} = b_0 = \bar{Y} \qquad SSE = 11806.9$$

“How much is that house worth?”
“£88,900”
Use 1 df to say that
   Adding More Parameters
• We have more information than this
  – We might as well use it
  – Add a linear function of number of
    bedrooms (x1)


$$\hat{Y} = b_0 + b_1 x_1$$
                                         59
     Alternative Expression

• Estimate of Y (expected value of Y)

$$\hat{Y} = b_0 + b_1 x_1$$

• Value of Y

$$Y_i = b_0 + b_1 x_{i1} + e_i$$
                                        60
            Estimating the Model
• We can estimate this model in four different,
  equivalent ways
     – Provides more than one way of thinking about it
1.   Estimating the slope which minimises SSE
2.   Examining the proportional reduction in SSE
3.   Calculating the covariance
4.   Looking at the efficiency of the predictions

                                                  61
Estimate the Slope to Minimise
             SSE




                             62
         Estimate the Slope
• Stage 1
  – Draw a scatterplot
  – x-axis at mean
     • Not at zero
• Mark errors on it
  – Called 'residuals'
  – Sum and square these to find SSE



                                       63
[Figure: scatterplot of price (£000s) against number of bedrooms, with the x-axis drawn at the mean price and the residuals marked (shown on slides 64 and 65).]
                                              65
• Add another slope to the chart
  – Redraw residuals
  – Recalculate SSE
  – Move the line around to find slope which
    minimises SSE
• Find the slope




                                               66
• First attempt:

[Figure: the scatterplot with a first attempt at the regression line drawn in.]
                   67
• Any straight line can be defined with
  two parameters
  – The location (height) of the slope
     • b0
        – Sometimes called a
  – The gradient of the slope
     • b1




                                          68
• Gradient

[Figure: for each 1 unit moved along the x-axis, the line rises by b1 units.]
                                 69
• Height

[Figure: the line sits b0 units up the y-axis.]
             70
• Height
• If we fix slope to zero
  – Height becomes mean
  – Hence mean is b0
• Height is defined as the point that the
  slope hits the y-axis
  – The constant
  – The y-intercept



                                            71
• Why the constant?
  – b0x0
  – Where x0 is 1.00 for every case
     • i.e. x0 is constant
• Implicit in SPSS
  – Some packages force you to make it explicit
  – (Later on we'll need to make it explicit)

    beds (x1)   x0   £ (000s)
        1        1       77
        2        1       74
        1        1       88
        3        1       62
        5        1       90
        5        1      136
        2        1       35
        5        1      134
        4        1      138
        1        1       55
                                                  72
• Why the intercept?
  – Where the regression line intercepts the y-
    axis
  – Sometimes called y-intercept




                                              73
          Finding the Slope
• How do we find the values of b0 and b1?
  – To start with: jiggle the values, to find the
    best estimates which minimise SSE
  – Iterative approach
     • Computer intensive – used to matter, doesn't
       really any more
     • (With fast computers and sensible search
       algorithms – more on that later)


                                                      74
• Start with
  – b0=88.9 (mean)
  – b1=10 (nice round number)
     • SSE = 14948 – worse than it was
  – b0=86.9,   b1=10,   SSE=13828
  – b0=66.9,   b1=10,   SSE=7029
  – b0=56.9,   b1=10,   SSE=6628
  – b0=46.9,   b1=10,   SSE=8228
  – b0=51.9,   b1=10,   SSE=7178
  – b0=51.9,   b1=12,   SSE=6179
  – b0=46.9,   b1=14,   SSE=5957
  – ……..
                                         75
• Quite a long time later
  – b0 = 46.000372
  – b1 = 14.79182
  – SSE = 5921
• Gives the position of the
  – Regression line (or)
  – Line of best fit
     • Better than guessing
• Not necessarily the only method
  – But it is OLS, so it is the best (it is BLUE)

                                                    76
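The jiggling can be delegated to a general-purpose optimiser. A minimal R sketch of the iterative approach, using the ten houses from the table:

    beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
    price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)  # in £000s

    sse <- function(b) sum((price - (b[1] + b[2] * beds))^2)

    fit <- optim(c(88.9, 10), sse)  # start from the mean and a nice round slope
    fit$par                         # ~46.0 and ~14.79: b0 and b1
    fit$value                       # ~5921: the minimised SSE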
[Figure: scatterplot of price (£000s) against number of bedrooms, showing the actual prices as points and the fitted line of predicted prices.]
                                                                             77
• We now know
  – A house with no bedrooms is worth ≈
    £46,000 (??!)
  – Adding a bedroom adds ≈ £15,000
• Told us two things
  – Don't extrapolate to meaningless values of
    x-axis
  – Constant is not necessarily useful
     • It is necessary to estimate the equation



                                                  78
Standardised Regression Line
• One big but:
  – Scale dependent
• Values change
  – £ to €, inflation
• Scales change
  – £, £000, £00?
• Need to deal with this
                           79
• Don't express in 'raw' units
  – Express in SD units
  – σx1 = 1.72
  – σy = 36.21
• b1 = 14.79
• We increase x1 by 1, and Ŷ increases by
  14.79

$$14.79 = (14.79 / 36.21)\ \text{SDs} = 0.408\ \text{SDs}$$


                                            80
• Similarly, 1 unit of x1 = 1/1.72 SDs
  – Increase x1 by 1 SD (i.e. 1.72 units)
  – Ŷ increases by 14.79 × 1.72 = 25.4
• Put them both together

$$b_1 \times \frac{\sigma_{x_1}}{\sigma_y}$$
                                             81
    14.79 1.72
                 0.706
       36.21
• The standardised regression line
  – Change (in SDs) in Ŷ associated with a
    change of 1 SD in x1
• A different route to the same answer
  – Standardise both variables (divide by SD)
  – Find line of best fit

                                                82
• The standardised regression line has a
  special name
    The Correlation Coefficient
                (r)
  (r stands for 'regression', but more on that
    later)
• Correlation coefficient is a standardised
  regression slope
  – Relative change, in terms of SDs

                                                 83
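Continuing the R sketch with the same beds and price vectors, the standardised slope can be checked against the built-in correlation:

    b1 <- 14.79
    b1 * sd(beds) / sd(price)  # standardised slope: ~0.706
    cor(beds, price)           # the correlation coefficient: the same value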
Proportional Reduction in
          Error




                            84
Proportional Reduction in Error
• We might be interested in the level of
  improvement of the model
  – How much less error (as proportion) do we
    have
  – Proportional Reduction in Error (PRE)
• Mean only
  – Error(model 0) = 11806
• Mean + slope
  – Error(model 1) = 5921
                                            85
      ERROR(0)  ERROR(1)
PRE 
          ERROR(0)
          ERROR(1)
PRE  1 
          ERROR(0)
           5921
PRE  1 
          11806
PRE  0.4984
                            86
• But we squared all the errors in the first
  place
   – So we could take the square root
   – (It's a shoddy excuse, but it makes the
     point)

$$\sqrt{0.4984} = 0.706$$

• This is the correlation coefficient
• Correlation coefficient is the square root
  of the proportion of variance explained
                                               87
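The same calculation in R, continuing with the house-price vectors:

    sse0 <- sum((price - mean(price))^2)  # ERROR(0): mean-only model, ~11806
    fit  <- lm(price ~ beds)              # mean + slope model
    sse1 <- sum(resid(fit)^2)             # ERROR(1): ~5921

    pre <- 1 - sse1 / sse0  # proportional reduction in error: ~0.498
    sqrt(pre)               # ~0.706: the correlation again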
Standardised Covariance




                          88
    Standardised Covariance
• We are still iterating
  – Need a 'closed form'
  – An equation to solve to get the parameter
    estimates
• Answer is a standardised covariance
  – A variable has variance
  – Amount of 'differentness'
• We have used SSE so far
                                             89
• SSE varies with N
  – Higher N, higher SSE
• Divide by N
  – Gives SSE per person
  – (Actually N – 1, we have lost a df to the
    mean)
• The variance
• Same as SD2
  – We thought of SSE as a scattergram
     • Y plotted against X
  – (repeated image follows)

                                                90
[Figure: the price-by-bedrooms scatterplot, repeated from earlier.]

                                               91
• Or we could plot Y against Y
  – Axes meet at the mean (88.9)
  – Draw a square for each point
  – Calculate an area for each square
  – Sum the areas
• Sum of areas
  – SSE
• Sum of areas divided by N
  – Variance


                                        92
          Plot of Y against Y

[Figure: Y plotted against Y, with the axes crossing at the mean (88.9).]
                                                 93
                     Draw Squares

[Figure: the Y-against-Y plot with a square drawn for each point. For the £138,000 house: 138 – 88.9 = 40.1 on each side, so area = 40.1 × 40.1 = 1608.1. For the £35,000 house: 35 – 88.9 = –53.9 on each side, so area = –53.9 × –53.9 = 2905.21.]
                                                                  94
• What if we do the same procedure
    – Instead of Y against Y
    – Y against X
•   Draw rectangles (not squares)
•   Sum the area
•   Divide by N - 1
•   This gives us the variance of x with y
    – The Covariance
    – Shortened to Cov(x, y)

                                             95
[Figure: Y plotted against X, with a rectangle drawn for each point. For the 1-bed £55,000 house: 55 – 88.9 = –33.9 and 1 – 3 = –2, so area = (–33.9) × (–2) = 67.8. For the 4-bed £138,000 house: 138 – 88.9 = 49.1 and 4 – 3 = 1, so area = 49.1 × 1 = 49.1.]
                                                     97
• More formally (and easily)
• We can state what we are doing as an
  equation
  – Where Cov(x, y) is the covariance

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N - 1}$$

• Cov(x, y) = 44.2
• What do points in different sectors do
  to the covariance?
                                           98
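The equation, written out in R alongside the built-in function:

    # covariance 'by hand': sum of the rectangle areas, divided by N - 1
    sum((beds - mean(beds)) * (price - mean(price))) / (length(beds) - 1)  # ~44.2
    cov(beds, price)                                                       # the same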
• Problem with the covariance
  – Tells us about two things
  – The variance of X and Y
  – The covariance
• Need to standardise it
  – Like the slope
• Two ways to standardise the covariance
  – Standardise the variables first
     • Subtract the mean and divide by SD
  – Standardise the covariance afterwards

                                             99
• First approach
  – Much more computationally expensive
     • Too much like hard work to do by hand
  – Need to standardise every value
• Second approach
  – Much easier
  – Standardise the final value only
• Need the combined variance
  – Multiply two variances
  – Find square root (were multiplied in first
    place)

                                                 100
• Standardised covariance

$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x) \times \mathrm{Var}(y)}} = \frac{44.2}{\sqrt{2.9 \times 1311}} = 0.706$$

                              101
• The correlation coefficient
  – A standardised covariance is a correlation
    coefficient

$$r = \frac{\text{covariance}}{\sqrt{\text{variance} \times \text{variance}}}$$
                                                 102
• Expanded …

$$r = \frac{\dfrac{\sum (x - \bar{x})(y - \bar{y})}{N - 1}}{\sqrt{\dfrac{\sum (x - \bar{x})^2}{N - 1} \times \dfrac{\sum (y - \bar{y})^2}{N - 1}}}$$
                                      103
• This means …
  – We now have a closed form equation to
    calculate the correlation
  – Which is the standardised slope
  – Which we can use to calculate the
    unstandardised slope




                                            104
We know that:

$$r = \frac{b_1 \sigma_{x_1}}{\sigma_y}$$

We know that:

$$b_1 = \frac{r \sigma_y}{\sigma_{x_1}}$$

$$b_1 = \frac{0.706 \times 36.21}{1.72} = 14.79$$

• So value of b1 is the same as the iterative
  approach
                                                106
• The intercept
  – Just while we are at it
• The variables are centred at zero
  – We subtracted the mean from both
    variables
  – Intercept is zero, because the axes cross at
    the mean




                                              107
• Add mean of y to the constant
  – Adjusts for centring y
• Subtract mean of x
  – But not the whole mean of x
  – Need to correct it for the slope

$$c = \bar{y} - b_1 \bar{x}_1 = 88.9 - 14.79 \times 2.9 = 46.00$$

  • Naturally, the same
                                       108
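Both closed-form results, in three lines of R on the same data:

    r  <- cor(beds, price)               # 0.706
    b1 <- r * sd(price) / sd(beds)       # ~14.79, the slope
    b0 <- mean(price) - b1 * mean(beds)  # ~46.0, the constant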
Accuracy of Prediction




                         109
      One More (Last One)
• We have one more way to calculate the
  correlation
  – Looking at the accuracy of the prediction
• Use the parameters
  – b0 and b1
  – To calculate a predicted value for each
    case



                                                110
  Beds   Actual Price   Predicted Price
    1         77             60.80
    2         74             75.59
    1         88             60.80
    3         62             90.38
    5         90            119.96
    5        136            119.96
    2         35             75.59
    5        134            119.96
    4        138            105.17
    1         55             60.80

• Plot actual price against predicted price
  – From the model

                                              111
[Figure: scatterplot of predicted value against actual value for the ten houses.]
                                                                 112
• r = 0.706
  – The correlation
• Seems a futile thing to do
  – And at this stage, it is
  – But later on, we will see why




                                    113
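The same check in R, using the b0 and b1 computed above:

    pred <- b0 + b1 * beds  # predicted price for each house
    cor(price, pred)        # ~0.706: correlating Y with Y-hat recovers r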
       Some More Formulae
• For hand calculation (x and y expressed as
  deviations from their means)

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

• Point biserial

$$r = \frac{(M_{y1} - M_{y0})\sqrt{PQ}}{sd_y}$$
                                                114
• Phi (φ)
  – Used for 2 dichotomous variables

                     Vote P    Vote Q
    Homeowner        A: 19     B: 54
    Not homeowner    C: 60     D: 53

$$r = \frac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$$

                                            115
• Problem with the phi correlation
  – Unless Px = Py (or Px = 1 – Py)
     • Maximum (absolute) value is < 1.00
     • Tetrachoric can be used
• Rank (Spearman) correlation
  – Used where data are ranked

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$



                                            116
                Summary
• Mean is an OLS estimate
  – OLS estimates are BLUE
• Regression line
  – Best prediction of DV from IV
  – OLS estimate (like mean)
• Standardised regression line
  – A correlation


                                    117
• Four ways to think about a correlation
  – 1.   Standardised regression line
  – 2.   Proportional Reduction in Error (PRE)
  – 3.   Standardised covariance
  – 4.   Accuracy of prediction




                                                 118
Lesson 3: Why Regression?

  A little aside, where we look at
 why regression has such a curious
                name.

                                     121
              Regression
 The or an act of regression; reversion;
   return towards the mean; return to an
    earlier stage of development, as in an
   adult's or an adolescent's behaving like
                     a child
   (From Latin gradi, to go)
• So why give this name to a statistical
  technique which is about prediction and
  explanation?

                                        122
• Francis Galton
  – Charles Darwin‟s cousin
  – Studying heritability
• Tall fathers have shorter sons
• Short fathers have taller sons
  – 'Filial regression toward mediocrity'
  – Regression to the mean




                                            123
• Galton thought this was biological fact
  – Evolutionary basis?
• Then did the analysis backward
  – Tall sons have shorter fathers
  – Short sons have taller fathers
• Regression to the mean
  – Not biological fact, statistical artefact




                                                124
             Other Examples
• Secrist (1933): The Triumph of Mediocrity in
  Business
• Second albums often tend not to be as good
  as the first
• The sequel to a film is not as good as the first
  one
• 'Curse of Athletics Weekly'
• Parents think that punishing bad behaviour
  works, but rewarding good behaviour doesn‟t

                                             125
           Pair Link Diagram

• An alternative to a scatterplot

[Figure: the pair link axes, x and y.]

[Figure: example with r = 1.00.]

[Figure: example with r = 0.00.]
                 128
       From Regression to
           Correlation
• Where do we predict an individual's
  score on y will be, based on their score
  on x?
  – Depends on the correlation
• r = 1.00 – we know exactly where they
  will be
• r = 0.00 – we have no idea
• r = 0.50 – we have some idea
                                         129
[Figure: r = 1.00 – a case that starts here on x will end up here on y.]

[Figure: r = 0.00 – a case that starts here on x could end up anywhere on y.]

[Figure: r = 0.50 – a case that starts here on x will probably end up somewhere here on y.]
                                       132
    Galton Squeeze Diagram
• Don't show individuals
  – Show groups of individuals, from the same
    (or similar) starting point
  – Shows regression to the mean




                                            133
[Figure: Galton squeeze diagram, r = 0.00 – groups starting at either extreme of x all end at the mean of y.]

[Figure: Galton squeeze diagram, r = 0.50.]

[Figure: Galton squeeze diagram, r = 1.00.]
                 136
[Figure: a group starting 1 unit from the mean on x ends up r units from the mean on y.]

• Correlation is amount of regression that
  doesn't occur
                                           137
[Figure: no regression – r = 1.00.]

[Figure: some regression – r = 0.50.]

[Figure: lots (maximum) regression – r = 0.00.]
                            140
   Formula

$$\hat{z}_y = r_{xy} z_x$$
                141
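A one-line illustration of the formula: a father 2 SDs above the mean, with r = 0.50, has a predicted son only 1 SD above the mean – halfway back.

    r_xy <- 0.50
    z_x  <- 2
    r_xy * z_x  # predicted z_y: 1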
                  Conclusion
• Regression towards mean is statistical necessity
       regression = perfection – correlation
• Very non-intuitive
• Interest in regression and correlation
  – From examining the extent of regression towards
    mean
  – By Pearson – worked with Galton
  – Stuck with curious name
• See also Paper B3

                                                      142
   Lesson 4: Samples to
Populations – Standard Errors
 and Statistical Significance




                            145
            The Problem
• In Social Sciences
  – We investigate samples
• Theoretically
  – Randomly taken from a specified
    population
  – Every member has an equal chance of
    being sampled
  – Sampling one member does not alter the
    chances of sampling another
• Not the case in (say) physics, biology,
  etc.                                       146
               Population
• But it's the population that we are
  interested in
  – Not the sample
  – Population statistic represented with Greek
    letter
  – Hat means 'estimate': $b = \hat{\beta}$, $\bar{x} = \hat{\mu}_x$
                                              147
• Sample statistics (e.g. mean) estimate
  population parameters
• Want to know
  – Likely size of the parameter
  – If it is > 0




                                           148
      Sampling Distribution
• We need to know the sampling
  distribution of a parameter estimate
  – How much does it vary from sample to
    sample
• If we make some assumptions
  – We can know the sampling distribution of
    many statistics
  – Start with the mean
                                               149
 Sampling Distribution of the
           Mean
• Given
  – Normal distribution
  – Random sample
  – Continuous data
• Mean has a known sampling distribution
  – Repeatedly sampling will give a known
    distribution of means
  – Centred around the true (population) mean
    ()
                                           150
  Analysis Example: Memory
• Difference in memory for different
  words
  – 10 participants given a list of 30 words to
    learn, and then tested
  – Two types of word
     • Abstract: e.g. love, justice
     • Concrete: e.g. carrot, table



                                                  151
  Concrete   Abstract   Diff (x)
     12          4           8
     11          7           4
      4          6          -2
      9         12          -3
      8          6           2
     12         10           2
      9          8           1
      8          5           3
     12         10           2
      8          4           4

$$\bar{x} = 2.1 \qquad \sigma_x = 3.11 \qquad N = 10$$
                                          152
       Confidence Intervals
• This means
  – If we know the mean in our sample
  – We can estimate where the mean in the
    population () is likely to be
• Using
  – The standard error (se) of the mean
  – Represents the standard deviation of the
    sampling distribution of the mean


                                               153
[Figure: the normal curve – 1 SD either side of the mean contains 68% of cases; almost 2 SDs contain 95%.]
                154
• We know the sampling distribution of
  the mean
  – t distributed
  – Normal with large N (>30)
• Know the range within which means from
  other samples will fall
  – Therefore the likely range of μ

$$se(\bar{x}) = \frac{\sigma_x}{\sqrt{n}}$$
                                         155
• Two implications of equation
  – Increasing N decreases SE
     • But only a bit
  – Decreasing SD decreases SE
• Calculate Confidence Intervals
  – From standard errors
• 95% is a standard level of CI
  – In 95% of samples the true mean will lie within
    the 95% CIs
  – In large samples: 95% CI = 1.96 × SE
  – In smaller samples: depends on t
    distribution (df = N – 1 = 9)
                                             156
x  2.1,
 x  3.11,
N  10
           x 3.11
se( x )           0.98
           n   10
                            157
95% CI  2.26  0.98  2.22


    x  CI    x  CI
    -0.12    4.32

                          158
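The same confidence interval in R, from the ten difference scores:

    diff <- c(8, 4, -2, -3, 2, 2, 1, 3, 2, 4)  # concrete minus abstract

    se    <- sd(diff) / sqrt(length(diff))     # ~0.98
    tcrit <- qt(0.975, df = length(diff) - 1)  # ~2.26

    mean(diff) + c(-1, 1) * tcrit * se         # 95% CI: ~-0.12 to ~4.32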
           What is a CI?
• (For 95% CI):
• 95% chance that the true (population)
  value lies within the confidence
  interval?
• 95% of samples, true mean will land
  within the confidence interval?


                                          159
           Significance Test
• Probability that μ is a certain value
  – Almost always 0
     • Doesn't have to be though
• We want to test the hypothesis that the
  difference is equal to 0
  – i.e. find the probability of this difference
    occurring in our sample IF μ = 0
  – (Not the same as the probability that μ = 0)
                                              160
• Calculate SE, and then t
  – t has a known sampling distribution
  – Can test probability that a certain value is
    included

$$t = \frac{\bar{x}}{se(\bar{x})} = \frac{2.1}{0.98} = 2.14$$

$$p = 0.061$$
                                                   161
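The built-in one-sample test reproduces all of this in one call:

    t.test(diff)  # t = 2.14, df = 9, p = 0.061, 95% CI -0.12 to 4.32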
  Other Parameter Estimates
• Same approach
  – Prediction, slope, intercept, predicted
    values
  – At this point, prediction and slope are the
    same
      • Won't be later on
• We will look at one predictor only
  – More complicated with > 1
                                                  162
      Testing the Degree of
            Prediction
• Prediction is correlation of Y with Ŷ
  – The correlation – when we have one IV
• Use F, rather than t
• Started with SSE for the mean only
  – This is SStotal
  – Divide this into SSresidual and SSregression
• SStot = SSreg + SSres
                                            163
$$F = \frac{SS_{reg} / df_1}{SS_{res} / df_2}$$

$$df_1 = k \qquad df_2 = N - k - 1$$
                              164
• Back to the house prices
  – Original SSE (SStotal) = 11806
  – SSresidual = 5921
     • What is left after our model
  – SSregression = 11806 – 5921 = 5885
     • What our model explains
• Slope = 14.79
• Intercept = 46.0
• r = 0.706

                                         165
$$F = \frac{SS_{reg} / df_1}{SS_{res} / df_2} = \frac{5885 / 1}{5921 / (10 - 1 - 1)} = 7.95$$

$$df_1 = k = 1 \qquad df_2 = N - k - 1 = 8$$
                               166
• F = 7.95, df = 1, 8, p = 0.02
  – Can reject H0
     • H0: Prediction is not better than chance
  – A significant effect




                                                  167
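The whole F test in R, for the house-price model:

    fit <- lm(price ~ beds)
    anova(fit)              # F = 7.95 on 1 and 8 df, p = 0.02
    summary(fit)$r.squared  # ~0.498; the square root of this is r = 0.706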
 Statistical Significance:
What does a p-value (really)
         mean?



                               168
                     A Quiz

• Six questions, each true or false
• Write down your answers (if you like)

• An experiment has been done. Carried out
  perfectly. All assumptions perfectly satisfied.
  Absolutely no problems.
• P = 0.01
   – Which of the following can we say?
                                                169
1. You have absolutely disproved the null
   hypothesis (that is, there is no
   difference between the population
   means).




                                       170
2. You have found the probability of the
   null hypothesis being true.




                                       171
3. You have absolutely proved your
   experimental hypothesis (that there is
   a difference between the population
   means).




                                        172
4. You can deduce the probability of the
   experimental hypothesis being true.




                                       173
5. You know, if you decide to reject the
   null hypothesis, the probability that
   you are making the wrong decision.




                                           174
6. You have a reliable experimental
   finding in the sense that if,
   hypothetically, the experiment were
   repeated a great number of times, you
   would obtain a significant result on
   99% of occasions.


                                      175
      OK, What is a p-value
• Cohen (1994)
  “[a p-value] does not tell us what we
  want to know, and we so much want to
   know what we want to know that, out
  of desperation, we nevertheless believe
              it does” (p 997).


                                       176
      OK, What is a p-value
• Sorry, didn't answer the question
• It's the probability of obtaining a result
  as or more extreme than the result we
  have in the study, given that the null
  hypothesis is true
• Not probability the null hypothesis is
  true

                                           177
            A Bit of Notation
• Not because we like notation
    – But we have to say a lot less


•   Probability – P
•   Null hypothesis is true – H
•   Result (data) – D
•   Given - |
                                      178
           What‟s a P Value
• P(D|H)
  – Probability of the data occurring if the null
    hypothesis is true
• Not
• P(H|D)
  – Probability that the null hypothesis is true,
    given that we have the data
• P(H|D) ≠ P(D|H)
                                                179
• What is the probability you are Prime Minister
  – Given that you are British
  – P(M|B)
  – Very low
• What is the probability you are British
  – Given you are Prime Minister
  – P(B|M)
  – Very high
• P(M|B) ≠ P(B|M)

                                          180
• There's been a murder
  – Someone bumped off a statto for talking too
    much
• The police have DNA
• The police have your DNA
  – They match(!)
• The DNA matches 1 in 1,000,000 people
• What's the probability you didn't do the
  murder, given the DNA match – P(H|D)?

                                             181
• Police say:
  – P(D|H) = 1/1,000,000
• Luckily, you have Jeremy on your defence
  team
• We say:
  – P(D|H) ≠ P(H|D)
• Probability that someone matches the
  DNA, who didn't do the murder
  – Incredibly high


                                         182
     Back to the Questions
• Haller and Kraus (2002)
  – Asked those questions of groups in
    Germany
  – Psychology Students
  – Psychology lecturers and professors (who
    didn't teach stats)
  – Psychology lecturers and professors (who
    did teach stats)
                                               183
1. You have absolutely disproved the null
   hypothesis (that is, there is no difference
   between the population means).
     •   Answered “true”:
         •   34% of students
         •   15% of professors/lecturers
         •   10% of professors/lecturers teaching statistics
•   False
•   We have found evidence against the null
    hypothesis


                                                          184
2. You have found the probability of the
   null hypothesis being true.
    – 32% of students
    – 26% of professors/lecturers
    – 17% of professors/lecturers teaching
      statistics
•   False
•   We don't know



                                             185
3. You have absolutely proved your
   experimental hypothesis (that there is a
   difference between the population means).
    –   20% of students
    –   13% of professors/lecturers
    –   10% of professors/lecturers teaching statistics
•   False


                                                          186
4. You can deduce the probability of the
   experimental hypothesis being true.
    – 59% of students
    – 33% of professors/lecturers
    – 33% of professors/lecturers teaching
      statistics
•   False




                                             187
5. You know, if you decide to reject the null
   hypothesis, the probability that you are
   making the wrong decision.
    •   68% of students
    •   67% of professors/lecturers
     •   73% of professors/lecturers
         teaching statistics
•   False
•   Can be worked out
    – P(replication)


                                                 188
6. You have a reliable experimental finding
   in the sense that if, hypothetically, the
   experiment were repeated a great
   number of times, you would obtain a
   significant result on 99% of occasions.
    – 41% of students
    – 49% of professors/lecturers
     – 37% of professors/lecturers
       teaching statistics
•   False
•   Another tricky one
    – It can be worked out
                                               189
            One Last Quiz
• I carry out a study
  – All assumptions perfectly satisfied
  – Random sample from population
  – I find p = 0.05
• You replicate the study exactly
  – What is probability you find p < 0.05?


                                             190
• I carry out a study
  – All assumptions perfectly satisfied
  – Random sample from population
  – I find p = 0.01
• You replicate the study exactly
  – What is probability you find p < 0.05?


                                             191
• Significance testing creates boundaries
  and gaps where none exist.
• Significance testing means that we find
  it hard to build upon knowledge
  – we don't get an accumulation of
    knowledge



                                        192
• Yates (1951)
"the emphasis given to formal tests of significance
   ... has resulted in ... an undue concentration of
          effort by mathematical statisticians on
   investigations of tests of significance applicable
     to problems which are of little or no practical
     importance ... and ... it has caused scientific
   research workers to pay undue attention to the
     results of the tests of significance ... and too
    little to the estimates of the magnitude of the
               effects they are investigating

                                                  193
         Testing the Slope
• Same idea as with the mean
  – Estimate 95% CI of slope
  – Estimate significance of difference from a
    value (usually 0)
• Need to know the sd of the slope
  – Similar to SD of the mean



                                                 194
           (Y  Yˆ )2
s y. x 
            N  k 1

              SSres
s y. x 
            N  k 1

           5921
s y. x          27.2
            8            195
• Similar to equation for SD of mean
• Then we need standard error
   - Similar (ish)
• When we have standard error
  – Can go on to 95% CI
  – Significance of difference




                                       196
$$se(b_{y \cdot x}) = \frac{s_{y \cdot x}}{\sqrt{\sum (x - \bar{x})^2}} = \frac{27.2}{\sqrt{26.9}} = 5.24$$

                                197
• Confidence Limits
• 95% CI
  – t dist with N – k – 1 df is 2.31
  – CI = 5.24 × 2.31 = 12.1
• 95% confidence limits

$$14.8 - 12.1 = 2.7 \qquad 14.8 + 12.1 = 26.9$$

                                       198
• Significance of difference from zero
   – i.e. probability of getting the result if β = 0
      • Not the probability that β = 0

$$t = \frac{b}{se(b)} = \frac{14.79}{5.24} = 2.82$$

$$df = N - k - 1 = 8 \qquad p = 0.02$$

• This probability is (of course) the same
  as the p value for the prediction
                                                 199
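R reports the slope's standard error, t and p directly, and confint() gives the limits:

    fit <- lm(price ~ beds)
    summary(fit)$coefficients["beds", ]  # estimate 14.79, se ~5.24, t ~2.82, p ~0.02
    confint(fit, "beds")                 # 95% CI: ~2.7 to ~26.9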
   Testing the Standardised
      Slope (Correlation)
• Correlation is bounded between –1 and +1
  – Does not have a symmetrical distribution, except
    around 0
• Need to transform it
  – Fisher z′ transformation – approximately
    normal

$$z' = 0.5[\ln(1 + r) - \ln(1 - r)]$$

$$SE_{z'} = \frac{1}{\sqrt{n - 3}}$$
                                               200
z  0.5[ln(1  0.706)  ln(1  0.706)]
z  0.879
        1     1
SEz               0.38
       n3   10  3
• 95% CIs
  – 0.879 – 1.96 * 0.38 = 0.13
  – 0.879 + 1.96 * 0.38 = 1.62



                                      201
• Transform back to correlation

$$r = \frac{e^{2z'} - 1}{e^{2z'} + 1}$$

• 95% CIs = 0.13 to 0.92
• Very wide
  – Small sample size
  – Maybe that's why CIs are not reported?

                                             202
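Base R has the Fisher transformation built in as atanh() (and tanh() to transform back):

    r <- 0.706; n <- 10
    z  <- atanh(r)                  # Fisher z': ~0.879
    se <- 1 / sqrt(n - 3)           # ~0.38
    tanh(z + c(-1, 1) * 1.96 * se)  # 95% CI for r: ~0.13 to ~0.92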
              Using Excel
• Functions in excel
  – Fisher() – to carry out Fisher
    transformation
  – Fisherinv() – to transform back to
    correlation




                                         203
              The Others
• Same ideas for calculation of CIs and
  SEs for
  – Predicted score
  – Gives expected range of values given X
• Same for intercept
  – But we have probably had enough



                                             204
Lesson 5: Introducing Multiple
         Regression




                             205
                  Residuals
• We said
                      Y = b0 + b1x1
• We could have said
                   Yi = b0 + b1xi1 + ei
• We ignored the i on the Y
• And we ignored the ei
  – It's called error, after all
• But it isn't just error
  – Trying to tell us something
                                          206
        What Error Tells Us
• Error tells us that a case has a different
  score for Y than we predict
  – There is something about that case
• Called the residual
  – What is left over, after the model
• Contains information
  – Something is making the residual ≠ 0
  – But what?

                                           207
[Figure: the price-by-bedrooms scatterplot with actual and predicted prices. One house sits well above the regression line – it has a swimming pool; another sits well below it – unpleasant neighbours.]
                                                                               208
• The residual (+ the mean) is the value
  of Y
        If all cases were equal on X
• It is the value of Y, controlling for X
• Other words:
  – Holding constant
  – Partialling
  – Residualising
  – Conditioned on


                                            209
  Beds   £ (000s)   Pred   Res   Adj. Value
    1        77       61    -16      105
    2        74       76      2       90
    1        88       61    -27       62
    3        62       90     28      117
    5        90      120     30      119
    5       136      120    -16       73
    2        35       76     41      129
    5       134      120    -14       75
    4       138      105    -33       56
    1        55       61      6       95
                                  210
• Sometimes adjustment is enough on its own
  – Measure performance against criteria
• Teenage pregnancy rate
  – Measure pregnancy and abortion rate in areas
  – Control for socio-economic deprivation, and
    anything else important
  – See which areas have lower teenage pregnancy
    and abortion rate, given same level of deprivation
• Value added education tables
  – Measure school performance
  – Control for initial intake


                                                    211
                 Control?
• In experimental research
  – Use experimental control
  – e.g. same conditions, materials, time of
    day, accurate measures, random
    assignment to conditions
• In non-experimental research
  – Can't use experimental control
  – Use statistical control instead


                                               212
       Analysis of Residuals
• What predicts differences in crime rate
  – After controlling for socio-economic
    deprivation
  – Number of police?
  – Crime prevention schemes?
  – Rural/Urban proportions?
  – Something else
• This is what regression is about

                                           213
• Exam performance
  – Consider number of books a student read
    (books)
  – Number of lectures (max 20) a student
    attended (attend)
• Books and attend as IV, grade as DV




                                              214
  Books   Attend   Grade
    0         9       45
    1        15       57
    0        10       45
    2        16       51
    4        10       65
    4        20       88
    1        11       44
    4        20       87
    3        15       89
    0        15       59

(First 10 cases)
                                              215
• Use books as IV
  – R=0.492, F=12.1, df=1, 28, p=0.001
  – b0=52.1, b1=5.7
  – (Intercept makes sense)
• Use attend as IV
  – R=0.482, F=11.5, df=1, 38, p=0.002
  – b0=37.0, b1=1.9
  – (Intercept makes less sense)




                                         216
[Figure: scatterplot of grade (out of 100) against books.]

[Figure: scatterplot of grade against attend.]
                                                             218
               Problem
• Use R2 to give proportion of shared
  variance
  – Books = 24%
  – Attend = 23%
• So we have explained 24% + 23% =
  47% of the variance
  – NO!!!!!


                                        219
• Look at the correlation matrix

              BOOKS   ATTEND   GRADE
   BOOKS       1
   ATTEND     0.44      1
   GRADE      0.49     0.48      1

• Correlation of books and attend is
  (unsurprisingly) not zero
  – Some of the variance that books shares
    with grade, is also shared by attend
                                             220
• I have access to 2 cars
• My wife has access to 2 cars
  – We have access to four cars?
  – No. We need to know how many of my 2
    cars are the same cars as her 2 cars
• Similarly with regression
  – But we can do this with the residuals
  – Residuals are what is left after (say) books
  – See if residual variance is explained by
    attend
  – Can use this new residual variance to
    calculate SSres, SStotal and SSreg
                                              221
• Well. Almost.
  – This would give us correct values for SS
  – Would not be correct for slopes, etc
• Assumes that the variables have a
  causal priority
  – Why should attend have to take what is
    left from books?
  – Why should books have to take what is left
    by attend?
• Use OLS again

                                               222
• Simultaneously estimate 2 parameters
  – b1 and b2
  – Y = b0 + b1x1 + b2x2
  – x1 and x2 are IVs
• Not trying to fit a line any more
  – Trying to fit a plane
• Can solve iteratively
  – Closed form equations better
  – But they are unwieldy


                                         223
[Figure: 3D scatterplot of y against x1 and x2 (2 points only).]

[Figure: the fitted regression plane, with intercept b0 and slopes b1 (along x1) and b2 (along x2).]
                         225
           (Really) Ridiculous Equations

$$b_1 = \frac{\sum(y - \bar{y})(x_1 - \bar{x}_1)\sum(x_2 - \bar{x}_2)^2 - \sum(y - \bar{y})(x_2 - \bar{x}_2)\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)}{\sum(x_1 - \bar{x}_1)^2\sum(x_2 - \bar{x}_2)^2 - \left[\sum(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\right]^2}$$

$$b_2 = \frac{\sum(y - \bar{y})(x_2 - \bar{x}_2)\sum(x_1 - \bar{x}_1)^2 - \sum(y - \bar{y})(x_1 - \bar{x}_1)\sum(x_2 - \bar{x}_2)(x_1 - \bar{x}_1)}{\sum(x_2 - \bar{x}_2)^2\sum(x_1 - \bar{x}_1)^2 - \left[\sum(x_2 - \bar{x}_2)(x_1 - \bar{x}_1)\right]^2}$$

$$b_0 = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2$$
                                                                                  226
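Nobody solves these by hand. A minimal R sketch using the ten cases of books, attend and grade shown earlier (the full dataset has more cases, so the slide's statistics will not be reproduced exactly):

    books  <- c(0, 1, 0, 2, 4, 4, 1, 4, 3, 0)
    attend <- c(9, 15, 10, 16, 10, 20, 11, 20, 15, 15)
    grade  <- c(45, 57, 45, 51, 65, 88, 44, 87, 89, 59)

    lm(grade ~ books + attend)  # b0, b1 and b2 by OLS, in one line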
• The good news
  – There is an easier way
• The bad news
  – It involves matrix algebra
• The good news
  – We don't really need to know how to do it
• The bad news
  – We need to know it exists



                                            227
 A Quick Guide to Matrix
         Algebra
(I will never make you do it again)




                                      228
  Very Quick Guide to Matrix
           Algebra
• Why?
  – Matrices make life much easier in
    multivariate statistics
  – Some things simply cannot be done
    without them
  – Some things are much easier with them
• If you can manipulate matrices
  – you can specify calculations v. easily
  – e.g. A'A = sum of squares of a column vector A
    • Doesn‟t matter how long the column
                                             229
 • A scalar is a number
      A scalar: 4
 • A vector is a row or column of numbers



 A row vector:     ( 2  4  8  7 )

                   | 5  |
 A column vector:  | 11 |
                                        230
• A vector is described as rows x columns

          2    4 8 7
  – Is a 1  4 vector

                        5
                         
                        11 
                         
  – Is a 2  1 vector
  – A number (scalar) is a 1  1 vector

                                          231
 • A matrix is a rectangle, described as
   rows x columns


           2 6 5 7 8
                    
           4 5 7 5 3
           1 5 2 7 8
                    
• Is a 3 x 5 matrix
• Matrices are referred to with bold capitals
  - A is a matrix                           232
• Correlation matrices and covariance
  matrices are special
  – They are square and symmetrical
  – Correlation matrix of books, attend and
    grade

      1.00 0.44 0.49 
                     
      0.44 1.00 0.48 
      0.49 0.48 1.00 
                     
                                              233
• Another special matrix is the identity
  matrix I
   – A square matrix, with 1 in the diagonal and
     0 in the off-diagonal

                 1      0 0 0
                             
                 0      1 0 0
               I
                   0     0 1 0
                 
                 0           
                        0 0 1
                              
– Note that this is a correlation matrix, with
  correlations all = 0
                                                 234
         Matrix Operations
• Transposition
  – A matrix is transposed by putting it on its
    side
  – Transpose of A is A'

        A = ( 7 5 6 )

             | 7 |
        A' = | 5 |
             | 6 |
                                                  235
• Matrix multiplication
  – A matrix can be multiplied by a scalar, a
    vector or a matrix
  – Not commutative
  – AB  BA
  – To multiply AB
     • Number of columns in A must equal number of
       rows in B




                                                  236
     • Matrix by vector


  | a d g |   | j |   | aj + dk + gl |
  | b e h | × | k | = | bj + ek + hl |
  | c f i |   | l |   | cj + fk + il |

  |  2  3  5 |   | 2 |   |  4 +  9 + 20 |   |  33 |
  |  7 11 13 | × | 3 | = | 14 + 33 + 52 | = |  99 |
  | 17 19 23 |   | 4 |   | 34 + 57 + 92 |   | 183 |
                                                  237
• Matrix by matrix

  | a b |   | e f |   | ae + bg   af + bh |
  | c d | × | g h | = | ce + dg   cf + dh |

  | 2 3 |   | 2 3 |   |  4 + 12    6 + 15 |   | 16 21 |
  | 5 7 | × | 4 5 | = | 10 + 28   15 + 35 | = | 38 50 |
                                          238
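A short R sketch of these products; %*% is R's matrix multiplication operator:

A <- rbind(c(2, 3, 5), c(7, 11, 13), c(17, 19, 23))
v <- c(2, 3, 4)
A %*% v    # 33, 99, 183 — as above

B <- rbind(c(2, 3), c(5, 7))
C <- rbind(c(2, 3), c(4, 5))
B %*% C    # rows (16, 21) and (38, 50)
C %*% B    # a different matrix: multiplication is not commutative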
• Multiplying by the identity matrix
  – Has no effect
  – Like multiplying by 1

      AI = A

  | 2 3 |   | 1 0 |   | 2 3 |
  | 5 7 | × | 0 1 | = | 5 7 |


                                       239
• The inverse of J is: 1/J
• J x 1/J = 1
• Same with matrices
  – Matrices have an inverse
  – Inverse of A is A-1
  – AA-1=I
• Inverting matrices is dull
  – We will do it once
  – But first, we must calculate the
    determinant
                                       240
• The determinant of A is |A|
• Determinants are important in statistics
  – (more so than the other matrix algebra)
• We will do a 2x2
  – Much more difficult for larger matrices




                                              241
      A = | a b |
          | c d |

      |A| = ad − cb

      A = | 1.0 0.3 |
          | 0.3 1.0 |

      |A| = 1 × 1 − 0.3 × 0.3
      |A| = 0.91
                        242
• Determinants are important because
  – Needs to be above zero for regression to
    work
  – Zero or negative determinant of a
    correlation/covariance matrix means
    something wrong with the data
     • Linear redundancy
• Described as:
  – Not positive definite
  – Singular (if determinant is zero)
     • In different error messages

                                               243
• Next, the adjoint
          A = | a b |
              | c d |

      adj A = |  d −b |
              | −c  a |

• Now

      A⁻¹ = (1 / |A|) · adj A
                                244
• Find A-1

     1.0 0.3 
  A
     0.3 1.0 
              
             
   A  0.91

     1      1    1.0  0.3 
 A             
                   0.3 1.0 
                             
            0.91            
     1     1.10  0.33 
 A        
             0.33 1.10 
                         
                        
                                 245
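In R the determinant and inverse are det() and solve(); a sketch reproducing the example above:

A <- rbind(c(1, 0.3), c(0.3, 1))
det(A)           # 0.91
solve(A)         # the inverse: 1.10 and -0.33 (to 2 dp)
A %*% solve(A)   # recovers the identity matrix (up to rounding)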
Matrix Algebra with
Correlation Matrices




                       246
            Determinants
• Determinant of a correlation matrix
  – The volume of „space‟ taken up by the
    (hyper) sphere that contains all of the
    points

       1.0 0.0 
   A          
       0.0 1.0 
   A  1.0
                                              247
[Scatterplot: five points spread over the whole plane — uncorrelated]

      A = | 1.0 0.0 |
          | 0.0 1.0 |

      |A| = 1.0
                          248
[Scatterplot: the points all falling on a straight line — perfectly correlated]

      A = | 1.0 1.0 |
          | 1.0 1.0 |

      |A| = 0.0
                          249
      Negative Determinant
• Points take up less than no space
  – Correlation matrix cannot exist
  – Non-positive definite matrix




                                      250
  Sometimes Obvious

  1.0 1.2 
A        
  1.2 1.0 
          
A  0.44

                      251
Sometimes Obvious (If You
         Think)
   1    0.9 0.9 
   0.9
A       1        
              0.9 
   0.9 0.9      
              1 

 A  2.88
                            252
         Sometimes No Idea
   1.00 0.76 0.40 
   0.76
A        1        
               0.30 
   0.40 0.30  1 
                   

  A  0.01      1.00 0.75 0.40 
                 0.75
              A        1        
                             0.30 
                 0.40 0.30  1 
                                 
                 A  0.0075           253
  Multiple R for Each Variable
• Diagonal of inverse of correlation matrix
  – Used to calculate multiple R
  – Call elements aij


      R(i.123…k) = √( 1 − 1/aii )
                                         254
       Regression Weights
• Where i is DV
• j is IV


      b(i.j) = −aij / aii
                            255
    Back to the Good News
• We can calculate the standardised
  parameters as
              B = Rxx⁻¹ × Rxy
• Where
  – B is the vector of regression weights
  – Rxx-1 is the inverse of the correlation matrix
    of the independent (x) variables
  – Rxy is the vector of correlations of the x
    variables with the y variable
  – Now do exercise 3.2

                                                256
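A minimal R sketch of B = Rxx⁻¹Rxy for the books/attend/grade correlation matrix used above:

R <- rbind(c(1.00, 0.44, 0.49),
           c(0.44, 1.00, 0.48),
           c(0.49, 0.48, 1.00))   # books, attend, grade
Rxx <- R[1:2, 1:2]   # correlations among the IVs
Rxy <- R[1:2, 3]     # correlations of the IVs with the DV
solve(Rxx) %*% Rxy   # the standardised regression weights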
          One More Thing

• The whole regression equation can be
  described with matrices
  – very simply



   Y = XB + E
                                         257
• Where
  – Y = vector of DV
  – X = matrix of IVs
  – B = vector of coefficients
• Go all the way back to our example




                                       258
1   0    9            e1   45 
1   1    5           e   57 
                       2  
1   0   10            e3   45 
                        
1   2   16 
               b0     e4   51 
1   4   10     e5   65 
             b1       
1   4   20     e6   88 
              2   e7   44 
                b
1   1   11
                        
1   4   20            e8   87 
1   3   15            e   89 

1                     9  
    0   15 
            
                        e   59 
                        10   

                                       259
[The equation above repeated, with the first column of X highlighted]
The constant – literally a constant. Could be any
number, but it is most convenient to make it 1. Used
to 'capture' the intercept.
                                       260
1   0    9           e1   45 
                       
1   1    5           e2   57 
1   0   10           e   45 
                     3  
1   2   16  The matrix 51 values for
                       e4   of
1   4       b0   (books65  attend)
         10     IVs e5   and
            b1       
1   4   20    e6   88 
            b2   e   
1   1   11           7   44 
1   4   20           e8   87 
                       
1   3   15           e9   89 
1   0   15           e   59 
                     10   

                                       261
               1 0 9             e1   45 
                                   
               1 1 5             e2   57 
               1 0 10            e   45 
                                 3  
               1 2 16            e4   51 
               1 4 10  b0   e   65 
                           
    The parameter
                        b1    5    
  estimates. We are 20    e6   88 
               1 4
               
trying to find 1 1 11 
               the best  b2   e   
                                   7   44 
    values of these. 20 
               1 4                e8   87 
                                   
               1 3 15            e9   89 
               1 0 15            e   59 
                                 10   

                                                  262
[The equation repeated, with the vector E highlighted]
Error. We are trying to minimise this.
                                                 263
  1 0      9           e1   45 
                         
  1 1      5           e2   57 
  1 0     10           e   45 
                       3  
  1 2     16           e4   51 
  1 4         b0   e   65 
           10  
              b1    5    
  1 4     20    e6   88 
              b2   e   
  1 1     11           7   44 
  1 4     20           e8   87 
                         
  1 3     15           e9   89 
  1 0
The DV     grade  e10   59 
         - 15 
                         

                                        264
• Y = XB + E
• Simple way of representing as many IVs as
  you like
Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e

                                     | b0 |
                                     | b1 |
  | x01  x11  x21  x31  x41  x51 |   | b2 |   | e1 |
  | x02  x12  x22  x32  x42  x52 | × | b3 | + | e2 |
                                     | b4 |
                                     | b5 |
                                                              265
                                     | b0 |
                                     | b1 |
  | x01  x11  x21  x31  x41  x51 |   | b2 |   | e1 |
  | x02  x12  x22  x32  x42  x52 | × | b3 | + | e2 |  =  b0x0 + b1x1 + … + bkxk + e
                                     | b4 |
                                     | b5 |
                                                266
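A sketch of the whole thing in R, estimating B by the normal equations B = (X′X)⁻¹X′Y for the books/attend/grade data:

books  <- c(0, 1, 0, 2, 4, 4, 1, 4, 3, 0)
attend <- c(9, 5, 10, 16, 10, 20, 11, 20, 15, 15)
grade  <- c(45, 57, 45, 51, 65, 88, 44, 87, 89, 59)

X <- cbind(1, books, attend)                # augmented matrix: constant first
B <- solve(t(X) %*% X) %*% t(X) %*% grade   # (X'X)^-1 X'Y
B                                           # b0, b1, b2
E <- grade - X %*% B                        # the residuals we are minimising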
  Generalises to Multivariate
            Case
• Y = XB + E
• Y, B and E
  – Matrices, not vectors
• Goes beyond this course
  – (Do Jacques Tacq‟s course for more)
  – (Or read his book)


                                          267
268
269
270
Lesson 6: More on Multiple
       Regression




                             271
      Parameter Estimates
• Parameter estimates (b1, b2 … bk) were
  standardised
  – Because we analysed a correlation matrix
• Represent the correlation of each IV
  with the DV
  – When all other IVs are held constant



                                               272
• Can also be unstandardised
• Unstandardised represent the unit
  change in the DV associated with a 1
  unit change in the IV
  – When all the other variables are held
    constant
• Parameters have standard errors
  associated with them
  – As with one IV
  – Hence t-test, and associated probability
    can be calculated
    • Trickier than with one IV
                                               273
 Standard Error of Regression
         Coefficient
• Standardised is easier

      SE(bi) = √[ (1 − R²Y) / (n − k − 1) ] × √[ 1 / (1 − R²i) ]


  – R2i is the value of R2 when all other predictors are
    used as predictors of that variable
     • Note that if R2i = 0, the equation is the same as for
       previous


                                                          274
                Multiple R

• The degree of prediction
  – R (or Multiple R)
  – No longer equal to b
• R² equals the sum of the squared standardised
  b's
  – Only if all x's are uncorrelated



                                       275
       In Terms of Variance
• Can also think of this in terms of
  variance explained.
  – Each IV explains some variance in the DV
  – The IVs share some of their variance
• Can‟t share the same variance twice




                                               276
[Venn diagram: the total variance of Y (= 1) contains two non-overlapping
regions — variance in Y accounted for by x1 (r²x1y = 0.36) and variance in Y
accounted for by x2 (r²x2y = 0.36)]
                                       277
• In this model
  – R² = r²yx1 + r²yx2
  – R² = 0.36 + 0.36 = 0.72
  – R = √0.72 = 0.85
• But
  – If x1 and x2 are correlated
  – No longer the case




                                  278
[Venn diagram: the same two regions, now overlapping; the overlap is the
variance shared between x1 and x2 (not equal to rx1x2)]
                                     279
• So
  – We can no longer sum the r2
  – Need to sum them, and subtract the
    shared variance – i.e. the correlation
• But
  – It‟s not the correlation between them
  – It‟s the correlation between them as a
    proportion of the variance of Y
• Two different ways

                                             280
• Based on estimates

      R² = b1·ryx1 + b2·ryx2

• If rx1x2 = 0
  – bi = ryxi
  – Equivalent to r²yx1 + r²yx2



                                  281
 • Based on correlations


      R² = ( r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2 ) / ( 1 − r²x1x2 )

• If rx1x2 = 0
  – Equivalent to r²yx1 + r²yx2
                                                 282
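A quick R check of this formula, using the books/attend/grade correlations from earlier:

r_yx1 <- 0.49; r_yx2 <- 0.48; r_x1x2 <- 0.44
(r_yx1^2 + r_yx2^2 - 2 * r_yx1 * r_yx2 * r_x1x2) / (1 - r_x1x2^2)
# about 0.33; with r_x1x2 = 0 it reduces to r_yx1^2 + r_yx2^2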
• Can also be calculated using methods
  we have seen
  – Based on PRE
  – Based on correlation with prediction
• Same procedure with >2 IVs




                                           283
             Adjusted R2
• R2 is an overestimate of population
  value of R2
  – Any x will not correlate 0 with Y
  – Any variation away from 0 increases R
  – Variation from 0 more pronounced with
    lower N
• Need to correct R2
  – Adjusted R2

                                            284
 • Calculation of Adj. R2


      Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)

• 1 – R2
  – Proportion of unexplained variance
  – We multiply this by an adjustment
     • More variables – greater adjustment
     • More people – less adjustment
                                             285
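A one-line R sketch, showing that the adjustment bites harder with more predictors and fewer cases:

adj_r2 <- function(R2, N, k) 1 - (1 - R2) * (N - 1) / (N - k - 1)
adj_r2(0.50, N = 20, k = 3)   # 0.41 — modest shrinkage
adj_r2(0.50, N = 10, k = 8)   # -3.5 — the adjustment can go below zero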
             Shrunken R2
• Some authors treat shrunken and
  adjusted R2 as the same thing
  – Others don‟t




                                    286
      (N − 1) / (N − k − 1)

  N = 20, k = 3:   (20 − 1) / (20 − 3 − 1) = 19/16 = 1.1875
  N = 10, k = 3:   (10 − 1) / (10 − 3 − 1) = 9/6 = 1.5
  N = 10, k = 8:   (10 − 1) / (10 − 8 − 1) = 9/1 = 9

                                      287
           Extra Bits

• Some stranger things that can
  happen
  – Counter-intuitive


                                  288
       Suppressor variables
• Can be hard to understand
  – Very counter-intuitive
• Definition
  – An independent variable which increases
    the size of the parameters associated with
    other independent variables above the size
    of their correlations


                                            289
• An example (based on Horst, 1941)
  – Success of trainee pilots
  – Mechanical ability (x1), verbal ability (x2),
    success (y)
• Correlation matrix

               Mech             Verb    Success
       Mech             1        0.5        0.3
        Verb          0.5          1          0
     Success          0.3          0          1

                                                    290
– Mechanical ability correlates 0.3 with
  success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
– (Don‟t look ahead until you have had a
  guess)




                                               291
• Mechanical ability
  – b = 0.4
  – Larger than r!
• Verbal ability
  – b = -0.2
  – Smaller than r!!
• So what is happening?
  – You need verbal ability to do the test
  – Not related to mechanical ability
     • Measure of mechanical ability is contaminated
       by verbal ability

                                                       292
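The suppressor can be verified from the correlation matrix alone, using the B = Rxx⁻¹Rxy result from earlier; a sketch in R:

Rxx <- rbind(c(1, 0.5), c(0.5, 1))   # mech and verbal
Rxy <- c(0.3, 0)                     # correlations with success
solve(Rxx) %*% Rxy                   # 0.4 and -0.2, as above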
• High mech, low verbal
  – High mech
    • This is positive
  – Low verbal
    • Negative, because we are talking about
      standardised scores
    • Your mech is really high – you did well on the
      mechanical test, without being good at the
      words
• High mech, high verbal
  – Well, you had a head start on mech,
    because of verbal, and need to be brought
    down a bit
                                                       293
Another suppressor?
         x1    x2     y
 x1       1    0.5   0.3
 x2      0.5    1    0.2
 y       0.3   0.2    1

  b1 =
  b2 =
                           294
 Another suppressor?
     x1    x2     y
x1    1    0.5   0.3
x2   0.5    1    0.2
y    0.3   0.2    1


b1 =0.26
b2 = -0.06
                       295
     And another?
       x1     x2     y
x1      1     0.5   0.3
x2     0.5     1    -0.2
 y     0.3   -0.2    1

 b1 =
 b2 =
                           296
     And another?
     x1     x2     y
x1    1     0.5   0.3
x2   0.5     1    -0.2
 y   0.3   -0.2    1


b1 = 0.53
b2 = -0.47
                         297
     One more?
      x1     x2     y
x1     1    -0.5   0.3
x2   -0.5     1    0.2
y     0.3    0.2    1

 b1 =
 b2 =
                         298
     One more?
      x1     x2     y
x1     1    -0.5   0.3
x2   -0.5     1    0.2
y     0.3    0.2    1


b1 = 0.53
b2 = 0.47
                         299
• Suppression happens when two opposing
  forces are happening together
   – And have opposite effects
• Don‟t throw away your IVs,
   – Just because they are uncorrelated with the DV
• Be careful in interpretation of regression
  estimates
   – Really need the correlations too, to interpret what
     is going on
   – Cannot compare between studies with different
     IVs


                                                       300
  Standardised Estimates > 1
• Correlations are bounded
                -1.00 ≤ r ≤ +1.00
  – We think of standardised regression
    estimates as being similarly bounded
    • But they are not
  – Can go >1.00, <-1.00
  – R cannot, because that is a proportion of
    variance
                                                301
• Three measures of ability
  – Mechanical ability, verbal ability 1, verbal
    ability 2
  – Score on science exam
             Mech         Verbal1   Verbal2   Scores
     Mech             1       0.1       0.1      0.6
   Verbal1          0.1         1       0.9      0.6
   Verbal2          0.1       0.9         1      0.3
    Scores          0.6       0.6       0.3        1

    –Before reading on, what are the parameter
     estimates?
                                                   302
         Mech                0.56
       Verbal1               1.71
       Verbal2              -1.29
• Mechanical
  – About where we expect
• Verbal 1
  – Very high
• Verbal 2
  – Very low
                                    303
• What is going on
  – It‟s a suppressor again
  – An independent variable which increases
    the size of the parameters associated with
    other independent variables above the size
    of their correlations
• Verbal 1 and verbal 2 are correlated so
  highly
  – They need to cancel each other out



                                             304
         Variable Selection
• What are the appropriate independent
  variables to use in a model?
  – Depends what you are trying to do
• Multiple regression has two separate
  uses
  – Prediction
  – Explanation

                                         305
• Prediction                • Explanation
  – What will happen in       – Why did something
    the future?                 happen?
  – Emphasis on               – Emphasis on
    practical application       understanding
  – Variables selected          phenomena
    (more) empirically        – Variables selected
  – Value free                  theoretically
                              – Not value free




                                                     306
• Visiting the doctor
   – Precedes suicide attempts
   – Predicts suicide
      • Does not explain suicide
• More on causality later on …
• Which are appropriate variables
   – To collect data on?
   – To include in analysis?
   – Decision needs to be based on theoretical knowledge
     of the behaviour of those variables
   – Statistical analysis of those variables (later)
      • Unless you didn‟t collect the data
   – Common sense (not a useful thing to say)
                                                    307
   Variable Entry Techniques
• Entry-wise
  – All variables entered simultaneously
• Hierarchical
  – Variables entered in a predetermined order
• Stepwise
  – Variables entered according to change in
    R2
  – Actually a family of techniques

                                               308
• Entrywise
  – All variables entered simultaneously
  – All treated equally
• Hierarchical
  – Entered in a theoretically determined order
  – Change in R2 is assessed, and tested for
    significance
  – e.g. sex and age
     • Should not be treated equally with other
       variables
     • Sex and age MUST be first
  – Confused with hierarchical linear modelling
                                                  309
• Stepwise
  – Variables entered empirically
  – Variable which increases R2 the most goes
    first
    • Then the next …
  – Variables which have no effect can be
    removed from the equation
• Example
  – IVs: Sex, age, extroversion,
  – DV: Car – how long someone spends
    looking after their car

                                                310
• Correlation Matrix


        SEX           AGE           EXTRO CAR
SEX            1.00         -0.05       0.40  0.66
AGE           -0.05          1.00       0.40  0.23
EXTRO          0.40          0.40       1.00  0.67
CAR            0.66          0.23       0.67  1.00




                                                 311
• Entrywise analysis
  – r2 = 0.64

                   b       p
     SEX          0.49   <0.01
     AGE          0.08    0.46
     EXTRO        0.44   <0.01




                                 312
• Stepwise Analysis
  – Data determines the order
  – Model 1: Extroversion, R2 = 0.450
  – Model 2: Extroversion + Sex, R2 = 0.633


                    b             p
    EXTRO          0.48         <0.01
     SEX           0.47         <0.01


                                              313
• Hierarchical analysis
  – Theory determines the order
  – Model 1: Sex + Age, R2 = 0.510
  – Model 2: S, A + E, R2 = 0.638
  – Change in R2 = 0.128, p = 0.001


     Model 2:  SEX    0.49  <0.01
               AGE    0.08   0.46
               EXTRO  0.44  <0.01


                                         314
• Which is the best model?
  – Entrywise – OK
  – Stepwise – excluded age
     • Did have a (small) effect
  – Hierarchical
     • The change in R2 gives the best estimate of the
       importance of extroversion
• Other problems with stepwise
  – F and df are wrong (cheats with df)
  – Unstable results
     • Small changes (sampling variance) – large
       differences in models

                                                    315
– Uses a lot of paper
– Don‟t use a stepwise procedure to pack
  your suitcase




                                           316
    Is Stepwise Always Evil?
• Yes
• All right, no
• Research goal is predictive (technological)
  – Not explanatory (scientific)
  – What happens, not why
• N is large
  – 40 people per predictor, Cohen, Cohen, Aiken,
    West (2003)
• Cross validation takes place
                                            317
        A quick note on R2
R2 is sometimes regarded as the „fit‟ of a
 regression model
  – Bad idea
• If good fit is required – maximise R2
  – Leads to entering variables which do not
    make theoretical sense



                                               318
Critique of Multiple Regression
• Goertzel (2002)
  – “Myths of murder and multiple regression”
  – Skeptical Inquirer (Paper B1)
• Econometrics and regression are „junk
  science‟
  – Multiple regression models (in US)
  – Used to guide social policy

                                            319
     More Guns, Less Crime
  – (controlling for other factors)
• Lott and Mustard: A 1% increase in gun
  ownership
  – 3.3% decrease in murder rates
• But:
  – More guns in rural Southern US
  – More crime in urban North (crack cocaine
    epidemic at time of data)

                                               320
      Executions Cut Crime
• No difference between crimes in states
  in US with or without death penalty
• Ehrlich (1975) controlled all variables
   that affect crime rates
  – Death penalty had effect in reducing crime
    rate
• No statistical way to decide who‟s right

                                             321
          Legalised Abortion
• Donohue and Levitt (1999)
  – Legalised abortion in 1970‟s cut crime in 1990‟s
• Lott and Whitley (2001)
  – “Legalising abortion decreased murder rates by …
    0.5 to 7 per cent.”
• It‟s impossible to model these data
  – Controlling for other historical events
  – Crack cocaine (again)


                                                       322
            Another Critique
• Berk (2003)
  – Regression analysis: a constructive critique (Sage)
• Three cheers for regression
  – As a descriptive technique
• Two cheers for regression
  – As an inferential technique
• One cheer for regression
  – As a causal analysis

                                                     323
     Is Regression Useless?
• Do regression carefully
  – Don‟t go beyond data which you have a
    strong theoretical understanding of
• Validate models
  – Where possible, validate predictive power
    of models in other areas, times, groups
     • Particularly important with stepwise


                                                324
 Lesson 7: Categorical
Independent Variables




                         325
Introduction




               326
             Introduction
• So far, just looked at continuous
  independent variables
• Also possible to use categorical
  (nominal, qualitative) independent
  variables
  – e.g. Sex; Job; Religion; Region; Type (of
    anything)
• Usually analysed with t-test/ANOVA

                                                327
            Historical Note
• But these (t-test/ANOVA) are special
  cases of regression analysis
  – Aspects of General Linear Models (GLMs)
• So why treat them differently?
  – Fisher‟s fault
  – Computers‟ fault
• Regression, as we have seen, is
  computationally difficult
  – Matrix inversion and multiplication
  – Unfeasible, without a computer
                                              328
• In the special cases where:
     • You have one categorical IV
     • Your IVs are uncorrelated
  – It is much easier to do it by partitioning of
    sums of squares
• These cases
  – Very rare in „applied‟ research
  – Very common in „experimental‟ research
     • Fisher worked at Rothamsted agricultural
       research station
     • Never have problems manipulating wheat, pigs,
       cabbages, etc

                                                  329
• In psychology
  – Led to a split between „experimental‟
    psychologists and „correlational‟
    psychologists
  – Experimental psychologists (until recently)
    would not think in terms of continuous
    variables
• Still (too) common to dichotomise a
  variable
  – Too difficult to analyse it properly
  – Equivalent to discarding 1/3 of your data
                                                330
The Approach




               331
              The Approach
• Recode the nominal variable
  – Into one, or more, variables to represent that
    variable
• Names are slightly confusing
  – Some texts talk of „dummy coding‟ to refer to all
    of these techniques
  – Some (most) refer to „dummy coding‟ to refer to
    one of them
  – Most have more than one name

                                                        332
• If a variable has g possible categories it
  is represented by g-1 variables
• Simplest case:
  – Smokes: Yes or No
  – Variable 1 represents „Yes‟
  – Variable 2 is redundant
     • If it isn‟t yes, it‟s no




                                           333
The Techniques




                 334
• We will examine two coding schemes
  – Dummy coding
     • For two groups
     • For >2 groups
  – Effect coding
     • For >2 groups
• Look at analysis of change
  – Equivalent to ANCOVA
  – Pretest-posttest designs


                                       335
   Dummy Coding – 2 Groups
• Also called simple coding by SPSS
• A categorical variable with two groups
• One group chosen as a reference group
  – The other group is represented in a variable
• e.g. 2 groups: Experimental (Group 1) and
  Control (Group 0)
  – Control is the reference group
  – Dummy variable represents experimental group
     • Call this variable „group1‟



                                                   336
• For variable „group1‟
  – 1 = 'Yes', 0 = 'No'



        Original         New
        Category        Variable
          Exp              1
          Con              0


                                   337
• Some data
• Group is x, score is y

                   Control      Experimental
                   Group           Group
 Experiment 1                10           10
 Experiment 2                10           20
 Experiment 3                10           30




                                           338
• Control Group = 0
  – Intercept = Score on Y when x = 0
  – Intercept = mean of control group
• Experimental Group = 1
  – b = change in Y when x increases 1 unit
  – b = difference between experimental
   group and control group




                                              339
[Line plot: mean scores for the control group and the experimental group in
experiments 1–3; the gradient of the slope represents the difference between
means]
                                                         340
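A minimal R sketch of two-group dummy coding, with hypothetical scores (the data are made up): the intercept recovers the control mean, and b1 the group difference.

score  <- c(10, 12, 8, 18, 22, 20)   # hypothetical data
group1 <- c(0, 0, 0, 1, 1, 1)        # 1 = experimental, 0 = control
coef(lm(score ~ group1))
# intercept = 10 (control mean); group1 = 10 (experimental - control)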
 Dummy Coding – 3+ Groups
• With three groups the approach is similar
• g = 3, therefore g-1 = 2 variables
  needed
• 3 Groups
  – Control
  – Experimental Group 1
  – Experimental Group 2

                                          341
    Original
                       Gp1             Gp2
    Category
      Con                0               0
      Gp1                1               0
      Gp2                0               1

• Recoded into two variables
  – Note – do not need a 3rd variable
    • If we are not in group 1 or group 2 MUST be in
      control group
    • 3rd variable would add no information
    • (What would happen to determinant?)
                                                  342
• F and associated p
  – Tests H0 that


            ḡ1 = ḡ2 = ḡ3
• b1 and b2 and associated p-values
  – Test difference between each experimental
    group and the control group
• To test difference between
  experimental groups
  – Need to rerun analysis

                                           343
• One more complication
  – Have now run multiple comparisons
  – Increases α – i.e. probability of type I error
• Need to correct for this
  – Bonferroni correction
  – Multiply given p-values by two/three
    (depending how many comparisons were
    made)




                                                344
                 Effect Coding
• Usually used for 3+ groups
• Compares each group (except the reference
  group) to the mean of all groups
  – Dummy coding compares each group to the
    reference group.
• Example with 5 groups
  – 1 group selected as reference group
     • Group 5


                                              345
• Each group (except reference) has a
  variable
  – 1 if the individual is in that group
  – 0 if not
  – -1 if in reference group

group   group_1 group_2 group_3 group_4
  1         1       0       0       0
  2         0       1       0       0
  3         0       0       1       0
  4         0       0       0       1
  5        -1      -1      -1      -1
                                           346
             Examples
• Dummy coding and Effect Coding
• Group 1 chosen as reference group
  each time
• Data      Group       Mean       SD
               1          52.40     4.60
               2          56.30     5.70
               3          60.10     5.00
             Total        56.27     5.88
                                      347
• Dummy

        Group   dummy2    dummy3
            1     0         0
            2     1         0
            3     0         1

 • Effect
        Group   Effect2   effect3
            1     -1        -1
            2     1         0
            3     0         1       348
Dummy                              Effect
R = 0.543, F = 5.7,                R = 0.543, F = 5.7,
  df = 2, 27, p = 0.009              df = 2, 27, p = 0.009
b0 = 52.4                          b0 = 56.27
b1 = 3.9, p = 0.100                b1 = 0.03, p = 0.980
b2 = 7.7, p = 0.002                b2 = 3.8, p = 0.007

b0 = ḡ1                            b0 = Ḡ (grand mean)
b1 = ḡ2 − ḡ1                       b1 = ḡ2 − Ḡ
b2 = ḡ3 − ḡ1                       b2 = ḡ3 − Ḡ        349
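A sketch of both coding schemes in R, with hypothetical data simulated to have roughly the group means above:

set.seed(1)
g     <- factor(rep(1:3, each = 10))
score <- c(rnorm(10, 52.4, 4.6), rnorm(10, 56.3, 5.7), rnorm(10, 60.1, 5.0))

coef(lm(score ~ g))           # R's default 'treatment' contrasts = dummy coding
                              # b0 ~ mean of group 1; b's = group j - group 1

contrasts(g) <- contr.sum(3)  # 'sum' contrasts = effect coding
coef(lm(score ~ g))           # b0 ~ grand mean; b's = group means - grand mean
                              # (note: contr.sum takes the LAST group as reference)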
                   In SPSS
• SPSS provides two equivalent procedures for
  regression
  – Regression (which we have been using)
  – GLM (which we haven‟t)
• GLM will:
  – Automatically code categorical variables
  – Automatically calculate interaction terms
• GLM won‟t:
  – Give standardised effects
  – Give hierarchical R2 p-values
  – Allow you to not understand

                                                350
ANCOVA and Regression




                        351
• Test
  – (Which is a trick; but it‟s designed to make
    you think about it)
• Use employee data.sav
  – Compare the pay rise (difference between
    salbegin and salary)
  – For ethnic minority and non-minority staff
     • What do you find?
                                              352
    ANCOVA and Regression
• Dummy coding approach has one special use
  – In ANCOVA, for the analysis of change
• Pre-test post-test experimental design
  – Control group and (one or more) experimental
    groups
  – Tempting to use difference score + t-test / mixed
    design ANOVA
  – Inappropriate


                                                    353
• Salivary cortisol levels
  – Used as a measure of stress
  – Not absolute level, but change in level over
    day may be interesting
• Test at: 9.00am, 9.00pm
• Two groups
  – High stress group (cancer biopsy)
     • Group 1
  – Low stress group (no biopsy)
     • Group 0


                                              354
              AM         PM       Diff
High Stress   20.1        6.8     13.3
Low Stress    22.3       11.8     10.5


• Correlation of AM and PM = 0.493
  (p=0.008)
• Has there been a significant difference
  in the rate of change of salivary
  cortisol?
  – 3 different approaches
                                            355
• Approach 1 – find the differences, do a
  t-test
  – t = 1.31, df=26, p=0.203
• Approach 2 – mixed ANOVA, look for
  interaction effect
  – F = 1.71, df = 1, 26, p = 0.203
  – F = t2
• Approach 3 – regression (ANCOVA)
  based approach

                                        356
  – IVs: AM and group
  – DV: PM
  – b1 (group) = 3.59, standardised b1=0.432,
    p = 0.01
• Why is the regression approach better?
  – The other two approaches took the
    difference
  – Assumes that r = 1.00
  – Any difference from r = 1.00 and you add
    error variance
    • Subtracting error is the same as adding error

                                                      357
• Using regression
  – Ensures that all the variance that is
    subtracted is true
  – Reduces the error variance
• Two effects
  – Adjusts the means
     • Compensates for differences between groups
  – Removes error variance




                                                    358
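A sketch of the difference-score and ANCOVA approaches in R, with hypothetical cortisol data (the generating model, sample size and effect sizes are invented for illustration):

set.seed(1)
group <- rep(c(1, 0), each = 14)        # 1 = high stress, 0 = low
am    <- rnorm(28, mean = 21, sd = 3)   # 9.00am cortisol
pm    <- 9 + 0.5 * am - 3.5 * group + rnorm(28, sd = 2)

t.test((pm - am) ~ group)        # approach 1: difference scores
summary(lm(pm ~ am + group))     # approach 3: ANCOVA - adjust PM for AM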
                In SPSS
• SPSS automates all of this
  – But you have to understand it, to know
    what it is doing
• Use Analyse, GLM, Univariate ANOVA




                                             359
[Screenshot: SPSS GLM Univariate dialog — the outcome goes in 'Dependent
Variable', categorical predictors in 'Fixed Factor(s)', continuous predictors
in 'Covariate(s)'; then click Options]
                                   360
[Screenshot: SPSS Options dialog — select 'Parameter estimates']
               361
           More on Change
• If difference score is correlated with
  either pre-test or post-test
  – Subtraction fails to remove the difference
    between the scores
  – If two scores are uncorrelated
     • Difference will be correlated with both
     • Failure to control
  – Equal SDs, r = 0
      • Correlation of change and pre-score = −0.707

                                                    362
      Even More on Change
• A topic of surprising complexity
  – What I said about difference scores isn‟t
    always true
     • Lord‟s paradox – it depends on the precise
       question you want to answer
  – Collins and Horn (1993). Best methods for
    the analysis of change
  – Collins and Sayer (2001). New methods for
    the analysis of change.

                                                    363
Lesson 8: Assumptions in
   Regression Analysis




                           364
          The Assumptions
1. The distribution of residuals is normal (at
   each value of the dependent variable).
2. The variance of the residuals for every set
   of values for the independent variable is
   equal.
     • violation is called heteroscedasticity.
3. The error term is additive
  •   no interactions.
4. At every value of the dependent variable
   the expected (mean) value of the residuals
   is zero
  •   No non-linear relationships                365
5. The expected correlation between residuals,
   for any two cases, is 0.
  •   The independence assumption (lack of
      autocorrelation)
6. All independent variables are uncorrelated
   with the error term.
7. No independent variables are a perfect
   linear function of other independent
   variables (no perfect multicollinearity)
8. The mean of the error term is zero.



                                                366
  What are we going to do …
• Deal with some of these assumptions in
  some detail
• Deal with others in passing only
  – look at them again later on




                                      367
     Assumption 1: The
 Distribution of Residuals is
Normal at Every Value of the
    Dependent Variable



                            368
 Look at Normal Distributions
• A normal distribution
  – symmetrical, bell-shaped (so they say)




                                             369
         What can go wrong?
• Skew
  – non-symmetricality
  – one tail longer than the other
• Kurtosis
  – too flat or too peaked
  – kurtosed
• Outliers
  – Individual cases which are far from the
    distribution
                                              370
         Effects on the Mean
• Skew
  – biases the mean, in direction of skew
• Kurtosis
  – mean not biased
  – standard deviation is
  – and hence standard errors, and
    significance tests



                                            371
       Examining Univariate
          Distributions
•   Histograms
•   Boxplots
•   P-P plots
•   Calculation based methods



                                372
               Histograms
• A and B
[Histograms of distributions A and B]
                            373
     • C and D
[Histograms of distributions C and D]
                      374
• E and F
[Histograms of distributions E and F]
            375
    Histograms can be tricky ….
[Six histograms — the apparent shape of a distribution depends on how the
bins are drawn]
                                  376
Boxplots
[Boxplots of the distributions]
           377
P-P Plots
• A & B
[P-P plots: observed against expected cumulative proportion, 0 to 1 on both axes]
                                                                 378
• C & D
[P-P plots of C and D]
                                                               379
• E & F
[P-P plots of E and F]
                                                                380
         Calculation Based
• Skew and Kurtosis statistics
• Outlier detection statistics




                                 381
  Skew and Kurtosis Statistics
• Normal distribution
  – skew = 0
  – kurtosis = 0
• Two methods for calculation
  – Fisher‟s and Pearson‟s
  – Very similar answers
• Associated standard error
  – can be used for significance of departure from
    normality
  – not actually very useful
     • Never normal above N = 400                    382
    Skewness SE Skew Kurtosis SE Kurt

A       -0.12   0.172   -0.084   0.342
B       0.271   0.172    0.265   0.342
C       0.454   0.172    1.885   0.342
D       0.117   0.172   -1.081   0.342
E       2.106   0.172     5.75   0.342
F       0.171   0.172    -0.21   0.342




                                         383
           Outlier Detection
• Calculate distance from mean
  – z-score (number of standard deviations)
  – deleted z-score
     • that case biased the mean, so remove it
  – Look up expected distance from mean
     • 1% 3+ SDs
• Calculate influence
  – how much effect did that case have on the mean?


                                                 384
Non-Normality in Regression




                          385
   Effects on OLS Estimates
• The mean is an OLS estimate
• The regression line is an OLS estimate
• Lack of normality
  – biases the position of the regression slope
  – makes the standard errors wrong
     • probability values attached to statistical
       significance wrong



                                                    386
       Checks on Normality
• Check residuals are normally distributed
  – SPSS will draw histogram and p-p plot of
    residuals
• Use regression diagnostics
  – Lots of them
  – Most aren‟t very interesting



                                               387
      Regression Diagnostics
• Residuals
   – standardised, unstandardised, studentised,
     deleted, studentised-deleted
   – look for cases > |3| (?)
• Influence statistics
   – Look for the effect a case has
   – If we remove that case, do we get a different
     answer?
   – DFBeta, Standardised DFBeta
      • changes in b

                                                     388
  – DfFit, Standardised DfFit
    • change in predicted value
  – Covariance ratio
    • Ratio of the determinants of the covariance
      matrices, with and without the case
• Distances
  – measures of „distance‟ from the centroid
  – some include IV, some don‟t




                                                    389
         More on Residuals
• Residuals are trickier than you might
  have imagined
• Raw residuals
  – OK
• Standardised residuals
  – Residuals divided by SD

      se = √( Σe² / (n − k − 1) )
                                           390
                 Leverage
• But
  – That SD is wrong
  – Variance of the residuals is not equal
     • Those further from the centroid on the
       predictors have higher variance
     • Need a measure of this
• Distance from the centroid is leverage,
  or h (or sometimes hii)
• One predictor
  – Easy
                                                391
      hi = 1/n + (xi − x̄)² / Σ(x − x̄)²
• Minimum hi is 1/n, the maximum is 1
• Except
  – SPSS uses standardised leverage - h*
    • It doesn‟t tell you this, it just uses it




                                                  392
      h*i = hi − 1/n

      h*i = (xi − x̄)² / Σ(x − x̄)²
• Minimum 0, maximum (N − 1)/N




                               393
• Multiple predictors
  – Calculate the hat matrix (H)
  – Leverage values are the diagonals of this
    matrix
                               1
        H  X(X' X) X'
  – Where X is the augmented matrix of
    predictors (i.e. matrix that includes the
    constant)
  – Hence leverage hii – element ii of H
                                                394
     • Example of calculation of the hat matrix, with one
       predictor taking values 15, 20, …, 65:

      X = | 1 15 |
          | 1 20 |         H = X(X′X)⁻¹X′ = | 0.318 0.273 …       |
          | ⋮  ⋮  |                          | 0.273 0.236 …       |
          | 1 65 |                          |   ⋮     ⋮    … 0.318 |
                                                                  395
  Standardised / Studentised
• Now we can calculate the standardised
  residuals
  – SPSS calls them studentised residuals
  – Also called internally studentised residuals

      e′i = ei / ( se √(1 − hi) )
                                               396
Deleted Studentised Residuals
• Studentised residuals do not have a
  known distribution
  – Cannot use them for inference
• Deleted studentised residuals
  – Externally studentised residuals
  – Jackknifed residuals
      • Distributed as t
      • With df = N − k − 2
                                        397
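All of these diagnostics are one-liners in R; a sketch using the books/attend model from earlier:

books  <- c(0, 1, 0, 2, 4, 4, 1, 4, 3, 0)
attend <- c(9, 5, 10, 16, 10, 20, 11, 20, 15, 15)
grade  <- c(45, 57, 45, 51, 65, 88, 44, 87, 89, 59)
fit <- lm(grade ~ books + attend)

hatvalues(fit)   # leverage h (diagonal of the hat matrix)
rstandard(fit)   # internally studentised residuals
rstudent(fit)    # deleted (externally) studentised residuals, t-distributed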
        Testing Significance
• We can calculate the probability of a
  residual
  – Is it sampled from the same population
• BUT
  – Massive type I error rate
  – Bonferroni correct it
     • Multiply p value by N


                                             398
        Bivariate Normality
• We didn‟t just say “residuals normally
  distributed”
• We said “at every value of the
  dependent variables”
• Two variables can be normally
  distributed – univariate,
  – but not bivariate

                                           399
      • Couple's IQs
          – male and female
[Histograms of FEMALE and MALE IQs, 60–140]
          – Seem reasonably normal
                                                                                                                                            400
    • But wait!!
[Scatterplot of MALE against FEMALE IQ, 40–160 on both axes]
                                                    401
• When we look at bivariate normality
  – not normal – there is an outlier
• So plot X against Y
• OK for bivariate
  – but – may be a multivariate outlier
  – Need to draw graph in 3+ dimensions
  – can‟t draw a graph in 3 dimensions
• But we can look at the residuals instead
  …

                                          402
• IQ histogram of residuals
[Histogram of the residuals]
                              403
     Multivariate Outliers …
• Will be explored later in the exercises

• So we move on …




                                            404
    What to do about Non-
          Normality
• Skew and Kurtosis
  – Skew – much easier to deal with
  – Kurtosis – less serious anyway
• Transform data
  – removes skew
  – positive skew – log transform
  – negative skew - square

                                      405
             Transformation
• May need to transform IV and/or DV
  – More often DV
     • time, income, symptoms (e.g. depression) all positively
       skewed
  – can cause non-linear effects (more later) if only
    one is transformed
  – alters interpretation of unstandardised parameter
  – May alter meaning of variable
  – May add / remove non-linear and moderator
    effects
                                                             406
• Change measures
  – increase sensitivity at ranges
     • avoiding floor and ceiling effects
• Outliers
  – Can be tricky
  – Why did the outlier occur?
     • Error? Delete them.
     • Weird person? Probably delete them
     • Normal person? Tricky.




                                            407
  – You are trying to model a process
    • is the data point „outside‟ the process
    • e.g. lottery winners, when looking at salary
    • yawn, when looking at reaction time


  – Which is better?
    • A good model, which explains 99% of your
      data?
    • A poor model, which explains all of it
• Pedhazur and Schmelkin (1991)
  – analyse the data twice

                                                     408
• We will spend much less time on the
  other 6 assumptions
• Can do exercise 8.1.




                                        409
Assumption 2: The variance of
  the residuals for every set of
   values for the independent
        variable is equal.



                             410
        Heteroscedasticity
• This assumption is a about
  heteroscedasticity of the residuals
  – Hetero=different
  – Scedastic = scattered
• We don‟t want heteroscedasticity
  – we want our data to be homoscedastic
• Draw a scatterplot to investigate

                                           411
[Scatterplot of MALE against FEMALE IQ]
                                              412
• Scatterplot only works with one IV
  – would need every combination of IVs
• Easy to get around – the predicted values combine all the IVs
• Plot predicted values against residuals
  – or   standardised residuals
  – or   deleted residuals
  – or   standardised deleted residuals
  – or   studentised residuals
• A bit like turning the scatterplot on its
  side
                                              413
     Good – no heteroscedasticity
[Residual plot: residuals against Predicted Value – an even spread across the range]
                                    414
         Bad – heteroscedasticity
[Residual plot: residuals against Predicted Value – the spread changes across the range]
                                    415
    Testing Heteroscedasticity
•    White‟s test
    –    Not automatic in SPSS (is in SAS)
    –    Luckily, not hard to do
    1.   Do regression, save residuals.
    2.   Square residuals
    3.   Square IVs
    4.   Calculate interactions of IVs
         – e.g. x1•x2, x1•x3, x2 • x3
                                             416
    5. Run regression using
        – squared residuals as DV
        – IVs, squared IVs, and interactions as IVs
    6. Test statistic = N × R²
        – Distributed as χ²
        – df = k (the number of predictors in the second regression)
•   Use education and salbegin to predict salary (employee data.sav)
    –   R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001


                                                      417
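• The six steps, sketched by hand in R (the data frame dat and variables salary, educ and salbegin are assumptions echoing the employee-data example):

    # Step 1: do the regression, save the residuals
    fit  <- lm(salary ~ educ + salbegin, data = dat)
    # Step 2: square the residuals
    res2 <- residuals(fit)^2
    # Steps 3-5: regress squared residuals on IVs, squared IVs, interactions
    aux  <- lm(res2 ~ educ + salbegin + I(educ^2) + I(salbegin^2) +
                 I(educ * salbegin), data = dat)
    # Step 6: test statistic = N x R^2, distributed as chi-squared, df = k
    stat <- nobs(aux) * summary(aux)$r.squared
    df   <- length(coef(aux)) - 1
    pchisq(stat, df = df, lower.tail = FALSE)   # p-value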
        Plot of Pred and Res
[Scatterplot: residuals (−4 to 8) against Regression Standardized Predicted Value (−2 to 8)]


                                                       418
              Magnitude of
            Heteroscedasticity
• Chop data into “slices”
  – 5 slices, based on X (or predicted score)
     • Done in SPSS
  – Calculate variance of each slice
  – Check ratio of smallest to largest
  – Less than 10:1
     • OK


                                                419
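• The same check sketched in R rather than the Visual Bander (assumes fit is the lm object from the earlier sketch):

    # Slice the predicted scores into 5 bands and compare residual variances
    band <- cut(fitted(fit), breaks = 5)
    vars <- tapply(residuals(fit), band, var)
    vars                                                # variance of each slice
    max(vars, na.rm = TRUE) / min(vars, na.rm = TRUE)   # ratio: < 10 is OK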
         The Visual Bander
• New in SPSS 12




                             420
• Variances of the 5 groups

     Group   Variance
       1       .219
       2       .336
       3       .757
       4       .751
       5      3.119

• We have a problem
  – 3.119 / 0.219 ≈ 14, well over the 10:1 rule
                                      421
            Dealing with
          Heteroscedasticity
•   Use Huber-White estimates
    – Very easy in Stata
    – Fiddly in SPSS – bit of a hack
•   Use Complex samples
    1. Create a new variable where all cases are
       equal to 1, call it const
    2. Use Complex Samples, Prepare for
       Analysis
    3. Create a plan file

                                              422
4.   Sample weight is const
5.   Finish
6.   Use Complex Samples, GLM
7.   Use plan file created, and set up
     model as in GLM
     (More on complex samples later)

In Stata, do regression as normal, and
   click “robust”.

                                         423
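• In R the same Huber-White estimates take two lines, via the sandwich and lmtest packages (assuming they are installed; variable names are the hypothetical ones from earlier):

    library(sandwich)
    library(lmtest)
    fit <- lm(salary ~ educ + salbegin, data = dat)
    # HC1 matches Stata's "robust" option
    coeftest(fit, vcov = vcovHC(fit, type = "HC1"))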
    Heteroscedasticity –
  Implications and Meanings
Implications
• What happens as a result of
  heteroscedasticity?
  – Parameter estimates are correct
    • not biased
  – Standard errors (hence p-values) are
    incorrect

                                           424
               However …
• If there is no skew in predicted scores
  – P-values a tiny bit wrong
• If skewed,
  – P-values very wrong
• Can do exercise



                                            425
Meaning
• What is heteroscedasticity trying to tell
  us?
  – Our model is wrong – it is misspecified
  – Something important is happening that we
    have not accounted for
• e.g. amount of money given to charity
  (given)
  – depends on:
     • earnings
     • degree of importance person assigns to the
       charity (import)
                                                    426
• Do the regression analysis
  – R2 = 0.60, F=31.4, df=2, 37, p < 0.001
     • seems quite good
  – b0 = 0.24, p=0.97
  – b1 = 0.71, p < 0.001
  – b2 = 0.23, p = 0.031
• White's test
  – χ² = 18.6, df = 5, p = 0.002
• The plot of predicted values against
  residuals …
                                             427
• Plot shows heteroscedastic relationship
                                        428
• Which means …
  – the effects of the variables are not additive
  – If you think that what a charity does is
    important
     • you might give more money
     • how much more depends on how much money
       you have




                                               429
[Scatterplot: GIVEN (10–70) against IMPORT (4–16), with separate fit lines for High and Low Earnings]
                                                         430
• One more thing about heteroscedasticity
  – homoscedasticity is the regression equivalent of the homogeneity of variance assumption in ANOVA/t-tests




                                             431
Assumption 3: The Error Term
         is Additive




                           432
                  Additivity
• What heteroscedasticity shows you
  – effects of variables need to be additive
• Heteroscedasticity doesn‟t always show it to
  you
  – can test for it, but hard work
  – (same as homogeneity of covariance assumption
    in ANCOVA)
• Have to know it from your theory
• A specification error

                                                 433
      Additivity and Theory
• Two IVs
  – Alcohol has sedative effect
     • A bit makes you a bit tired
     • A lot makes you very tired
  – Some painkillers have sedative effect
     • A bit makes you a bit tired
     • A lot makes you very tired
  – A bit of alcohol and a bit of painkiller together don't just make you a bit more tired – they make you very tired
  – Effects multiply together, don't add together
                                               434
• If you don‟t test for it
  – It‟s very hard to know that it will happen
• So many possible non-additive effects
  – Cannot test for all of them
  – Can test for obvious
• In medicine
  – Choose to test for salient non-additive
    effects
  – e.g. sex, race


                                                 435
Assumption 4: At every value of
   the independent variables the
   expected (mean) value of the
         residuals is zero




                              436
                     Linearity
• Relationships between variables should be
  linear
  – best represented by a straight line
• Not a very common problem in social
  sciences
  – except economics
  – measures are not sufficiently accurate to make a
    difference
     • R2 too low
     • unlike, say, physics


                                                   437
• Relationship between speed of travel
  and fuel used
[Plot: Fuel (y) against Speed (x) – a smooth curve]
                                         438
• R2 = 0.938
  – looks pretty good
  – know speed, make a good prediction of
    fuel
• BUT
  – look at the chart
  – if we know speed we can make a perfect
    prediction of fuel used
  – R2 should be 1.00


                                             439
     Detecting Non-Linearity
• Residual plot
  – just like heteroscedasticity
• Using this example
  – very, very obvious
  – usually pretty obvious




                                   440
Residual plot
[Residual plot: a clearly curved pattern]
                441
 Linearity: A Case of Additivity
• Linearity = additivity along the range of the
  IV
• Jeremy rides his bicycle harder
  – Increase in speed depends on current speed
  – Not additive, multiplicative
  – MacCallum and Mar (1995). Distinguishing
    between moderator and quadratic effects in
    multiple regression. Psychological Bulletin.


                                                   442
  Assumption 5: The expected
correlation between residuals, for
       any two cases, is 0.

   The independence assumption (lack of
             autocorrelation)


                                          443
   Independence Assumption
• Also: lack of autocorrelation
• Tricky one
   – often ignored
   – exists for almost all tests
• All cases should be independent of one
  another
   – knowing the value of one case should not tell you
     anything about the value of other cases


                                                     444
        How is it Detected?
• Can be difficult
  – need some clever statistics (multilevel
    models)
• Better off avoiding situations where it
  arises
• Residual Plots
• Durbin-Watson Test

                                              445
              Residual Plots
• Were data collected in time order?
  – If so plot ID number against the residuals
  – Look for any pattern
     • Test for linear relationship
     • Non-linear relationship
     • Heteroscedasticity




                                             446
[Plot: Residual (−2 to 2) against Participant Number (0 to 40)]


                                                        447
          How does it arise?
Two main ways
• time-series analyses
  – When cases are time periods
     • weather on Tuesday and weather on Wednesday
       correlated
     • inflation 1972, inflation 1973 are correlated
• clusters of cases
  – patients treated by three doctors
  – children from different classes
  – people assessed in groups

                                                       448
       Why does it matter?
• Standard errors can be wrong
  – therefore significance tests can be wrong
• Parameter estimates can be wrong
  – really, really wrong
  – from positive to negative
• An example
  – students do an exam (on statistics)
  – choose one of three questions
    • IV: time
    • DV: grade

                                                449
• Result, with line of best fit
[Scatterplot: Grade (10–90) against Time (10–70), with a rising line of best fit]
                                                         450
• Result shows that
  – people who spent longer in the exam,
    achieve better grades
• BUT …
  – we haven‟t considered which question
    people answered
  – we might have violated the independence
    assumption
    • DV will be autocorrelated
• Look again
  – with questions marked
                                           451
• Now somewhat different
[Scatterplot: Grade against Time, cases marked by Question (1, 2, 3) – within each question the trend is downward]
                                                             452
• Now, people that spent longer got
  lower grades
  – questions differed in difficulty
  – do a hard one, get better grade
  – if you can do it, you can do it quickly
• Very difficult to analyse well
  – need multilevel models




                                              453
        Durbin Watson Test
• Not well implemented in SPSS
• Depends on the order of the data
  – Reorder the data, get a different result
• Doesn‟t give statistical significance of
  the test



                                               454
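• For comparison, R's lmtest package runs the test and gives a p-value, though the order-dependence caveat still applies:

    library(lmtest)
    dwtest(fit)   # Durbin-Watson statistic plus a p-value, for an lm object `fit`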
Assumption 6: All independent
    variables are uncorrelated
       with the error term.




                                 455
  Uncorrelated with the Error
             Term

• A curious assumption
  – by definition, the residuals are uncorrelated
    with the independent variables (try it and
    see, if you like)
• It is really about the DV
  – when the effects of the IVs have been removed
  – the DV must have no influence back on the IVs
                                               456
• Problem in economics
  – Demand increases supply
  – Supply increases wages
  – Higher wages increase demand
• OLS estimates will be (badly) biased in
  this case
  – need a different estimation procedure
  – two-stage least squares
     • simultaneous equation modelling



                                            457
Assumption 7: No independent
   variables are a perfect linear
  function of other independent
             variables

     no perfect multicollinearity



                                    458
   No Perfect Multicollinearity
• IVs must not be linear functions of one
  another
  – matrix of correlations of IVs is not positive definite
  – cannot be inverted
  – analysis cannot proceed
• Have seen this with
  – age, age start, time working
  – also occurs with subscale and total


                                                        459
• Large amounts of collinearity
  – a problem (as we shall see) sometimes
  – not an assumption




                                            460
Assumption 8: The mean of the
        error term is zero.


       You will like this one.




                                 461
 Mean of the Error Term = 0
• Mean of the residuals = 0
• That is what the constant is for
  – if the mean of the error term deviates from
    zero, the constant soaks it up

       Y = β0 + β1x1 + ε
       Y = (β0 + 3) + β1x1 + (ε − 3)
- note, Greek letters because we are
  talking about population values
                                             462
• Can do regression without the constant
  – Usually a bad idea
  – E.g R2 = 0.995, p < 0.001
    • Looks good




                                       463
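• A sketch of why R² without the constant misleads, using simulated data in which x has no real effect (everything here is made up for illustration):

    set.seed(1)
    x <- runif(50, 6, 13)
    y <- 10 + rnorm(50)              # flat: no relationship with x
    summary(lm(y ~ x))$r.squared     # near 0, as it should be
    summary(lm(y ~ 0 + x))$r.squared # near 1: without the constant,
                                     # R^2 is measured around 0, not the mean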
[Scatterplot: y (7–13) against x1 (6–13)]




                                                  464
 Lesson 9: Issues in
 Regression Analysis

      Things that alter the
interpretation of the regression
            equation

                                   466
           The Four Issues
•   Causality
•   Sample sizes
•   Collinearity
•   Measurement error




                             467
Causality




            468
          What is a Cause?
• Debate about definition of cause
  – some statistics (and philosophy) books try
    to avoid it completely
  – We are not going into depth
     • just going to show why it is hard
• Two dimensions of cause
  – Ultimate versus proximal cause
  – Determinate versus probabilistic
                                             469
Proximal versus Ultimate
• Why am I here?
  – I walked here because
  – This is the location of the class because
  – Eric Tanenbaum asked me because
  – (I don‟t know)
  – because I was in my office when he rang
   because
  – I am a lecturer at York because
  – I saw an advert in the paper because

                                                470
  – I exist because
  – My parents met because
  – My father had a job …

• Proximal cause
  – the direct and immediate cause of
    something
• Ultimate cause
  – the thing that started the process off
  – I fell off my bicycle because of the bump
  – I fell off because I was going too fast

                                                471
Determinate versus Probabilistic
  Cause
• Why did I fall off my bicycle?
  – I was going too fast
  – But every time I ride too fast, I don‟t fall
    off
  – Probabilistic cause
• Why did my tyre go flat?
  – A nail was stuck in my tyre
  – Every time a nail sticks in my tyre, the tyre
    goes flat
  – Deterministic cause
                                                   472
• Can get into trouble by mixing them
  together
  – Eating deep fried Mars Bars and doing no
    exercise are causes of heart disease
  – “My Grandad ate three deep fried Mars
    Bars every day, and the most exercise he
    ever got was when he walked to the shop
    next door to buy one”
  – (Deliberately?) confusing deterministic and
    probabilistic causes



                                             473
      Criteria for Causation
• Association
• Direction of Influence
• Isolation




                               474
                  Association
• Correlation does not mean causation
  – we all know
• But
  – Causation does mean correlation
• Need to show that two things are related
  – may be correlation
  – my be regression when controlling for third (or
    more) factor


                                                      475
• Relationship between price and sales
  – suppliers may be cunning
  – when people want it more
     • stick the price up

                 Price   Demand   Sales
       Price       1       0.6      0
       Demand     0.6       1      0.6
       Sales       0       0.6      1


 – So – no relationship between price
   and sales
                                             476
  – Until (or course) we control for demand
  – b1 (Price) = -0.56
  – b2 (Demand) = 0.94
• But which variables do we enter?




                                              477
      Direction of Influence
• Relationship between A and B
  – three possible processes

          A  -->  B        A causes B

          A  <--  B        B causes A

          A       B        C causes A & B
           ^     ^
            \   /
              C
                                                     478
 • How do we establish the direction of
   influence?
   – Longitudinally?


       Barometer Drops  -->  Storm

  – Now if we could just get that barometer
    needle to stay where it is …

• Where the role of theory comes in
  (more on this later)
                                              479
                  Isolation
• Isolate the dependent variable from all
  other influences
  – as experimenters try to do
• Cannot do this
  – can statistically isolate the effect
  – using multiple regression



                                           480
            Role of Theory
• Strong theory is crucial to making
  causal statements
• Fisher said: to make causal statements
  “make your theories elaborate.”
  – don‟t rely purely on statistical analysis
• Need strong theory to guide analyses
  – what critics of non-experimental research
    don‟t understand

                                                481
• S.J. Gould – a critic
  – says correlate price of petrol and his age,
    for the last 10 years
  – find a correlation
  – Ha! (He says) that doesn‟t mean there is a
    causal link
  – Of course not! (We say).
     • No social scientist would do that analysis
       without first thinking (very hard) about the
       possible causal relations between the variables
       of interest
     • Would control for time, prices, etc …

                                                    482
• Atkinson, et al. (1996)
  – relationship between college grades and
    number of hours worked
  – negative correlation
  – Need to control for other variables –
    ability, intelligence
• Gould says “Most correlations are non-
  causal” (1982, p243)
  – Of course!!!!



                                              483
    I drink a lot of beer  -->  16 causal relations:
        laugh, toilet, jokes (about statistics), vomit, karaoke,
        curtains closed, sleeping, headache, equations (beermat),
        thirsty, fried breakfast, no beer, curry, chips,
        falling over, lose keys

    – plus 120 non-causal correlations among those effects
                                            484
• Abelson (1995) elaborates on this
  – „method of signatures‟
• A collection of correlations relating to
  the process
  – the „signature‟ of the process
• e.g. tobacco smoking and lung cancer
  – can we account for all of these findings
    with any other theory?




                                               485
1.   The longer a person has smoked cigarettes, the
     greater the risk of cancer.
2.   The more cigarettes a person smokes over a given
     time period, the greater the risk of cancer.
3.   People who stop smoking have lower cancer rates
     than do those who keep smoking.
4.   Smoker‟s cancers tend to occur in the lungs, and be of
     a particular type.
5.   Smokers have elevated rates of other diseases.
6.   People who smoke cigars or pipes, and do not usually
     inhale, have abnormally high rates of lip cancer.
7.   Smokers of filter-tipped cigarettes have lower cancer
     rates than other cigarette smokers.
8.   Non-smokers who live with smokers have elevated
     cancer rates.
                            (Abelson, 1995: 183-184)
                                                      486
  – In addition, should be no anomalous
    correlations
     • If smokers had more fallen arches than non-
       smokers, not consistent with theory
• Failure to use theory to select
  appropriate variables
  – specification error
  – e.g. in previous example
  – Predict wealth from price and sales
     • model says: increase price, wealth increases
     • increase sales, wealth increases

                                                     487
• Sometimes these are indicators of the
  process
  – e.g. barometer – stopping the needle won‟t
    help
  – e.g. inflation? Indicator or cause?




                                            488
      No Causation without
        Experimentation
• Blatantly untrue
  – I don‟t doubt that the sun shining makes
    us warm
• Why the aversion?
  – Pearl (2000) says problem is no
    mathematical operator
  – No one realised that you needed one
  – Until you build a robot

                                               489
          AI and Causality
• A robot needs to make judgements
  about causality
• Needs to have a mathematical
  representation of causality
  – Suddenly, a problem!
  – Doesn‟t exist
    • Most operators are non-directional
    • Causality is directional
                                           490
       Sample Sizes

“How many subjects does it take
 to run a regression analysis?”



                                  491
               Introduction
• Social scientists don‟t worry enough about the
  sample size required
  – “Why didn‟t you get a significant result?”
  – “I didn‟t have a large enough sample”
     • Not a common answer
• More recently awareness of sample size is
  increasing
  – use too few – no point doing the research
  – use too many – waste their time
                                                 492
• Research funding bodies
• Ethical review panels
  – both become more interested in sample
    size calculations
• We will look at two approaches
  – Rules of thumb (quite quickly)
  – Power Analysis (more slowly)




                                            493
            Rules of Thumb
• Lots of simple rules of thumb exist
  – 10 cases per IV
  – >100 cases
  – Green (1991) more sophisticated
     • To test significance of R2 – N = 50 + 8k
     • To test sig of slopes, N = 104 + k
• Rules of thumb don‟t take into account
  all the information that we have
  – Power analysis does

                                                  494
            Power Analysis
Introducing Power Analysis
• Hypothesis test
  – tells us the probability of a result of that
    magnitude occurring, if the null hypothesis
    is correct (i.e. there is no effect in the
    population)
• Doesn‟t tell us
  – the probability of that result, if the null
    hypothesis is false
                                                  495
• According to Cohen (1982) all null
  hypotheses are false
  – everything that might have an effect, does
    have an effect
     • it is just that the effect is often very tiny




                                                       496
Type I Errors
• Type I error is false rejection of H0
• Probability of making a type I error
  – α – the significance value cut-off
     • usually 0.05 (by convention)
• Always this value
• Not affected by
  – sample size
  – type of test


                                          497
Type II errors
• Type II error is false acceptance of the
  null hypothesis
  – Much, much trickier
• We think we have some idea
  – we almost certainly don‟t
• Example
  – I do an experiment (random sampling, all
    assumptions perfectly satisfied)
  – I find p = 0.05

                                               498
  – You repeat the experiment exactly
    • different random sample from same population
  – What is probability you will find p < 0.05?
  – ………………
  – Another experiment, I find p = 0.01
  – Probability you find p < 0.05?
  – ………………
• Very hard to work out
  – not intuitive
  – need to understand non-central sampling
    distributions (more in a minute)
                                                499
• Probability of type II error = beta ()
  – same as population regression parameter
    (to be confusing)
• Power = 1 – Beta
  – Probability of getting a significant result




                                                  500
                                    State of the World

                              H0 true               H0 false
                          (no effect to be       (effect to be
                               found)                found)

  Research   We find no effect        correct         Type II error
  Findings   (p > 0.05)                               p = β

             We find an effect     Type I error       correct
             (p < 0.05)            p = α              power = 1 – β
                                                          501
• Four parameters in power analysis
  – α – prob. of Type I error
  – β – prob. of Type II error (power = 1 – β)
  – Effect size – size of effect in population
  – N
• Know any three, can calculate the
  fourth
  – Look at them one at a time




                                            502
•   α – Probability of Type I error
    – Usually set to 0.05
    – Somewhat arbitrary
      • sometimes adjusted because of circumstances
         – rarely because of power analysis
    – May want to adjust it, based on power
      analysis




                                                  503
• β – Probability of type II error
  – Power (probability of finding a result) = 1 – β
  – Standard is 80%
     • Some argue for 90%
  – Implication that Type I error is 4 times
    more serious than type II error
     • adjust ratio with compromise power analysis




                                                     504
•   Effect size in the population
    – Most problematic to determine
    – Three ways
    1. What effect size would be useful to find?
      •   R2 = 0.01 - no use (probably)
    2. Base it on previous research
      – what have other people found?
    3. Use Cohen‟s conventions
      – small R2 = 0.02
      – medium R2 = 0.13
      – large R2 = 0.26

                                               505
– Effect size usually measured as f²
– For R²:

        f² = R² / (1 − R²)




                                       506
– For (standardised) slopes:

        f² = sr² / (1 − R²)

– Where sr² is the contribution to the variance accounted for by the variable of interest
– i.e. sr² = R² (with variable) – R² (without)
   • the change in R² in hierarchical regression
                                                 507
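• A sketch with the pwr package, putting the formula above to work: how many cases to detect a medium effect (R² = 0.13) at α = .05 with power = .80? (The three predictors are an assumption, for illustration.)

    library(pwr)
    f2  <- 0.13 / (1 - 0.13)        # f^2 from R^2, as above
    res <- pwr.f2.test(u = 3, f2 = f2, sig.level = 0.05, power = 0.80)
    ceiling(res$v + 3 + 1)          # N = error df + k + 1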
• N – the sample size
  – usually use other three parameters to
    determine this
  – sometimes adjust other parameters (α) based on this
  – e.g. You can have 50 participants. No
    more.




                                            508
Doing power analysis
• With power analysis program
  – SamplePower, GPower, Nquery


• With SPSS MANOVA
  – using non-central distribution functions
  – Uses MANOVA syntax
    • Relies on the fact you can do anything with
      MANOVA
    • Paper B4


                                                    509
     Underpowered Studies
• Research in the social sciences is often
  underpowered
  – Why?
  – See Paper B11 – “the persistence of
    underpowered studies”




                                          510
             Extra Reading
• Power traditionally focuses on p values
  – What about CIs?
  – Paper B8 – “Obtaining regression
    coefficients that are accurate, not simply
    significant”




                                                 511
Collinearity




               512
    Collinearity as Issue and
           Assumption
• Collinearity (multicollinearity)
  – the extent to which the independent
    variables are (multiply) correlated
• If R2 for any IV, using other IVs = 1.00
  – perfect collinearity
  – variable is linear sum of other variables
  – regression will not proceed
  – (SPSS will arbitrarily throw out a variable)
                                               513
• R2 < 1.00, but high
  – other problems may arise
• Four things to look at in collinearity
  – meaning
  – implications
  – detection
  – actions




                                           514
      Meaning of Collinearity
• Literally „co-linearity‟
  – lying along the same line
• Perfect collinearity
  – when some IVs predict another
  – Total = S1 + S2 + S3 + S4
  – S1 = Total – (S2 + S3 + S4)
  – rare

                                    515
• Less than perfect
  – when some IVs are close to predicting
  – correlations between IVs are high (usually,
    but not always)




                                             516
             Implications
• Affects the stability of the parameter estimates
  – and so the standard errors of the
    parameter estimates
  – and so the significance
• Because
  – shared variance, which the regression
    procedure doesn‟t know where to put
                                            517
• Red cars have more accidents than
  other coloured cars
  – because of the effect of being in a red car?
  – because of the kind of person that drives a
    red car?
    • we don‟t know
  – No way to distinguish between these three:
       Accidents = 1 × colour + 0 × person
       Accidents = 0 × colour + 1 × person
       Accidents = 0.5 × colour + 0.5 × person

                                              518
• Sex differences
  – due to genetics?
  – due to upbringing?
  – (almost) perfect collinearity
     • statistically impossible to tell




                                          519
• When collinearity is less than perfect
  – increases variability of estimates between
    samples
  – estimates are unstable
  – reflected in the variances, and hence
    standard errors




                                                 520
      Detecting Collinearity
• Look at the parameter estimates
  – large standardised parameter estimates
    (>0.3?), which are not significant
     • be suspicious
• Run a series of regressions
  – each IV as DV
  – all other IVs as IVs
     • for each IV
                                             521
• Sounds like hard work?
  – SPSS does it for us!
• Ask for collinearity diagnostics
  – Tolerance – calculated for every IV

        Tolerance = 1 − R²

  – Variance Inflation Factor
     • its square root is the amount the s.e. has been increased by

        VIF = 1 / Tolerance
                                                   522
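• The same diagnostics by hand in R, for one IV (x1, x2, x3 and dat are hypothetical; the car package's vif() does every IV at once):

    aux <- lm(x1 ~ x2 + x3, data = dat)   # regress one IV on the others
    tol <- 1 - summary(aux)$r.squared     # tolerance = 1 - R^2
    1 / tol                               # VIF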
                      Actions
What you can do about collinearity
             “no quick fix” (Fox, 1991)
1. Get new data
  •   avoids the problem
  •   address the question in a different way
  •   e.g. find people who have been raised as
      the „wrong‟ gender
      •   exist, but rare
  •   Not a very useful suggestion
                                             523
2. Collect more data
  •   not different data, more data
  •   collinearity increases standard error (se)
  •   se decreases as N increases
      •   get a bigger N
3. Remove / Combine variables
  •   If an IV correlates highly with other IVs
  •   Not telling us much new
  •   If you have two (or more) IVs which are
      very similar
      •   e.g. 2 measures of depression, socio-
          economic status, achievement, etc
                                                  524
      •   sum them, average them, remove one
  •   Many measures
      •   use principal components analysis to reduce
          them
4. Use stepwise regression (or some
   flavour of)
  •   See previous comments
  •   Can be useful in theoretical vacuum
5. Ridge regression
  •   not very useful
  •   behaves weirdly

                                                        525
Measurement Error




                    526
  What is Measurement Error
• In social science, it is unlikely that we
  measure any variable perfectly
  – measurement error represents this
    imperfection
• We assume that we have a true score
  – T
• A measure of that score
  –x

                                              527
              x T e
• just like a regression equation
  – standardise the parameters
  – T is the reliability
     • the amount of variance in x which comes from T
• but, like a regression equation
  – assume that e is random and has mean of zero
  – more on that later


                                                   528
        Simple Effects of
       Measurement Error
• Lowers the measured correlation
  – between two variables
• Real correlation
  – true scores (x* and y*)
• Measured correlation
  – measured scores (x and y)


                                    529
               True correlation of x* and y*:  r(x*y*)

                    x* ------------- y*
                    |                |
             Reliability of x   Reliability of y
                   rxx              ryy
                    |                |
            e -->   x                y   <-- e

               Measured correlation of x and y:  r(xy)
                                                             530
• Attenuation of correlation

        r(xy) = r(x*y*) × √(rxx × ryy)

• Attenuation corrected correlation

        r(x*y*) = r(xy) / √(rxx × ryy)
                                      531
• Example

        rxx = 0.7,  ryy = 0.8,  rxy = 0.3

        r(x*y*) = rxy / √(rxx × ryy)
                = 0.3 / √(0.7 × 0.8)
                = 0.40


                                          532
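• The example as a two-line check in R:

    rxx <- 0.7; ryy <- 0.8; rxy <- 0.3
    rxy / sqrt(rxx * ryy)     # attenuation-corrected correlation, ~0.40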
        Complex Effects of
        Measurement Error
• Really horribly complex
• Measurement error reduces correlations
  – reduces the estimate of β
  – reducing one estimate
    • increases others
  – because of effects of control
  – combined with effects of suppressor
    variables
  – exercise to examine this
                                          533
   Dealing with Measurement
              Error

• Attenuation correction
   – very dangerous
   – not recommended
• Avoid in the first place
   – use reliable measures
   – don‟t discard information
      • don‟t categorise
      • Age: 10-20, 21-30, 31-40 …

                                     534
               Complications
• Assume measurement error is
  – additive
  – linear
• Additive
  – e.g. weight – people may under-report / over-
    report at the extremes
• Linear
  – particularly the case when using proxy variables


                                                       535
• e.g. proxy measures
  – Want to know effort on childcare, count
    number of children
    • 1st child is more effort than last
  – Want to know financial status, count
    income
    • 1st £10 much greater effect on financial status
      than the 1000th.




                                                    536
Lesson 10: Non-Linear
Analysis in Regression




                         537
             Introduction
• Non-linear effect occurs
  – when the effect of one independent
    variable
  – is not consistent across the range of the IV
• Assumption is violated
  – expected value of residuals = 0
  – no longer the case

                                              538
Some Examples




                539
        A Learning Curve
[Plot: Skill (y) against Experience (x) – a learning curve]
                                                    540
               Yerkes-Dodson Law of Arousal
[Plot: Performance (y) against Arousal (x) – an inverted U]
                                                    541
          Enthusiasm Levels over a Lesson on Regression
[Plot: Enthusiasm (from Suicidal to Enthusiastic) against Time (0 to 3.5)]
                                                    542
• Learning
  – line changed direction once
• Yerkes-Dodson
  – line changed direction once
• Enthusiasm
  – line changed direction twice




                                   543
    Everything is Non-Linear
• Every relationship we look at is non-
  linear, for two reasons
  – Exam results cannot keep increasing with
    reading more books
     • Linear in the range we examine
  – For small departures from linearity
     • Cannot detect the difference
     • Non-parsimonious solution

                                               544
Non-Linear Transformations




                             545
            Bending the Line
• Non-linear regression is hard
  – We cheat, and linearise the data
     • Do linear regression
Transformations
• We need to transform the data
  – rather than estimating a curved line
     • which would be very difficult
     • may not work with OLS
  – we can take a straight line, and bend it
  – or take a curved line, and straighten it
     • back to linear (OLS) regression

                                               546
• We still do linear regression
  – Linear in the parameters
  – Y = b1x + b2x² + …
• Can also do non-linear regression
  – Non-linear in the parameters
  – e.g. Y = b1x^(b2) + …
• Much trickier
  – Statistical theory either breaks down OR becomes harder
                                               547
• Linear transformations
  – multiply by a constant
  – add a constant
  – change the slope and the intercept




                                         548
[Plot: the lines y = x, y = x + 3, and y = 2x]
                           549
• Linear transformations are no use
  – alter the slope and intercept
  – don‟t alter the standardised parameter
    estimate
• Non-linear transformation
  – will bend the slope
  – quadratic transformation
        y = x²
  – one change of direction


                                             550
– Cubic transformation
        y = x² + x³
– two changes of direction




                               551
  Quadratic Transformation

        y = 0 + 0.1x + 1x²
[Plot: a quadratic curve]
                             552
Square Root Transformation




        y=20 + -3x + 5x


                             553
        Cubic Transformation

        y = 3 − 4x + 2x² − 0.2x³
[Plot: a cubic curve with two changes of direction]


                                   554
    Logarithmic Transformation

        y = 1 + 0.1x + 10·log(x)
[Plot: a logarithmic curve]




                                 555
Inverse Transformation


        y = 20 − 10x + 8(1/x)
[Plot: an inverse curve]




                         556
• To estimate a non-linear regression
  – we don‟t actually estimate anything non-
    linear
  – we transform the x-variable to a non-linear
    version
  – can estimate that straight line
  – represents the curve
  – we don‟t bend the line, we stretch the
    space around the line, and make it flat



                                             557
Detecting Non-linearity




                          558
         Draw a Scatterplot
• Draw a scatterplot of y plotted against x
  – see if it looks a bit non-linear
  – e.g. Anscombe‟s data
  – e.g. Education and beginning salary
     • from bank data
     • drawn in SPSS
     • with line of best fit


                                          559
 • Anscombe (1973)
    – constructed a set of datasets
    – show the importance of graphs in
      regression/correlation
• For each dataset:

   N                                       11
   Mean of x                                9
   Mean of y                              7.5
   Equation of regression line    y = 3 + 0.5x
   Sum of squares (X − mean)              110
   Correlation coefficient               0.82
   R²                                    0.67
                                                560
[Slides 561–564: scatterplots of Anscombe's four datasets, each with the same line of best fit]
          A Real Example
• Starting salary and years of education
  – From employee data.sav




                                           565
[Scatterplot: starting salary against Educational Level (years) – in some ranges the expected value of the error (residual) is > 0, in others it is < 0]
                                                 566
         Use Residual Plot
• Scatterplot is only good for one variable
  – use the residual plot (that we used for
    heteroscedasticity)
• Good for many variables




                                              567
• We want
  – points to lie in a nice straight sausage




                                               568
• We don‟t want
  – a nasty bent sausage




                           569
• Educational level and starting salary
[Residual plot: standardised residuals (−2 to 10) against standardised predicted values (−2 to 3) – a curved pattern]


                                          570
Carrying Out Non-Linear
       Regression




                          571
      Linear Transformation
• Linear transformation doesn‟t change
  – interpretation of slope
  – standardised slope
  – se, t, or p of slope
  – R2
• Can change
  – effect of a transformation

                                         572
• Actually more complex
  – with some transformations can add a
    constant with no effect (e.g. quadratic)
• With others does have an effect
  – inverse, log
• Sometimes it is necessary to add a
  constant
  – negative numbers have no square root
  – 0 has no log



                                               573
       Education and Salary
Linear Regression
• Saw previously that the assumption of
  expected errors = 0 was violated
• Anyway …
  – R² = 0.401, F = 315, df = 1, 472, p < 0.001
  – salbegin = −6290 + 1727 × educ
  – Standardised
     • b1 (educ) = 0.633
  – Both parameters make sense
                                                574
Non-linear Effect
• Compute new variable
  – quadratic
  – educ2 = educ²
• Add this variable to the equation
  – R² = 0.585, p < 0.001
  – salbegin = 46263 − 6542 × educ + 310 × educ²
     • slightly curious
  – Standardised
     • b1 (educ) = −2.4
     • b2 (educ2) = 3.1
  – What is going on?

                                             575
• Collinearity
  – is what is going on
  – Correlation of educ and educ2
     • r = 0.990
  – Regression equation becomes difficult
    (impossible?) to interpret
• Need hierarchical regression
  – what is the change in R2
  – is that change significant?
  – R2 (change) = 0.184, p < 0.001

                                            576
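• The hierarchical test sketched in R (variable names from the employee-data example are assumptions):

    m1 <- lm(salbegin ~ educ, data = dat)
    m2 <- lm(salbegin ~ educ + I(educ^2), data = dat)
    anova(m1, m2)                                   # is the R^2 change significant?
    summary(m2)$r.squared - summary(m1)$r.squared   # the R^2 change itself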
Cubic Effect
• While we are at it, let‟s look at the cubic
  effect
  – R² (change) = 0.004, p = 0.045
  – salbegin = 19138 + 103 × e − 206 × e² + 12 × e³
  – Standardised:
      b1(e) = 0.04
      b2(e²) = −2.04
      b3(e³) = 2.71


                                            577
Fourth Power
• Keep going while we are ahead
  – won‟t run
     • ???
• Collinearity is the culprit
  – Tolerance (educ4) = 0.000005
  – VIF = 215555
• Matrix of correlations of IVs is not
  positive definite
  – cannot be inverted

                                         578
Interpretation
• Tricky, given that parameter estimates
  are a bit nonsensical
• Two methods
• 1: Use R2 change
  – Save predicted values
     • or calculate predicted values to plot line of best
       fit
  – Save them from equation
  – Plot against IV

                                                       579
[Plot: predicted starting salary (0–50000) against Education (Years) (8–22), with Linear, Quadratic and Cubic fit lines]
                                                                580
• Differentiate with respect to e
• We said:
      s = 19138 + 103 × e − 206 × e² + 12 × e³
  – but first we will simplify it to the quadratic:
      s = 46263 − 6542 × e + 310 × e²

• dy/dx = −6542 + 2 × 310 × e
                                                 581
   Education    Slope
       9         −962
      10         −342
      11          278
      12          898
      13         1518
      14         2138
      15         2758
      16         3378
      17         3998
      18         4618
      19         5238
      20         5858

   1 year of education at the higher end of the scale is worth more
   than 1 year at the lower end of the scale: MBA versus GCSE.
                                        582
• Differentiate the cubic:
      s = 19138 + 103 × e − 206 × e² + 12 × e³
      dy/dx = 103 − 2 × 206 × e + 3 × 12 × e²

• Can calculate slopes for quadratic and cubic at different values
                                           583
   Education   Slope (Quad)   Slope (Cub)
       9           −962           −689
      10           −342           −417
      11            278            −73
      12            898            343
      13           1518            831
      14           2138           1391
      15           2758           2023
      16           3378           2727
      17           3998           3503
      18           4618           4351
      19           5238           5271
      20           5858           6263
                                      584
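• The two slope columns can be generated directly from the derivatives above:

    e    <- 9:20
    quad <- -6542 + 2 * 310 * e                # slope of the quadratic at e
    cub  <- 103 - 2 * 206 * e + 3 * 12 * e^2   # slope of the cubic at e
    data.frame(Education = e, Slope_Quad = quad, Slope_Cub = cub)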
          A Quick Note on
           Differentiation
• For y = x^p
  – dy/dx = p·x^(p−1)
• For equations such as
      y = b1x + b2x^p
      dy/dx = b1 + b2·p·x^(p−1)

• y = 3x + 4x²
  – dy/dx = 3 + 4 · 2x = 3 + 8x
                             585
• y = b1x + b2x² + b3x³
  – dy/dx = b1 + b2 · 2x + b3 · 3 · x²

• y = 4x + 5x² + 6x³
  – dy/dx = 4 + 5 · 2 · x + 6 · 3 · x² = 4 + 10x + 18x²

• Many functions are simple to
  differentiate
  – Not all though


                                         586
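• R will even do the differentiation symbolically, via the built-in D():

    D(expression(4 * x + 5 * x^2 + 6 * x^3), "x")
    # 4 + 5 * (2 * x) + 6 * (3 * x^2)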
    Automatic Differentiation
• If you
  – Don‟t know how to differentiate
  – Can‟t be bothered to look up the function
• Can use automatic differentiation
  software
  – e.g. GRAD (freeware)



                                                587
Lesson 11: Logistic Regression

  Dichotomous/Nominal Dependent
            Variables



                                  589
                 Introduction
• Often in social sciences, we have a
  dichotomous/nominal DV
  – we will look at dichotomous first, then a quick look
    at multinomial
• Dichotomous DV
• e.g.
  –   guilty/not guilty
  –   pass/fail
  –   won/lost
  –   Alive/dead (used in medicine)
                                                     590
Why Won‟t OLS Do?




                    591
     Example: Passing a Test
• Test for bus drivers
  – pass/fail
  – we might be interested in degrees of pass fail
     • a company which trains them will not
     • fail means „pay for them to take it again‟
• Develop a selection procedure
  – Two predictor variables
  – Score – Score on an aptitude test
  – Exp – Relevant prior experience (months)

                                                     592
• 1st ten cases:

   Score   Exp   Pass
     5       6     0
     1      15     0
     1      12     0
     4       6     0
     1      15     1
     1       6     0
     4      16     1
     1      10     1
     3      12     0
     4      26     1
                               593
• DV
  – pass (1 = Yes, 0 = No)
• Just consider score first
  – Carry out regression
  – Score as IV, Pass as DV
  – R2 = 0.097, F = 4.1, df = 1, 48, p = 0.028.
  – b0 = 0.190
  – b1 = 0.110, p=0.028
       • Seems OK



                                             594
• Or does it? …
• 1st Problem – P-P plot of the residuals
[P-P plot: Expected Cum Prob against Observed Cum Prob (0 to 1) – the points stray from the diagonal]
                                                   595
• 2nd problem – residual plot
[Residual plot: residuals against predicted values]
                                596
• Problems 1 and 2
  – strange distributions of residuals
  – parameter estimates may be wrong
  – standard errors will certainly be wrong




                                              597
• 3rd problem – interpretation
  – I score 2 on aptitude:
    Pass = 0.190 + 0.110 × 2 = 0.41
  – I score 8 on the test:
    Pass = 0.190 + 0.110 × 8 = 1.07
• Seems OK, but
  – What does it mean?
  – Cannot score 0.41 or 1.07
     • can only score 0 or 1
• Cannot be interpreted
  – need a different approach
                                      598
A Different Approach
 Logistic Regression




                       599
       Logit Transformation
• In lesson 10, transformed IVs
  – now transform the DV
• Need a transformation which gives us
  – graduated scores (between 0 and 1)
  – No upper limit
    • we can‟t predict someone will pass twice
  – No lower limit
    • you can‟t do worse than fail
                                                 600
Step 1: Convert to Probability
• First, stop talking about values
  – talk about probability
  – for each value of score, calculate
    probability of pass
• Solves the problem of graduated scales



                                         601
   probability of failure given a score of 1 is 0.7

          Score    1    2    3    4    5
   Fail   N        7    5    6    4    2
          P      0.7  0.5  0.6  0.4  0.2
   Pass   N        3    5    4    6    8
          P      0.3  0.5  0.4  0.6  0.8

   probability of passing given a score of 5 is 0.8
                                     602
This is better
• Now a score of 0.41 has a meaning
  – a 0.41 probability of pass
• But a score of 1.07 has no meaning
  – cannot have a probability > 1 (or < 0)
  – Need another transformation




                                             603
Step 2: Convert to Odds-Ratio
Need to remove upper limit
• Convert to odds
• Odds, as used by betting shops
  – 5:1, 1:2
• Slightly different from odds in speech
  – a 1 in 2 chance
  – odds are 1:1 (evens)
  – 50%
                                           604
• Odds ratio = (number of times it happened) / (number of times it didn't happen)

        odds = p(event) / p(not event) = p(event) / (1 − p(event))




                                            605
• p = 0.8: odds = 0.8/0.2 = 4
  – equivalent to 4:1 (odds on)
  – 4 times out of five
• p = 0.2: odds = 0.2/0.8 = 0.25
  – equivalent to 1:4 (4:1 against)
  – 1 time out of five

                                      606
• Now we have solved the upper bound
  problem
  – we can interpret 1.07, 2.07, 1000000.07
• But we still have the zero problem
  – we cannot interpret predicted scores less
    than zero




                                                607
         Step 3: The Log
• Log10 of a number (x):

        10^log(x) = x

   • log(10) = 1
   • log(100) = 2
   • log(1000) = 3
                           608
• log(1) = 0
• log(0.1) = -1
• log(0.00001) = -5




                      609
           Natural Logs and e
• Don‟t use log10
  – Use loge
• Natural log, ln
• Has some desirable properties, that log10
  doesn‟t
  –   For us
  –   If y = ln(x) + c
  –    dy/dx = 1/x
  –   Not true for any other logarithm

                                              610
• Be careful – calculators and stats
  packages are not consistent when they
  use log
  – Sometimes log10, sometimes loge
  – Can prove embarrassing (a friend told me)




                                            611
Take the natural log of the odds ratio
• Goes from −∞ to +∞
  – can interpret any predicted value




                                         612
   Putting them all together
• Logit transformation
  – log-odds ratio
  – not bounded at zero or one




                                 613
           Score     1     2     3     4      5
   Fail    N         7     5     6     4      2
           P       0.7   0.5   0.6   0.4    0.2
   Pass    N         3     5     4     6      8
           P       0.3   0.5   0.4   0.6    0.8

   Odds (Fail)    2.33  1.00  1.50  0.67   0.25
   log(odds)Fail  0.85  0.00  0.41 −0.41  −1.39




                                        614
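• The whole table in a few lines of R: probability to odds to logit, and back again:

    p     <- c(0.7, 0.5, 0.6, 0.4, 0.2)   # P(fail) for scores 1 to 5
    odds  <- p / (1 - p)                  # 2.33 1.00 1.50 0.67 0.25
    logit <- log(odds)                    # 0.85 0.00 0.41 -0.41 -1.39
    plogis(logit)                         # inverse logit recovers p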
[Plot: probability (0 to 1) against logit (−3.5 to +3.5) – an S-shaped curve. Probability gets closer to zero, but never reaches it, as the logit goes down.]


                                                                                             615
• Hooray! Problem solved, lesson over
  – errrmmm… almost
• Because we are now using log-odds
  ratio, we can‟t use OLS
  – we need a new technique, called Maximum
    Likelihood (ML) to estimate the parameters




                                            616
  Parameter Estimation using
             ML
ML tries to find estimates of model
  parameters that are most likely to give
  rise to the pattern of observations in
  the sample data
• All gets a bit complicated
  – OLS is a special case of ML
  – the mean is an ML estimator

                                            617
• Don‟t have closed form equations
  – must be solved iteratively
  – estimates parameters that are most likely
    to give rise to the patterns observed in the
    data
  – by maximising the likelihood function (LF)
• We aren‟t going to worry about this
  – except to note that sometimes, the
    estimates do not converge
     • ML cannot find a solution



                                              618
        Interpreting Output
Using SPSS
• Overall fit for:
  – step (only used for stepwise)
  – block (for hierarchical)
  – model (always)
  – in our model, all are the same
  – χ² = 4.99, df = 1, p = 0.025
     • analogous to the F test in OLS regression

                                     619
   Omnibus Tests of Model Coefficients

                    Chi-square   df    Sig.
   Step 1   Step       4.990      1    .025
            Block      4.990      1    .025
            Model      4.990      1    .025




                                                         620
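• The same model fitted in R with glm(); the data frame drivers is a hypothetical stand-in for the bus-driver data:

    fit0 <- glm(pass ~ 1,     family = binomial, data = drivers)  # null model
    fit1 <- glm(pass ~ score, family = binomial, data = drivers)
    anova(fit0, fit1, test = "Chisq")   # model chi-square, as in the table above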
• Model summary
  – −2LL (differences in −2LL between models are distributed as χ²)
  – Cox & Snell R²
  – Nagelkerke R²
  – Different versions of R²
     • No real R² in logistic regression
     • should be considered 'pseudo R²'




                                           621
   Model Summary

            −2 Log        Cox & Snell   Nagelkerke
   Step     likelihood     R Square      R Square
   1          64.245          .095          .127




                                                  622
• Classification Table
  – predictions of model
  – based on cut-off of 0.5 (by default)
  – predicted values x actual values




                                           623
   Classification Table (a)

                                       Predicted
                                     PASS
                                   0       1    Percentage Correct
   Step 1   PASS            0     18       8         69.2
                            1     12      12         50.0
            Overall Percentage                       60.0

   a. The cut value is .500




                                                                                 624
Model parameters
•B
  – Change in the logged odds associated with
    a change of 1 unit in IV
  – just like OLS regression
  – difficult to interpret
• SE (B)
  – Standard error
  – B ± 1.96 × SE(B) gives the 95% CI



                                           625
   Variables in the Equation

                            B      S.E.    Wald
   Step 1(a)   SCORE      −.467    .219   4.566
               Constant   1.314    .714   3.390
   a. Variable(s) entered on step 1: SCORE.

   Variables in the Equation

                                      95.0% C.I. for EXP(B)
                         Sig.  Exp(B)   Lower      Upper
   Step 1(a)   score     .386   1.263    .744      2.143
               Constant  .199    .323
   a. Variable(s) entered on step 1: score.

                                                                       626
• Constant
  – i.e. score = 0
  – B = 1.314
  – Exp(B) = e^1.314 = 3.720
  – odds = 3.720
  – probability = odds / (odds + 1) = 3.720 / 4.720
  – probability = 0.788




                                         627
• Score = 1
  – Constant B = 1.314
  – Score B = -0.467
  – Exp(1.314 – 0.467) = Exp(0.847) = 2.332
  – odds = 2.332
  – probability = 2.332 / (2.332 + 1)
    = 0.699
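The same calculation in R (coefficients taken from the slides above):

   b0 <- 1.314; b1 <- -0.467
   odds <- exp(b0 + b1 * 1)   # odds of passing when score = 1: 2.332
   odds / (1 + odds)          # probability: 0.699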




                                      628
    Standard Errors and CIs
• SPSS gives
  – B, SE B, exp(B) by default
  – Can work out 95% CI from standard error
  – B ± 1.96 x SE(B)
  – Or ask for it in options
• Symmetrical in B
  – Non-symmetrical (sometimes very) in
    exp(B)
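A small R sketch of this, using the estimates from the output below:

   B <- -0.467; SE <- 0.219
   B + c(-1.96, 1.96) * SE        # CI for B: symmetrical (-0.896, -0.038)
   exp(B)                         # odds ratio: 0.627
   exp(B + c(-1.96, 1.96) * SE)   # CI for exp(B): asymmetrical (0.408, 0.963)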

                                          629
Variables in the Equation (a)

             B        S.E.    Exp(B)   95.0% C.I. for EXP(B)
                                       Lower      Upper
SCORE        -.467    .219    .627     .408       .962
Constant     1.314    .714    3.720

a. Variable(s) entered on step 1: SCORE.




                                                         630
• The odds of passing the test are multiplied by 0.63 (95% CI =
  0.408 to 0.962, p = 0.033) for every additional point on the
  aptitude test.




                                        631
    More on Standard Errors
• In OLS regression
  – If a variable is added in a hierarchical fashion
  – The p-value associated with the change in R2 is
    the same as the p-value of the variable
  – Not the case in logistic regression
     • In our data 0.025 and 0.033
• Wald standard errors
  – make the p-values of the estimates wrong – too high
  – (CIs still correct)

                                                       632
• The two tests use slightly different
  information
  – P-value says “what if no effect”
  – CI says “what if this effect”
     • Variance depends on the hypothesised ratio of the
       number of people in the two groups
• Can calculate likelihood ratio based p-
  values
  – If you can be bothered
  – Some packages provide them automatically
                                                     633
          Probit Regression
• Very similar to logistic
  – but with a more complex initial transformation (to the
    cumulative normal distribution)
  – results very similar to logistic (logistic estimates ≈ probit
    estimates × 1.7)
• In SPSS:
  – A bit weird
     • Probit regression available through menus


                                                   634
  – But requires data structured differently
• However
  – Ordinal logistic regression is equivalent to
    binary logistic
     • If outcome is binary
  – SPSS gives option of probit




                                                   635
                    Results
                       Variable   Estimate   SE      P
Logistic (binary)      Score      0.288      0.301   0.339
                       Exp        0.147      0.073   0.043
Logistic (ordinal)     Score      0.288      0.301   0.339
                       Exp        0.147      0.073   0.043
Probit                 Score      0.191      0.178   0.282
                       Exp        0.090      0.042   0.033


                                            636
Differentiating Between Probit
          and Logistic
• Depends on shape of the error term
   – Normal or logistic
   – Graphs are very similar to each other
      • Could distinguish quality of fit
          – Given enormous sample size
• Logistic = probit x 1.7
   – Actually 1.6998
• Probit advantage
   – the underlying (normal) distribution is well understood
• Logistic advantage
   – much simpler to get back to the probability

                                                   637
[Graph: cumulative logistic and normal (probit) curves plotted from
-3 to 3; the two S-shaped curves are almost indistinguishable]

                                                              638
        Infinite Parameters
• Non-convergence can happen because
  of infinite parameters
  – Insoluble model
• Three kinds:
• Complete separation
  – The groups are completely distinct
    • Pass group all score more than 10
    • Fail group all score less than 10

                                          639
• Quasi-complete separation
  – Separation with some overlap
     • Pass group all score 10 or more
     • Fail group all score 10 or less
• Both cases:
  – No convergence
• Close to this
  – Curious estimates
  – Curious standard errors

                                         640
• Categorical Predictors
   – Can cause separation
   – Esp. if correlated
          • Need people in every cell

                        Male                    Female
                  White    Non-White      White     Non-White
Below Poverty
Line
Above Poverty
Line
                                                              641
      Logistic Regression and
              Diagnosis
• Logistic regression can be used for diagnostic
  tests
   – For every score
      • Calculate probability that result is positive
      • Calculate proportion of people with that score (or lower)
        who have a positive result
• Calculate c statistic
   – Measure of discriminative power
   – percentage of all possible pairs of cases where the model gives
     a higher probability to the case with the positive result than
     to the case without
                                                               642
  – Perfect c-statistic = 1.0
  – Random c-statistic = 0.5
• SPSS doesn‟t do it automatically
  – But easy to do
• Save probabilities
  – Use Graphs, ROC Curve
  – Test variable: predicted probability
  – State variable: outcome
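If you would rather compute it directly, a minimal R sketch (prob =
the saved predicted probabilities, outcome = the observed 0/1 values;
both names are hypothetical):

   c_stat <- function(prob, outcome) {
     pos <- prob[outcome == 1]              # predicted p for positives
     neg <- prob[outcome == 0]              # predicted p for negatives
     # proportion of (pos, neg) pairs ranked correctly; ties count half
     mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
   }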



                                           643
    Sensitivity and Specificity
• Sensitivity:
  – probability of saying someone has a positive result
     • if they do: p(pos | pos)
• Specificity
  – probability of saying someone has a negative result
     • if they do: p(neg | neg)

                                          644
   Calculating Sens and Spec
• For each value
  – Calculate
     • proportion of minority earning less – p(m)
     • proportion of non-minority earning less – p(w)
  – Sensitivity (value)
     • P(m)




                                                    645
Salary   P(minority)
 10         .39
 20         .31
 30         .23
 40         .17
 50         .12
 60         .09
 70         .06
 80         .04
 90         .03
                       646
          Using Bank Data
• Predict minority group, using salary
  (000s)
  – logit(minority) = -0.044 + salary × (-0.039)
• Find actual proportions




                                             647
[ROC curve: sensitivity plotted against 1 - specificity; the area
under the curve is the c-statistic. Diagonal segments are produced
by ties.]
                                                                        648
  More Advanced Techniques
• Multinomial logistic regression – more
  than two categories in the DV
  – same procedure
  – one category chosen as reference group
    • odds of being in each other category rather than the reference
• Polytomous Logit Universal Models
  (PLUM)
  – Ordinal multinomial logistic regression
  – For ordinal outcome variables
                                                   649
            Final Thoughts
• Logistic Regression can be extended
  – dummy variables
  – non-linear effects
  – interactions (even though we don‟t cover
    them until the next lesson)
• Same issues as OLS
  – collinearity
  – outliers

                                               650
651
652
Lesson 12: Mediation and Path
           Analysis




                           653
                Introduction
• Moderator
   – Level of one variable influences effect of another
     variable
• Mediator
   – One variable influences another via a third
     variable
• All relationships are really mediated
   – are we interested in the mediators?
   – can we make the process more explicit?
                                                      654
• In examples with bank


   education ──► beginning salary


• Why?
  – What is the process?
  – Are we making assumptions about the
    process?
  – Should we test those assumptions?
                                          655
  education ──► { job skills, expectations,
                  negotiating skills, kudos for bank } ──► beginning salary
                                 656
Direct and Indirect Influences
X may affect Y in two ways
• Directly – X has a direct (causal)
  influence on Y
  – (or maybe mediated by other variables)
• Indirectly – X affects Y via a mediating
  variable - M


                                             657
• e.g. how does going to the pub affect
  comprehension on a Summer school
  course
  – on, say, regression

  having fun in pub ──► not reading books on regression ──► less knowledge
          └──────────────── anything here? ────────────────────┘
                                             658
  having fun in pub ──► not reading books on regression ──► less knowledge
          └──► fatigue ─────────────────────────────────────┘
          (is the direct path still needed?)
                                       659
• Mediators needed
  – to cope with more sophisticated theory in
    social sciences
  – make explicit assumptions made about
    processes
  – examine direct and indirect influences




                                                660
Detecting Mediation




                      661
                 4 Steps
From Baron and Kenny (1986)
• To establish that the effect of X on Y is
   mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If effect of X controlling for M is zero, M
   is complete mediator of the relationship
  •   (3 and 4 in same analysis)
                                          662
Example: Book habits

     Enjoy books → Buy books → Read books

                       663
          Three Variables
• Enjoy
  – How much an individual enjoys books
• Buy
  – How many books an individual buys (in a
    year)
• Read
  – How many books an individual reads (in a
    year)

                                              664
        ENJOY BUY      READ
ENJOY       1.00  0.64     0.73
BUY         0.64  1.00     0.75
READ        0.73  0.75     1.00




                                  665
• The Theory

    enjoy → buy → read




                            666
• Step 1
1. Show that X (enjoy) predicts Y (read)
  – b1 = 0.487, p < 0.001
  – standardised b1 = 0.732
  – OK




                                       667
2. Show that X (enjoy) predicts M (buy)
  – b1 = 0.974, p < 0.001
  – standardised b1 = 0.643
  – OK




                                      668
3. Show that M (buy) predicts Y (read),
   controlling for X (enjoy)
  – b1 = 0.469, p < 0.001
  – standardised b1 = 0.206
  – OK




                                          669
4. If effect of X controlling for M is zero,
   M is complete mediator of the
   relationship
  – (Same as analysis for step 3.)
  – b2 = 0.287, p = 0.001
  – standardised b2 = 0.431
  – Hmmmm…
     •   Significant, therefore not a complete mediator



                                                    670
   enjoy ── 0.287 (step 4) ──► read
   enjoy ── 0.974 (from step 2) ──► buy
   buy   ── 0.206 (from step 3) ──► read
                                      671
   The Mediation Coefficient
• Amount of mediation =
              Step 1 – Step 4
              =0.487 – 0.287
                  = 0.200
• OR
              Step 2 x Step 3
              =0.974 x 0.206
                  = 0.200
                                672
          SE of Mediator
  enjoy ──a──► buy ──b──► read
     a (from step 2)    b (from step 3)

• sa = se(a)
• sb = se(b)

                                               673
• Sobel test
  – standard error of the mediation coefficient can be calculated:

    se(a·b) = √( b²·sa² + a²·sb² − sa²·sb² )

  a = 0.974              b = 0.206
  sa = 0.189             sb = 0.054


                                             674
• Indirect effect = 0.200
  – se = 0.056
  – t =3.52, p = 0.001
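A sketch of the same calculation in R. With the rounded values
printed on the slide it gives a slightly larger se (about 0.065) than
the 0.056 above, which presumably comes from unrounded estimates:

   a <- 0.974; b <- 0.206; sa <- 0.189; sb <- 0.054
   ab <- a * b                                        # indirect effect
   se <- sqrt(b^2 * sa^2 + a^2 * sb^2 - sa^2 * sb^2)  # Sobel-type se
   ab / se                                            # z (compare t above)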
• Online Sobel test:
http://www.unc.edu/~preacher/sobel/
  sobel.htm
  – (Won‟t be there for long; probably will be
    somewhere else)




                                                 675
             A Note on Power
• Recently
  – Move in methodological literature away from this
    conventional approach
  – Problems of power:
  – Several tests, all of which must be significant
     • Type I error rate = 0.05 * 0.05 = 0.0025
     • Must affect power
  – Bootstrapping suggested as alternative
     • See Paper B7, A4, B9
     • B21 for SPSS syntax
                                                   676
677
678
Lesson 13: Moderators in
      Regression
  “different slopes for different
               folks”



                                    679
             Introduction
• Moderator relationships have many
  different names
  – interactions (from ANOVA)
  – multiplicative
  – non-linear (just confusing)
  – non-additive
• All talking about the same thing

                                      680
A moderated relationship occurs
• when the effect of one variable
  depends upon the level of another
  variable




                                      681
• Hang on …
  – That seems very like a nonlinear relationship
  – Moderator
     • Effect of one variable depends on level of another
  – Non-linear
     • Effect of one variable depends on level of itself
• Where there is collinearity
  – Can be hard to distinguish between them
  – Paper in handbook (B5)
  – Should (usually) compare effect sizes



                                                            682
• e.g. How much it hurts when I drop a
  computer on my foot depends on
  – x1: how much alcohol I have drunk
  – x2: how high the computer was dropped
    from
  – but if x1 is high enough
  – x2 will have no effect


                                            683
• e.g. Likelihood of injury in a car
  accident
  – depends on
  – x1: speed of car
  – x2: if I was wearing a seatbelt
  – but if x1 is low enough
  – x2 will have no effect




                                       684
[Graph: Injury (0-30) against Speed (5-45 mph), with separate lines
for Seatbelt and No Seatbelt; the lines diverge as speed increases]
                                                     685
• e.g. number of words (from a list) I can
  remember
  – depends on
  – x1: type of words (abstract, e.g. 'justice', or
    concrete, e.g. 'carrot')
  – x2: Method of testing (recognition – i.e.
    multiple choice, or free recall)
  – but if using recognition
  – x1: will not make a difference


                                                 686
• We looked at three kinds of moderator
• alcohol x height = pain
  – continuous x continuous
• speed x seatbelt = injury
  – continuous x categorical
• word type x test type
  – categorical x categorical
• We will look at them in reverse order



                                          687
 How do we know to look for
       moderators?
Theoretical rationale
• Often the most powerful
• Many theories predict additive/linear
  effects
  – Fewer predict moderator effects
Presence of heteroscedasticity
• Clue there may be a moderated
  relationship missing
                                                              688
Two Categorical Predictors




                             689
• 2 IVs               Data
  – word type (concrete [1], abstract [2])
  – test method (recog [1], recall [2])
• 20 Participants in one of four groups
  –   1,   1
  –   1,   2
  –   2,   1
  –   2,   2
• 5 per group
• lesson12.1.sav

                                             690
                  Concrete   Abstract   Total
 Recog   Mean      15.40      15.20     15.30
         SD         2.19       2.59      2.26
 Recall  Mean      15.60       6.60     11.10
         SD         1.67       7.44      6.95
 Total   Mean      15.50      10.90     13.20
         SD         1.84       6.94      5.47




                                              691
• Graph of means

[Graph: mean score by TEST (1 = recog, 2 = recall), one line per
WORDS type (1 = concrete, 2 = abstract); the abstract/recall mean
is far below the other three]
                                 692
          ANOVA Results
• Standard way to analyse these data
  would be to use ANOVA
  – Words: F=6.1, df=1, 16, p=0.025
  – Test: F=5.1, df=1, 16, p=0.039
  – Words x Test: F=5.6, df=1, 16, p=0.031




                                             693
      Procedure for Testing
1: Convert to effect coding
• could use dummy coding instead, but effect
  coding makes collinearity less of an issue
• doesn‟t make any difference to
  substantive interpretation
2: Calculate interaction term
• In ANOVA interaction is automatic
• In regression we create an interaction
  variable
                                                              694
• Interaction term (wxt)
  – multiply effect coded variables together


      word           test            wxt
       -1             -1              1
        1             -1             -1
       -1             1              -1
        1             1               1
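The whole procedure as an R sketch (the data frame d, with factors
word and test and outcome score, is hypothetical):

   d$w   <- ifelse(d$word == "concrete", -1, 1)   # effect coding
   d$t   <- ifelse(d$test == "recog",    -1, 1)
   d$wxt <- d$w * d$t                             # interaction term

   m1 <- lm(score ~ w + t, data = d)              # linear effects first
   m2 <- lm(score ~ w + t + wxt, data = d)        # add the interaction
   anova(m1, m2)                                  # test change in R2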

                                               695
3: Carry out regression
• Hierarchical
  – linear effects first
  – interaction effect in next block




                                       696
• b0=13.2
• b1 (words) = -2.3, p=0.025
• b2 (test) = -2.1, p=0.039
• b3 (words x test) = -2.2, p=0.031
• Might need to use change in R2 to test
  sig of interaction, because of collinearity
What do these mean?
• b0 (intercept) = predicted value of Y
  (score) when all X = 0
    – i.e. the central point

                                           697
• b0 = 13.2
  – grand mean
• b1 = -2.3
  – distance from grand to mean for two word
    types
  – 13.2 – (-2.3) = 15.5
  – 13.2 + (-2.3) = 10.9

              Concrete Abstract Total
     Recog       15.40     15.20    15.30
     Recall      15.60      6.60    11.10
      Total      15.50     10.90    13.20
                                               698
• b2 = -2.1
  – distance from grand mean to recog and
    recall means
• b3 = -2.2
  – to understand b3 we need to look at
    predictions from the equation without this
    term
Score = 13.2 + (-2.3)×w + (-2.1)×t




                                             699
    Score = 13.2 + (-2.3)×w + (-2.1)×t
• So for each group we can calculate an
  expected value




                                        700
     b1 = -2.3, b2 = -2.1

W   T       Word   Test   Expected value
C   Recog    -1     -1    13.2 + (-2.3)×(-1) + (-2.1)×(-1) = 17.6
C   Recall   -1      1    13.2 + (-2.3)×(-1) + (-2.1)×(1)  = 13.4
A   Recog     1     -1    13.2 + (-2.3)×(1)  + (-2.1)×(-1) = 13.0
A   Recall    1      1    13.2 + (-2.3)×(1)  + (-2.1)×(1)  =  8.8
                                                        701
W   T       Word   Test   Expected   Actual value
C   Recog    -1     -1      17.6       15.4
C   Recall   -1      1      13.4       15.6
A   Recog     1     -1      13.0       15.2
A   Recall    1      1       8.8        6.6



• The exciting part comes when we look
  at the differences between the actual
  value and the value in the 2 IV model
                                          702
• Each difference = 2.2 (or –2.2)
• The value of b3 was –2.2
  – the interaction term is the correction
    required to the slope when the second IV
    is included




                                               703
• Examine the slope across test type

[Graph: mean score from Recog (-1) to Recall (1);
 gradient = (11.1 - 15.3) / 2 = -2.1]

                                           704
• Add the slopes for the two word groups

[Graph: score from Recog (-1) to Recall (1);
 both word groups together: slope -2.1;
 Abstract: (6.6 - 15.2) / 2 = -4.3;
 Concrete: (15.6 - 15.4) / 2 = 0.1]

                              Test Type                     705
b associated with interaction
• the change in slope, away from the
  average, associated with a 1 unit
  change in the moderating variable
OR
• Half the difference in the slopes




                                       706
• Another way to look at it
       Y = 13.2 + (-2.3)w + (-2.1)t + (-2.2)wt
• Examine concrete words group (w = -1)
  – substitute values into the equation

  Y(concrete) = 13.2 + (-2.3)×(-1) + (-2.1)×t + (-2.2)×(-1)×t
  Y(concrete) = 13.2 + 2.3 + (-2.1)×t + 2.2×t
  Y(concrete) = 15.5 + 0.1t

• The effect of changing test type for concrete words is 0.1 (the
  slope, which is half the actual difference)
                                            707
Why go to all that effort? Why not do
   ANOVA in the first place?
1. That is what ANOVA actually does
  •   if it can handle an unbalanced design (i.e.
      different numbers of people in each
      group)
  •   Helps to understand what can be done
      with ANOVA
  •   SPSS uses regression to do ANOVA
2. Helps to clarify more complex cases
  •   as we shall see

                                               708
Categorical x Continuous




                           709
    Note on Dichotomisation
• Very common to see people dichotomise
  a variable
  – Makes the analysis easier
  – Very bad idea
    • Paper B6




                                     710
                  Data
A chain of 60 supermarkets
• examining the relationship between
  profitability, shop size, and local
  competition
• 2 IVs
  – shop size
  – comp (local competition, 0=no, 1=yes)
• DV
  – profit
                                            711
• Data, 'lesson 12.2.sav'
    Shopsize   Comp       Profit
           4          1         23
          10          1         25
           7          0         19
          10          0          9
          10          1         18
          29          1         33
          12          0         17
           6          1         20
          14          0         21
          62          0          8
                                     712
              1st Analysis
Two IVs
• R2=0.367, df=2, 57, p < 0.001
• Unstandardised estimates
  – b1 (shopsize) = 0.083 (p=0.001)
  – b2 (comp) = 5.883 (p<0.001)
• Standardised estimates
  – b1 (shopsize) = 0.356
  – b2 (comp) = 0.448
                                      713
• Suspicions
  – Presence of competition is likely to have an
    effect
  – Residual plot shows a little
    heteroscedasticity
[Residual plot: standardised residuals against standardised
predicted values; the spread of the residuals increases slightly
across the range]
                                                           714
      Procedure for Testing
• Very similar to last time
  – convert 'comp' to effect coding
  – -1 = No competition
  – 1 = competition
  – Compute interaction term
     • comp (effect coded) x size
  – Hierarchical regression

                                      715
                  Result
• Unstandardised estimates
  – b1 (shopsize) = 0.071 (p=0.006)
  – b2 (comp) = -1.67 (p = 0.506)
  – b3 (sxc) = -0.050 (p=0.050)
• Standardised estimates
  – b1 (shopsize) = 0.306
  – b2 (comp) = -0.127
  – b3 (sxc) = -0.389
                                      716
• comp is now non-significant
  – but it obviously is important
  – shows the importance of the hierarchical approach




                                       717
             Interpretation
• Draw graph with lines of best fit
  – drawn automatically by SPSS
• Interpret equation by substitution of
  values
  – evaluate effects of
     • size
     • competition


                                          718
[Scatterplot: Profit against Shopsize (0-100), with lines of best
fit for Competition, No competition, and All Shops]
                                                               719
• Effects of size
  – in presence and absence of competition
  – (can ignore the constant)
  Y = 0.071×x1 + (-1.67)×x2 + (-0.050)×x1×x2
  – Competition present (x2 = 1)
  Y = 0.071×x1 + (-1.67)×1 + (-0.050)×x1×1
  Y = 0.071×x1 - 1.67 - 0.050×x1
  Y = 0.021×x1 + (-1.67)




                                              720
Y = 0.071×x1 + (-1.67)×x2 + (-0.050)×x1×x2
– Competition absent (x2 = -1)
Y = 0.071×x1 + (-1)×(-1.67) + (-0.050)×x1×(-1)
Y = 0.071×x1 + 0.050×x1 + 1.67
Y = 0.121×x1 (+ 1.67)




                                            721
Two Continuous Variables




                           722
                    Data
• Bank Employees
  – only using clerical staff
  – 363 cases
  – predicting starting salary
  – previous experience
  – age
  – age x experience

                                 723
 • Correlation matrix
   – only one significant

        LOGSB AGESTART PREVEXP
LOGSB       1.00 -0.09     0.08
AGESTART   -0.09  1.00     0.77
PREVEXP     0.08  0.77     1.00




                                  724
Initial Estimates (no moderator)
• (standardised)
  – R2 = 0.061, p<0.001
  – Age at start = -0.37, p<0.001
  – Previous experience = 0.36, p<0.001
• Suppressing each other
  – Age and experience compensate for one
    another
  – Older, with no experience, bad
  – Younger, with experience, good

                                            725
            The Procedure
• Very similar to previous
  – create multiplicative interaction term
  – BUT
• Need to eliminate effects of means
  – cause massive collinearity
• and SDs
  – cause one variable to dominate the
    interaction term
• By standardising
                                             726
• To standardise x,
  – subtract mean, and divide by SD
  – re-expresses x in terms of distance from
    the mean, in SDs
  – ie z-scores
• Hint: automatic in SPSS in Descriptives
• Create interaction term of age and exp
  – axe = z(age) × z(exp)
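In R, the standardising and the product term might look like this
(variable names hypothetical):

   zage <- as.numeric(scale(agestart))   # (x - mean(x)) / sd(x)
   zexp <- as.numeric(scale(prevexp))
   axe  <- zage * zexp                   # interaction term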



                                               727
• Hierarchical regression
  – two linear effects first
  – moderator effect in second
  – hint: it is often easier to interpret if
    standardised versions of all variables are
    used




                                                 728
• Change in R2
  – 0.085, p<0.001
• Estimates (standardised)
  – b1 (exp) = 0.104
  – b2 (agestart) = -0.54
  – b3 (age x exp) = -0.54




                             729
 Interpretation 1: Pick-a-Point
• Graph is tricky
  – can‟t have two continuous variables
  – Choose specific points (pick-a-point)
     • Graph the line of best fit of one variable at
       others
  – Two ways to pick a point
     • 1: Choose high (z = +1), medium (z = 0) and
       low (z = -1)
     • 2: Choose 'sensible' values – age 20, 50, 80?

                                                       730
• We know:
  – Y = e×0.10 + a×(-0.54) + a×e×(-0.54)
  – where a = agestart, and e = experience
• We can rewrite this as:
  – Y = (e×0.10) + (-0.54 + e×(-0.54))×a
  – taking a out of the brackets
• The bracketed terms are the simple intercept and the simple slope
  – intercept: (e×0.10)
  – slope: (-0.54 + e×(-0.54))
  – Y = intercept + slope×a
                                                     731
• Pick any value of e, and we know the slope
  for a
  – standardised, so it's easy
• e = -1
  – intercept = (-1×0.10) = -0.10
  – slope = (-0.54 + -1×(-0.54)) = 0.00
• e = 0
  – intercept = (0×0.10) = 0
  – slope = (-0.54 + 0×(-0.54)) = -0.54
• e = 1
  – intercept = (1×0.10) = 0.10
  – slope = (-0.54 + 1×(-0.54)) = -1.08

                                               732
Graph the Three Lines

[Graph: predicted log(salary) against age (standardised, -1 to +1),
one line each for e = -1 (flat), e = 0, and e = 1 (steepest negative
slope)]
                                                                                                                                                    733
Interpretation 2: P-Values and CIs

• Second way
  – Newer, rarely done
• Calculate CIs of the slope
  – At any point
• Calculate p-value
  – At any point
• Give ranges of significance

                                 734
        What do you need?
• The variances and covariances of the estimates
  – SPSS doesn't provide these for the intercept
  – need to do it manually
• In Options, exclude the intercept
  – create your own intercept variable – c = 1
  – use it as a predictor in the regression


                                         735
• Enter information into web page:
  – www.unc.edu/~preacher/interact/a
    cov.htm
  – (Again, may not be around for long)
• Get results
• Calculations in Bauer and Curran (in
  press: Multivariate Behavioral Research)
  – Paper B13



                                          736
[Graph: MLR 2-way interaction plot – Y (4.0 to 4.5) against X (-1 to
1), one line per conditional value of the moderator, CVz1(1) to
CVz1(3)]
                                                                  737
Areas of Significance

[Graph: the simple slope for age against experience (-4 to 4), with
confidence bands showing where the slope differs from zero]
                                                            738
• 2 complications
  – 1: Constant differed
  – 2: DV was logged, hence non-linear
    • effect of 1 unit depends on where the unit is
  – Can use SPSS to do graphs showing lines
    of best fit for different groups
  – See paper A2




                                                      739
Finally …




            740
      Unlimited Moderators
• Moderator effects are not limited to
  – 2 variables
  – linear effects




                                         741
  Three Interacting Variables
• Age, Sex, Exp
• Block 1
  – Age, Sex, Exp
• Block 2
  – Age x Sex, Age x Exp, Sex x Exp
• Block 3
  – Age x Sex x Exp

                                      742
• Results
  – All two way interactions significant
  – Three way not significant
  – Effect of Age depends on sex
  – Effect of experience depends on sex
  – Size of the age x experience interaction
    does not depend on sex (phew!)




                                               743
      Moderated Non-Linear
         Relationships

• Enter non-linear effect
• Enter non-linear effect x moderator
  – if significant indicates degree of non-
    linearity differs by moderator




                                              744
745
Modelling Counts: Poisson
       Regression
         Lesson 14




                            746
Counts and the Poisson Distribution

• Von Bortkiewicz (1898)
  – numbers of Prussian soldiers kicked to death by horses

    Deaths   Frequency
      0         109
      1          65
      2          22
      3           3
      4           1
      5           0

[Bar chart of the frequencies above]
                                                        747
• The data fitted a Poisson probability distribution
   – when counts of events occur, the Poisson distribution is
     common
   – E.g. papers published by researchers, police arrests,
     number of murders, ship accidents
• Common approach
   – Log transform and treat as normal
• Problems
   – Censored at 0
   – Integers only allowed
   – Heteroscedasticity


                                                        748
The Poisson Distribution

[Graph: Poisson probability functions for means 0.5, 1, 4, and 8 –
probability against count (0 to 17)]
                                                                                                     749
   p(y | x) = exp(-μ) μ^y / y!
                                750
   p(y | x) = exp(-μ) μ^y / y!

• Where:
  – y is the count
  – μ is the mean of the Poisson distribution
• In a Poisson distribution
  – the mean = the variance (hence the heteroscedasticity issue)
  – μ = σ²
                                                    751
  Poisson Regression in SPSS
• Not directly available
   – SPSS can be tweaked to do it in two ways:
   – General loglinear model (genlog)
   – Non-linear regression (CNLR)
      • bootstrapped p-values only
   – both are quite tricky
• SPSS 15 adds generalized linear models (GENLIN),
  which can fit Poisson regression directly

                                                   752
Example Using Genlog

• Number of shark bites on different colour surfboards
  – 100 surfboards, 50 red, 50 blue
• Weight cases by bites
• Analyse, Loglinear, General
  – colour is factor

[Bar chart: frequency of boards with 0-4 bites, separate bars for
blue and red]
                                                              753
               Results
Correspondence Between Parameters and
  Terms of the Design
Parameter   Aliased Term

1    Constant
2    [COLOUR = 1]
3 x [COLOUR = 2]
Note: 'x' indicates an aliased (or a
  redundant) parameter. These parameters
  are set to zero.

                                        754
                                         Asymptotic 95% CI
Param      Est.      SE        Z-value     Lower    Upper
1          4.1190    .1275     32.30       3.87     4.37
2          -.5495    .2108     -2.61       -.96     -.14
3          .0000     .         .           .        .
    • Note: Intercept
      (param 1) is curious
    • Param 2 is the
      difference in the
      means
                                                 755
 SPSS: Continuous Predictors
• Bleedin' nightmare
• http://www.spss.com/tech/answer/deta
  ils.cfm?tech_tan_id=100006204




                                    756
  Poisson Regression in Stata
• SPSS will save a Stata file
• Open it in Stata
• Statistics, Count outcomes, Poisson
  regression




                                        757
     Poisson Regression in R
• R is a freeware program
  – Similar to SPlus
  – www.r-project.org
• Steep learning curve to start with
• Much nicer to do Poisson (and other) regression
  analysis
http://www.stat.lsa.umich.edu/~faraway/book
  /
http://www.jeremymiles.co.uk/regressionbook
  /extras/appendix2/R/
                                             758
• Commands in R
• Stage 1: enter data
  – colour <- c(1, 0, 1, 0, 1, 0 … 1)
  – bites <- c(3, 1, 0, 0, … )
• Run analysis
  – p1 <- glm(bites ~ colour, family
   = poisson)
• Get results
  – summary.glm(p1)


                                   759
                   R Results
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3567      0.1686 -2.115 0.03441 *
colour        0.5555     0.2116   2.625 0.00866 **


• Results for colour
  – Same as SPSS
  – For intercept different (weird SPSS)

                                                 760
          Predicted Values
• Need to get exponential of parameter
  estimates
  – Like logistic regression
• Exp(0.555) = 1.74
  – You are likely to be bitten by a shark 1.74
    times more often with a red surfboard
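In R, the rate ratios and their (Wald) CIs come straight from the
fitted model:

   exp(coef(p1))             # exp(0.5555) = 1.74 for colour
   exp(confint.default(p1))  # CIs on the rate-ratio scale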



                                              761
Checking Assumptions

• Was it really Poisson distributed?
  – for Poisson, μ = σ²
     • as the mean increases, the variance should also increase
  – residuals should be random
     • overdispersion is a common problem
     • too many zeroes
• For blue: μ = σ² = exp(-0.3567) = 0.70
• For red: μ = σ² = exp(-0.3567 + 0.555) = 1.22
                                                              762
   p(y | x) = exp(-μ) μ^y / y!

• Strictly:

   p(yi | xi) = exp(-μ̂i) μ̂i^yi / yi!
                                  763
Compare Predicted with Actual Distributions

[Two bar charts: expected vs. actual probability of 0-4 bites, one
panel for blue boards and one for red]
                                                                                                       764
Overdispersion

• Problem in Poisson regression
  – too many zeroes
• Consequences
  – χ² inflation
  – standard error deflation
     • hence p-values too low
  – higher type I error rate
• Solution
  – negative binomial regression
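A minimal sketch of the negative binomial alternative in R, reusing
the shark-bite variables from earlier (MASS ships with R):

   library(MASS)
   nb1 <- glm.nb(bites ~ colour)   # adds a dispersion parameter
   summary(nb1)                    # compare SEs with the Poisson fit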

                                   765
                 Using R
• R can read an SPSS file
  – But you have to ask it nicely
• Click Packages menu, Load package,
  choose “Foreign”
• Click File, Change Dir
  – Change to the folder that contains your
    data

                                              766
                   More on R
• R uses objects
  – To place something into an object use <-
  – X <- Y
     • Puts Y into X
• Function is read.spss()
  – Mydata <- read.spss(“spssfilename.sav”)
• Variables are then referred to as
  Mydata$VAR1
  – Note 1: R is case sensitive
  – Note 2: SPSS variable name in capitals
                                               767
                 GLM in R
• Command
  – glm(outcome ~ pred1 + pred2 + … +
    predk [,family = familyname])
  – If no familyname, default is OLS
    • Use binomial for logistic, poisson for poisson
• Output is a GLM object
  – You need to give this a name
  – my1stglm <- glm(outcome ~ pred1 +
    pred2 + … + predk [,family =
    familyname])                                       768
• Then need to explore the result
  – summary(my1stglm)
• To explore what it means
  – Need to plot regressions
     • Easiest is to use Excel




                                    769
770
Introducing Structural
 Equation Modelling
       Lesson 15




                         771
             Introduction
• Related to regression analysis
  – All (OLS) regression can be considered as a
    special case of SEM
• Power comes from adding restrictions
  to the model
• SEM is a system of equations
  – Estimate those equations

                                             772
        Regression as SEM
• Grades example
  – Grade = constant + books + attend +
    error
    • Looks like a regression equation
  – Also
  – Books correlated with attend
  – Explicit modelling of error


                                          773
           Path Diagram
• A system of equations is usefully
  represented in a path diagram

    x  (box)    measured variable
    e  (oval)   unmeasured variable
    ──►         regression
    ◄──►        correlation
                                                              774
Path Diagram for Regression
  books  ──► grade ◄── error    (the error must usually be
  attend ──► grade               modelled explicitly)
  books ◄──► attend             (the correlation must be
                                 modelled explicitly)
                                                775
                     Results
• Unstandardised

  books (variance 2.00)   ──4.04──► grade ◄──13.52── e (variance 1.00)
  attend (variance 17.84) ──1.28──► grade
  books ◄──2.65──► attend (covariance)
                                                       776
Standardised

  books  ──.35──► grade ◄──.82── e
  attend ──.33──► grade
  books ◄──.44──► attend
                                       777
Table

                     Estimate   S.E.   C.R.   P      St. Est.
GRADE <-- BOOKS       4.04      1.71   2.36   0.02     0.35
GRADE <-- ATTEND      1.28      0.57   2.25   0.03     0.33
GRADE <-- e          13.52      1.53   8.83   0.00     0.82
GRADE (constant)     37.38      7.54   4.96   0.00

Coefficients (a)

                  Unstandardized          Standardized
Model             B         Std. Error    Beta      Sig.
1   (Constant)    37.38     7.74                     .00
    BOOKS          4.04     1.75          .35        .03
    ATTEND         1.28      .59          .33        .04

a. Dependent Variable: GRADE                                  778
    So What Was the Point?
• Regression is a special case
• Lots of other cases
• Power of SEM
  – Power to add restrictions to the model
• Restrict parameters
  – To zero
  – To the value of other parameters
  – To 1
                                             779
              Restrictions
• Questions
  – Is a parameter really necessary?
  – Are a set of parameters necessary?
  – Are parameters equal?
• Each restriction adds 1 df
  – Test of model with χ²


                                         780
               The χ² Test
• Can the model proposed have
  generated the data?
  – Test of significance of difference of model
    and data
  – Statistically significant result
    • Bad
  – Theoretically driven
    • Start with model
     • Don't start with data

                                              781
         Regression Again
  books  ──0──► grade ◄──1── e (mean 0, variance 1)
  attend ──0──► grade

• Both regression estimates restricted to zero


                                      782
• Two restrictions
  – 2 df for the χ² test
  – χ² = 15.9, p = 0.0003
• This test is (asymptotically) equivalent
  to the F test in regression
  – We still haven‟t got any further




                                             783
Multivariate Regression

            y1
 x1
            y2

 x2
            y3




                          784
     Test of all x’s on all y’s
      (6 restrictions = 6 df)



[Diagram: x1 and x2, with all six paths to y1, y2, and y3
restricted to zero]



                                  785
     Test of all x1 on all y’s
         (3 restrictions)



[Diagram: the three paths from x1 to y1, y2, and y3 restricted to
zero]



                                 786
     Test of all x1 on all y1
        (3 restrictions)



[Diagram: as above, with three paths restricted to zero]



                                787
Test of all 3 partial correlations between
          y’s, controlling for x’s
              (3 restrictions)


[Diagram: the three correlations among the y residuals, controlling
for the x's, restricted to zero]



                                             788
      Path Analysis and SEM
• More complex models – can add more restrictions
  – e.g. the mediator model
• 1 restriction
  – no path from enjoy -> read

    enjoy ──► buy ──► read
              ▲ 1      ▲ 1
            e_buy    e_read
                                         789
                  Result
• χ² = 10.9, 1 df, p = 0.001
• Not a complete mediator
  – Additional path is required




                                  790
           Multiple Groups
• Same model
  – Different people
• Equality constraints between groups
  – Means, correlations, variances, regression
    estimates
  – E.g. males and females



                                             791
    Multiple Groups Example
• Age
• Severity of psoriasis
  – SEVE – in emotional areas
     • hands, face, forearm
  – SEVNONE – in non-emotional areas
• Anxiety
• Depression

                                       792
Correlations (SEX = f, N = 110): r with 2-tailed p in parentheses

           AGE            SEVE           SEVNONE        GHQ_A
SEVE       -.270 (.004)
SEVNONE    -.248 (.009)   .665 (.000)
GHQ_A       .017 (.859)   .045 (.639)    .109 (.255)
GHQ_D       .035 (.717)   .075 (.436)    .096 (.316)    .782 (.000)

                                                                                        793
Correlations (SEX = m, N = 79): r with 2-tailed p in parentheses

           AGE            SEVE           SEVNONE        GHQ_A
SEVE       -.243 (.031)
SEVNONE    -.116 (.310)   .671 (.000)
GHQ_A      -.195 (.085)   .456 (.000)    .210 (.063)
GHQ_D      -.190 (.094)   .453 (.000)    .232 (.040)    .800 (.000)

                                                                                    794
Model

  AGE ──► SEVE and SEVNONE
          (each with an error term: e_s ─1─► SEVE, e_sn ─1─► SEVNONE)
  SEVE and SEVNONE ──► Dep and Anx
          (each with an error term: E_d ─1─► Dep, e_a ─1─► Anx)
                                    795
Females

[Path diagram: AGE → SEVE -.27, AGE → SEVNONE -.25; the severity →
anxiety/depression paths are all small (.07, .04, .03, .09, -.04,
.15); SEVE ↔ SEVNONE .64; Dep ↔ Anx .78; residual paths .96, .97,
.99, .99]
                                                                    796
Males

[Path diagram: AGE → SEVE -.24, AGE → SEVNONE -.12; the severity →
anxiety/depression paths from SEVE are substantial (.52, .55), the
others small and negative (-.08, -.08, -.12, -.17); SEVE ↔ SEVNONE
.67; Dep ↔ Anx .74; residual paths .97, .99, .88, .88]
                                                                797
               Constraint
• sevnone -> dep
  – Constrained to be equal for males and
    females
• 1 restriction, 1 df
  – χ² = 1.3 – not significant
• 4 restrictions
  – both severity measures -> anx & dep
                                            798
• 4 restrictions, 4 df
  – χ² = 1.3, p = 0.014
• Parameters are not equal




                             799
Missing Data: The big advantage

• SEM programs tend to deal with missing
  data
  – Multiple imputation
  – Full Information (Direct) Maximum
    Likelihood
    • Asymptotically equivalent
• Data can be MAR, not just MCAR

                                        800
 Power: A Smaller Advantage
• Power for regression gets tricky with
  large models
• With SEM power is (relatively) easy
  – It's all based on chi-square
  – Paper B14




                                          801
Lesson 16: Dealing with clustered
   data & longitudinal models




                               802
          The Independence
             Assumption
• In Lesson 8 we talked about independence
  – The residual of any one case should not tell you
    about the residual of any other case
• Particularly problematic when:
  – Data are clustered on the predictor variable
      • e.g. predictor is household size, cases are members of a
        family
      • e.g. predictor is doctor training, cases are the doctor's
        patients
  – Data are longitudinal
     • Have people measured over time
          – It's the same person!
                                                                   803
         Clusters of Cases
• Problem with cluster (group)
  randomised studies
  – Or group effects
• Use Huber-White sandwich estimator
  – Tell it about the groups
  – Correction is made
  – Use complex samples in SPSS

                                       804
           Complex Samples
• As with Huber-White for heteroscedasticity
  – Add a variable that tells it about the clusters
  – Put it into clusters
• Run GLM
  – As before
• Warning:
  – Need about 20 clusters for solutions to be stable
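R offers the same kind of correction via the sandwich package (a
sketch; the data frame d and the cluster variable week are
hypothetical):

   library(sandwich); library(lmtest)
   m <- lm(cost ~ triage, data = d)
   coeftest(m, vcov. = vcovCL(m, cluster = ~ week))  # cluster-robust SEs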



                                                        805
                    Example
• People randomised by week to one of two
  forms of triage
   – Compare the total cost of treating each
• Ignore clustering
   – Difference is £2.40 per person, with 95%
     confidence intervals £0.58 to £4.22, p =0.010
• Include clustering
   – difference is still £2.40, with 95% CIs -£0.85 to £5.65, and
     p = 0.141
• Ignoring the clustering would have led to a type I error
                                                          806
      Longitudinal Research
• For comparing repeated measures
  – clusters are people
  – can model the repeated measures over time
• Data are usually short and fat

    ID   V1   V2   V3   V4
    1    2    3    4    7
    2    3    6    8    4
    3    2    5    7    5

                                              807
           Converting Data
• Change data to tall and thin
• Use Data, Restructure in SPSS
• Clusters are ID

    ID   V   X
    1    1   2
    1    2   3
    1    3   4
    1    4   7
    2    1   3
    2    2   6
    2    3   8
    2    4   4
    3    1   2
    3    2   5
    3    3   7
    3    4   5
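The same restructure in base R (using the small data set above):

   wide <- data.frame(id = 1:3,
                      v1 = c(2, 3, 2), v2 = c(3, 6, 5),
                      v3 = c(4, 8, 7), v4 = c(7, 4, 5))
   long <- reshape(wide, direction = "long",
                   varying = c("v1", "v2", "v3", "v4"),
                   v.names = "x", timevar = "v", idvar = "id")
   long[order(long$id), ]   # tall and thin, as in the table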

                                     808
         (Simple) Example
• Use employee data.sav
  – Compare beginning salary and salary
  – Would normally use paired samples t-test
• Difference = $17,403, 95% CIs
  $16,427.407, $18,379.555



                                               809
       Restructure the Data
• Do it again
  – With data tall and thin
• Complex GLM with Time as factor
  – ID as cluster
• Difference = $17,403, 95% CIs =
  $16,427.407, $18,379.555

 ID   Time   Cash
 1    1      $18,750
 1    2      $21,450
 2    1      $12,000
 2    2      $21,900
 3    1      $13,200
 3    2      $45,000

                                                 810
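The same Complex Samples pattern as in the triage sketch might do it here (again hedged: 'employee.csaplan' and wt are hypothetical names; cash, time and id are the variables from the restructured file):

CSPLAN ANALYSIS
  /PLAN FILE='employee.csaplan'
  /PLANVARS ANALYSISWEIGHT=wt
  /DESIGN CLUSTER=id
  /ESTIMATOR TYPE=WR.
CSGLM cash BY time
  /PLAN FILE='employee.csaplan'
  /MODEL time.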
            Interesting …
• That wasn't very interesting
  – What is more interesting is when we have
    multiple measurements of the same people
• Can plot and assess trajectories over
  time



                                          811
Single Person Trajectory

[Figure: one person's outcome plotted against time, rising across occasions]
                             812
Multiple Trajectories: What's the
         Mean and SD?

[Figure: several individual trajectories plotted against time]
                                813
      Complex Trajectories
• An event occurs
  – Can have two effects:
     • A jump in the value
     • A change in the slope
• The event doesn't have to happen at the
  same time for each person
  – Doesn't have to happen at all

                                        814
[Figure: outcome rises along slope 1 until the event occurs, then jumps
down and continues along a different slope 2]
                                          815
        Parameterising
Time   Event   Time2   Outcome
  1      0       0        12
  2      0       0        13
  3      0       0        14
  4      0       0        15
  5      0       0        16
  6      1       0        10
  7      1       1         9
  8      1       2         8
  9      1       3        7
                                        816
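A hedged MIXED sketch of this parameterisation (outcome and id are assumed names; whether event and time2 also get random effects is a substantive choice):

MIXED outcome with time event time2
  /fixed = time event time2
  /random = intercept time | subject(id) covtype(un)
  /print = solution.

Here event carries the jump and time2 (time since the event) carries the change in slope.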
 Draw the Line




What are the parameter estimates?
                                    817
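For the table on slide 816 the fit is exact, so the estimates can be read off (writing the model as Outcome = b0 + b1*Time + b2*Event + b3*Time2):
• b0 = 11, b1 = 1: the pre-event values 12, 13, ..., 16 lie on 11 + Time
• b2 = -7: at time 6 the old line predicts 17, but the value is 10
• b3 = -2: so slope 2 = b1 + b3 = -1
Check at time 8: 11 + 8 - 7 - 2*2 = 8.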
 Main Effects and Interactions


• Main effects
  – Intercept differences
• Moderator effects
  – Slope differences


                             818
           Multilevel Models
• Fixed versus random effects
  – Fixed effects are fixed across individuals
    (or clusters)
  – Random effects have variance
• Levels
  – Level 1 – individual measurement
    occasions
  – Level 2 – higher order clusters
                                                 819
           More on Levels
• NHS direct study
  – Level 1 units: …………….
  – Level 2 units: ……………
• Widowhood food study
  – Level 1 units ……………
  – Level 2 units ……………




                            820
           More Flexibility
• Three levels:
  – Level 1: measurements
  – Level 2: people
  – Level 3: schools




                              821
               More Effects
• Variances and covariances of effects
• Level 1 and level 2 residuals
  – Makes R² difficult to talk about
• Outcome variable
  – Yij
     • The score of the ith person in the jth group



                                                      822
 Y    i   j
2.3   1   1
3.2   2   1
4.5   3   1
4.8   1   2
7.2   2   2
3.1   3   2
1.6   4   2



              823
                Notation
• Notation gets a bit horrid
  – Varies a lot between books and programs
• We used to have b0 and b1
  – If fixed, that's fine
  – If random, each person has their own
    intercept and slope



                                              824
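One standard way to write it (notation varies between books) is a level 1 model for occasions within people and level 2 models for each person's intercept and slope:

Level 1:  \(Y_{ij} = \beta_{0j} + \beta_{1j}\,time_{ij} + \varepsilon_{ij}\)
Level 2:  \(\beta_{0j} = \gamma_{00} + u_{0j}\)
          \(\beta_{1j} = \gamma_{10} + u_{1j}\)

The \(\gamma\)s are the fixed effects; the \(u\)s are the random effects, and their variances and covariance are what the program estimates.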
           Standard Errors
• Intercept has standard errors
• Slopes have standard errors
• Random effects have variances
  – Those variances have standard errors
    • Is there statistically significant variation
      between higher level units (people)?
    • OR
    • Is everyone the same?

                                                     825
                Programs
• Since version 12
  – Can do this in SPSS
  – Can't do anything really clever
• Menus
  – Completely unusable
  – Have to use syntax


                                      826
           SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject
  (id)   covtype(un)
• /print = solution.

                                       827
               SPSS Syntax
• MIXED
• relfd with time

     (relfd is the outcome; time is a continuous predictor)




                                 828
           SPSS Syntax
• MIXED
• relfd with time
• /fixed = time


              Must specify effect as
                    fixed first




                                       829
               SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject
  (id) covtype(un)

     Specify the random effects: the intercept and time
     are random. SPSS assumes that your level 2 units
     are subjects, and needs to know the id variable.
                                                           830
           SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject
  (id) covtype(un)
                     Covariance matrix of random
                         effects is unstructured.
                    (Alternatives are id, identity, and vc,
                        variance components.)
                                                          831
           SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject
  (id) covtype(un)
• /print = solution.
                           Print the answer


                                              832
                        The Output
• Information criteria
  – We‟ll come back
                Information Criteria(a)
 -2 Restricted Log Likelihood             64899.758
 Akaike's Information Criterion (AIC)     64907.758
 Hurvich and Tsai's Criterion (AICC)      64907.763
 Bozdogan's Criterion (CAIC)              64940.134
 Schwarz's Bayesian Criterion (BIC)       64936.134
 The information criteria are displayed in smaller-is-better forms.
 a. Dependent Variable: relfd.
                                                                         833
                      Fixed Effects
• Not useful here, useful for interactions
              Type III Tests of Fixed Effects(a)
 Source      Numerator df   Denominator df        F      Sig.
 Intercept              1              741   3251.877    .000
 time                   1              741      2.550    .111
 a. Dependent Variable: relfd.




                                                                 834
            Estimates of Fixed Effects
   • Interpreted as regression equation
                    Estimates of Fixed Effects(a)
                                                      95% Confidence Interval
 Parameter   Estimate   Std. Error    df      t      Sig.    Lower     Upper
 Intercept     21.90       .38       741    57.025   .000    21.15     22.66
 time           -.06       .04       741    -1.597   .111     -.14       .01
 a. Dependent Variable: relfd.
                                                                              835
     Covariance Parameters
         Estimates of Covariance Parameters(a)
 Parameter                        Estimate    Std. Error
 Residual                         64.11577    1.0526353
 Intercept + time   UN (1,1)      85.16791    5.7003732
 [subject = id]     UN (2,1)      -4.53179     .5067146
                    UN (2,2)       .7678319    .0636116
 a. Dependent Variable: relfd.


                                                    836
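A crude Wald check divides each estimate by its standard error: 85.16791/5.7003732 ≈ 14.9 for the intercept variance, 0.7678319/0.0636116 ≈ 12.1 for the slope variance, and -4.53179/0.5067146 ≈ -8.9 for their covariance, all far beyond ±1.96. So people differ significantly in both intercepts and slopes. (Likelihood-ratio comparisons are usually preferred for variances, which sit near a boundary under the null.)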
       Change Covtype to VC
• We know that this is wrong
  – The covariance of the effects was statistically
    significant
  – Can also see if it was wrong by comparing
    information criteria
• We have removed a parameter from the
  model
  – Model is worse
  – Model is more parsimonious
     • Is it much worse, given the increase in parsimony?
                                                     837
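The only change to the syntax is the covtype keyword (same model, with the intercept-slope covariance constrained to zero):

MIXED relfd with time
  /fixed = time
  /random = intercept time | subject(id) covtype(vc)
  /print = solution.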
 Information Criteria(a)                  UN Model     VC Model
 -2 Restricted Log Likelihood             64899.758    65041.891
 Akaike's Information Criterion (AIC)     64907.758    65047.891
 Hurvich and Tsai's Criterion (AICC)      64907.763    65047.894
 Bozdogan's Criterion (CAIC)              64940.134    65072.173
 Schwarz's Bayesian Criterion (BIC)       64936.134    65069.173
 The information criteria are displayed in smaller-is-better forms.
 a. Dependent Variable: relfd.
                Lower is better.
                                                                                  838
                Adding Bits
• So far, all a bit dull
• We want some more predictors, to make it
  more exciting
  – E.g. female
  – Add:
  relfd with time female
  /fixed = time female time*female
• What does the interaction term represent?

                                              839
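Put together (a sketch, assuming female is coded 0/1 in the file):

MIXED relfd with time female
  /fixed = time female time*female
  /random = intercept time | subject(id) covtype(un)
  /print = solution.

The female term is the intercept difference between the sexes; time*female is the moderator effect from slide 818: the difference in slope over time between males and females.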
           Extending Models
• Models can be extended
  – Any kind of regression can be used
     • Logistic, multinomial, Poisson, etc
  – More levels
     • Children within classes within schools
     • Measures within people within classes within prisons
  – Multiple membership / cross classified models
     • Children within households and classes, but households
       not nested within class
• Need a different program
  – E.g. MLwiN
                                                              840
MLwiN Example (very quickly)




                           841
                  Books
Singer, JD and Willett, JB (2003). Applied
 Longitudinal Data Analysis: Modeling Change
 and Event Occurrence. Oxford, Oxford
 University Press.
Examples at:
http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm



                                             842
The End




          843

								