# Theory of Regression


1
The Course
• 16 (or so) lessons
– Some flexibility
• Depends how we feel
• What we get through

2
Part I: Theory of Regression
1. Models in statistics
2. Models with more than one parameter:
regression
3. Why regression?
4. Samples to populations
5. Introducing multiple regression
6. More on multiple regression

3
Part 2: Application of regression
7.    Categorical predictor variables
8.    Assumptions in regression analysis
9.    Issues in regression analysis
10.   Non-linear regression
11.   Moderators (interactions) in regression
12.   Mediation and path analysis
Part 3: Advanced Types of Regression
13.   Logistic Regression
14.   Poisson Regression
15.   Introducing SEM
16.   Introducing longitudinal multilevel models
4
House Rules
• Jeremy must remember
– Not to talk too fast
• If you don't understand
– Ask. Any time
• If you think I'm wrong
– Ask. (I'm not always right)

5
Learning New Techniques
• Best kind of data to learn a new technique
– Data that you know well, and understand
– In computer labs (esp later on)
– Use your own data if you like
• My data
– I'll provide you with
– Simple examples, small sample sizes
• Conceptually simple (even silly)
6
Computer Programs
• SPSS
– Mostly
• Excel
– For calculations
•   GPower
•   Stata (if you like)
•   R (because it's flexible and free)
•   Mplus (SEM, ML?)
•   AMOS (if you like)
7
8
9
Lesson 1: Models in statistics

Models, parsimony, error, mean,
OLS estimators

10
What is a Model?

11
What is a model?
• Representation
– Of reality
– Not reality
• Model aeroplane represents a real
aeroplane
– If model aeroplane = real aeroplane, it
isn't a model

12
– Representing and simplifying
• Sifting
– What is important from what is not
important
• Parsimony
– In statistical models we seek parsimony
– Parsimony  simplicity

13
Parsimony in Science
• A model should be:
– 1: able to explain a lot
– 2: use as few concepts as possible
• More it explains
– The more you get
• Fewer concepts
– The lower the price
• Is it worth paying a higher price for a better
model?

14
A Simple Model

• Height of five individuals
– 1.40m
– 1.55m
– 1.80m
– 1.62m
– 1.63m
• These are our DATA
15
A Little Notation
Y    The (vector of) data that we are modelling

Yi   The ith observation in our data

Y = (4, 5, 6, 7, 8)
Y2 = 5
16
Greek letters represent the true
value in the population.

β    (Beta) Parameters in our model
(population value)

β0   The value of the first parameter of our
model in the population.

βj   The value of the jth parameter of our
model, in the population.

ε    (Epsilon) The error in the population
model.
17
Roman letters represent the values in our
sample. These are sample statistics, which are
used to estimate population parameters.

b    A parameter in our model (sample
statistic)

e    The error in our sample.

Y    The data in our sample which we are
trying to model.

18
Symbols on top change the meaning.

Y    The data in our sample which we are
trying to model (repeated).

Ŷi   The estimated value of Y, for the ith
case.

Ȳ    The mean of Y.
19
So b1 = β̂1

I will use b1 (because it is easier to type).

20
• Not always that simple
– some texts and computer programs use

b = the parameter estimate (as we have
used)
β (beta) = the standardised parameter
estimate
SPSS does this.

21
A capital letter is the set (vector) of
parameters/statistics

B       Set of all parameters (b0, b1, b2, b3 … bp)

Rules are not used very consistently (even by
me).
Don't assume you know what someone means,
without checking.

22
• We want a model
– To represent those data
• Model 1:
– 1.40m, 1.55m, 1.80m, 1.62m, 1.63m
– Not a model
• A copy
– VERY unparsimonious
• Data: 5 statistics
• Model: 5 statistics
– No improvement
23
• Model 2:
– The mean (arithmetic mean)
– A one parameter model

Ŷi = b0 = Ȳ = Σ Yi / n
24
• Which, because we are lazy, can be
written as

Ȳ = ΣY / n
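A quick check of this one-parameter model, as a sketch in Python (the heights are the five data values from the earlier slide):

```python
# The five heights (our DATA) from the earlier slide.
heights = [1.40, 1.55, 1.80, 1.62, 1.63]

# The one-parameter model: b0 = the arithmetic mean.
b0 = sum(heights) / len(heights)
print(b0)  # 1.60
```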

25
The Mean as a Model

26
The (Arithmetic) Mean
• We all know the mean
– The 'average'
– Learned about it at school
– Forget (didn't know) about how clever the mean is
• The mean is:
– An Ordinary Least Squares (OLS) estimator
– Best Linear Unbiased Estimator (BLUE)

27
Mean as OLS Estimator
• Going back a step or two
• MODEL was a representation of DATA
– We said we want a model that explains a lot
– How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA - MODEL
– We want a model with as little ERROR as possible

28
• What is error?

Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03

29
• How can we calculate the 'amount' of
error?
• Sum of errors

ERROR = Σei = Σ(Yi − Ŷi) = Σ(Yi − b0)
      = (−0.20) + (−0.05) + 0.20 + 0.02 + 0.03
      = 0
30
– 0 implies no ERROR
• Not the case
– Knowledge about ERROR is useful
• As we shall see later

31
• Sum of absolute errors
– Ignore signs

ERROR = Σ|ei| = Σ|Yi − Ŷi| = Σ|Yi − b0|
      = 0.20 + 0.05 + 0.20 + 0.02 + 0.03
      = 0.50
32
• Are small and large errors equivalent?
– One error of 4
– Four errors of 1
– The same?
– What happens with different data?
• Y = (2, 2, 5)
– b0 = 2
– Not very representative
• Y = (2, 2, 4, 4)
– b0 = any value from 2 - 4
– Indeterminate
• There are an infinite number of solutions which would satisfy
our criteria for minimum error
33
• Sum of squared errors (SSE)

ERROR = Σei² = Σ(Yi − Ŷi)² = Σ(Yi − b0)²
      = (−0.20)² + (−0.05)² + 0.20² + 0.02² + 0.03²
      = 0.08

34
• Determinate
• If we minimise SSE
– Get the mean
• Shown in graph
– SSE plotted against b0
– Min value of SSE occurs when
– b0 = mean
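The claim can also be checked by brute force, a sketch: sweep candidate values of b0 and keep the one with the smallest SSE.

```python
heights = [1.40, 1.55, 1.80, 1.62, 1.63]

def sse(b0):
    # Sum of squared errors for a one-parameter model b0.
    return sum((y - b0) ** 2 for y in heights)

# Sweep candidate values of b0 from 1.00 to 2.00 in steps of 0.01.
candidates = [1.00 + i * 0.01 for i in range(101)]
best = min(candidates, key=sse)
print(best)  # 1.60 – the arithmetic mean
```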

35
[Figure: SSE plotted against candidate values of b0 from 1.0 to 2.0; the minimum SSE occurs at b0 = 1.60, the mean]

36
The Mean as an OLS Estimate

37
Mean as OLS Estimate
• The mean is an Ordinary Least Squares
(OLS) estimate
– As are lots of other things
• This is exciting because
– OLS estimators are BLUE
– Best Linear Unbiased Estimators
– Proven with Gauss-Markov Theorem
• Which we won't worry about

38
BLUE Estimators
• Best
– Minimum variance (of all possible unbiased
estimators)
– Narrower distribution than other estimators
• e.g. median, mode
• Linear
– Linear predictions
– For the mean: Ŷ = Ȳ
– A linear (straight, flat) line

39
• Unbiased
– Centred around true (population) values
– Expected value = population value
– Minimum is biased.
• Minimum in samples > minimum in population
• Estimators
– Errrmm… they are estimators
• Also consistent
– Sample approaches infinity, get closer to
population values
– Variance shrinks

40
SSE and the Standard
Deviation
• Tying up a loose end

SSE = Σ(Yi − Ŷi)²

s = √[ Σ(Yi − Ŷi)² / n ]

σ̂ = √[ Σ(Yi − Ŷi)² / (n − 1) ]
41
• SSE closely related to SD
• Sample standard deviation – s
– Biased estimator of population SD
• Population standard deviation – σ
– Need to know the mean to calculate SD
• Reduces N by 1
• Hence divide by N − 1, not N
– Like losing one df

42
Proof
• That the mean minimises SSE
– Not that difficult
– As statistical proofs go
• Available in
– Maxwell and Delaney – Designing
experiments and analysing data
– Judd and McClelland – Data Analysis (out
of print?)
43
What's a df?
• The number of parameters free to vary
– When one is fixed
• Term comes from engineering
– Movement available to structures

44
0 df: no variation available
1 df: fix one corner, and the shape is fixed
45
Back to the Data
• Mean has 5 (N) df
– 1st moment
•  has N –1 df
– Mean has been fixed
– 2nd moment
– Can think of as amount cases vary away
from the mean

46
While we are at it …
• Skewness has N – 2 df
– 3rd moment
• Kurtosis has N – 3 df
– 4rd moment
– Amount cases vary from 

47
Parsimony and df
• Number of df remaining
– Measure of parsimony
• Model which contained all the data
– Has 0 df
– Not a parsimonious model
• Normal distribution
– Can be described in terms of mean and σ
• 2 parameters
– (z with 0 parameters)
48
Summary of Lesson 1
• Statistics is about modelling DATA
– Models have parameters
– Fewer parameters, more parsimony, better
• Models need to minimise ERROR
– Best model, least ERROR
– Depends on how we define ERROR
– If we define error as sum of squared deviations
from predicted value
– Mean is best MODEL
49
50
51
Lesson 2: Models with one
more parameter - regression

52
In Lesson 1 we said …
• Use a model to predict and describe
data
– Mean is a simple, one parameter model

53
More Models

Slopes and Intercepts

54
More Models
• The mean is OK
– As far as it goes
– It just doesn't go very far
– Very simple prediction, uses very little
information
55
House Prices
• In the UK, two of the largest lenders
(Halifax and Nationwide) compile house
price indices
– Predict the price of a house
– Examine effect of different circumstances
• Look at change in prices
– Guides legislation
• E.g. interest rates, town planning

56
Predicting House Prices
Beds   £ (000s)
1      77
2      74
1      88
3      62
5      90
5      136
2      35
5      134
4      138
1      55
57
One Parameter Model
• The mean

Ȳ = 88.9
Ŷ = b0 = Ȳ
SSE = 11806.9

"How much is that house worth?"
"£88,900"
• Uses 1 df to say that
58
• We have df to spare
– We might as well use it
– Add a linear function of number of
bedrooms (x1)

Ŷ = b0 + b1x1
59
Alternative Expression

• Estimate of Y (expected value of Y)

Ŷ = b0 + b1x1

• Value of Y

Yi = b0 + b1xi1 + ei
60
Estimating the Model
• We can estimate this model in four different,
equivalent ways
– Provides more than one way of thinking about it
1.   Estimating the slope which minimises SSE
2.   Examining the proportional reduction in SSE
3.   Calculating the covariance
4.   Looking at the efficiency of the predictions

61
Estimate the Slope to Minimise
SSE

62
Estimate the Slope
• Stage 1
– Draw a scatterplot
– x-axis at mean
• Not at zero
• Mark errors on it
– Called 'residuals'
– Sum and square these to find SSE

63
[Scatterplot: price (£000s) against number of bedrooms, with a horizontal line at the mean (88.9); the vertical distances from the points to the line are the residuals]
• Add another slope to the chart
– Redraw residuals
– Recalculate SSE
– Move the line around to find slope which
minimises SSE
• Find the slope

66
• First attempt:

67
• Any straight line can be defined with
two parameters
– The location (height) of the slope
• b0
– Sometimes called a
– The gradient of the slope
• b1

68

[Figure: gradient – for each 1 unit increase in x, the line rises b1 units]

69
[Figure: height – the line sits b0 units up the y-axis]

70
• Height
• If we fix slope to zero
– Height becomes mean
– Hence mean is b0
• Height is defined as the point that the
slope hits the y-axis
– The constant
– The y-intercept

71
• Why the constant?
– b0 × x0, where x0 is 1.00 for every case
• i.e. x0 is constant
• Implicit in SPSS
– Some packages force you to make it
explicit
– (Later on we'll need to make it explicit)

beds (x1)   x0   £ (000s)
1           1     77
2           1     74
1           1     88
3           1     62
5           1     90
5           1    136
2           1     35
5           1    134
4           1    138
1           1     55
72
• Why the intercept?
– Where the regression line intercepts the y-
axis
– Sometimes called y-intercept

73
Finding the Slope
• How do we find the values of b0 and b1?
– The best estimates, which minimise SSE
– Iterative approach
• Computer intensive – used to matter, doesn't
really any more
• (With fast computers and sensible search
algorithms – more on that later)

74
– b0=88.9 (mean)
– b1=10 (nice round number)
• SSE = 14948 – worse than it was
– b0=86.9,   b1=10,   SSE=13828
– b0=66.9,   b1=10,   SSE=7029
– b0=56.9,   b1=10,   SSE=6628
– b0=46.9,   b1=10,   SSE=8228
– b0=51.9,   b1=10,   SSE=7178
– b0=51.9,   b1=12,   SSE=6179
– b0=46.9,   b1=14,   SSE=5957
– ……..
75
• Quite a long time later
– b0 = 46.000372
– b1 = 14.79182
– SSE = 5921
• Gives the position of the
– Regression line (or)
– Line of best fit
• Better than guessing
• Not necessarily the only method
– But it is OLS, so it is the best (it is BLUE)

76
[Scatterplot: actual price and predicted price (the regression line) against number of bedrooms]

77
• We now know
– A house with no bedrooms is worth
£46,000 (??!)
• Told us two things
– Don‟t extrapolate to meaningless values of
x-axis
– Constant is not necessarily useful
• It is necessary to estimate the equation

78
Standardised Regression Line
• One big but:
– Scale dependent
• Values change
– £ to €, inflation
• Scales change
– £, £000, £00?
• Need to deal with this
79
• Don't express in 'raw' units
– Express in SD units
– SD(x1) = 1.72
– SD(y) = 36.21
• b1 = 14.79
• We increase x1 by 1, and Ŷ increases by
14.79
14.79 → (14.79 / 36.21) SDs = 0.408 SDs

80
• Similarly, 1 SD of x1 = 1.72 units
– Increase x1 by 1 SD
– Ŷ increases by 14.79 × 1.72 = 25.4
• Put them both together

β = (b1 × SDx1) / SDy
81
= (14.79 × 1.72) / 36.21 = 0.706
• The standardised regression line
– Change (in SDs) in Ŷ associated with a
change of 1 SD in x1
• A different route to the same answer
– Standardise both variables (divide by SD)
– Find line of best fit

82
• The standardised regression line has a
special name
The Correlation Coefficient
(r)
(r stands for 'regression', but more on that
later)
• Correlation coefficient is a standardised
regression slope
– Relative change, in terms of SDs

83
Proportional Reduction in
Error

84
Proportional Reduction in Error
• We might be interested in the level of
improvement of the model
– How much less error (as proportion) do we
have
– Proportional Reduction in Error (PRE)
• Mean only
– Error(model 0) = 11806
• Mean + slope
– Error(model 1) = 5921
85
ERROR(0)  ERROR(1)
PRE 
ERROR(0)
ERROR(1)
PRE  1 
ERROR(0)
5921
PRE  1 
11806
PRE  0.4984
86
• But we squared all the errors in the first
place
– So we could take the square root
– (It's a shoddy excuse, but it makes the
point)

√0.4984 = 0.706

• This is the correlation coefficient
• Correlation coefficient is the square root
of the proportion of variance explained
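The PRE calculation, as a sketch; the two SSE values are the ones from the house-price slides (the unrounded PRE is 0.4985, which the slide shows as 0.4984 from slightly different rounding):

```python
import math

sse_mean_only = 11806   # ERROR(0): SSE for the mean-only model
sse_regression = 5921   # ERROR(1): SSE once bedrooms are in the model

pre = 1 - sse_regression / sse_mean_only
r = math.sqrt(pre)      # correlation = square root of the proportion of error reduced
print(round(pre, 3), round(r, 3))  # 0.498 0.706
```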
87
Standardised Covariance

88
Standardised Covariance
• We are still iterating
– Need a 'closed form'
– An equation to solve to get the parameter
estimates
• Answer is a standardised covariance
– A variable has variance
– Amount of 'differentness'
• We have used SSE so far
89
• SSE varies with N
– Higher N, higher SSE
• Divide by N
– Gives SSE per person
– (Actually N – 1, we have lost a df to the
mean)
• The variance
• Same as SD2
– We thought of SSE as a scattergram
• Y plotted against X
– (repeated image follows)

90
[Scatterplot (repeated): price against number of bedrooms, with residuals drawn from the mean line]

91
• Or we could plot Y against Y
– Axes meet at the mean (88.9)
– Draw a square for each point
– Calculate an area for each square
– Sum the areas
• Sum of areas
– SSE
• Sum of areas divided by N
– Variance

92
Plot of Y against Y
[Scatterplot: Y plotted against itself; the axes meet at the mean, 88.9]
93
Draw Squares
[Figure: each point gets a square of side (Y − Ȳ); e.g. 138 − 88.9 = 40.1 gives area 40.1 × 40.1 = 1608.1, and 35 − 88.9 = −53.9 gives area (−53.9) × (−53.9) = 2905.2]
94
• What if we do the same procedure
– Instead of Y against Y
– Y against X
•   Draw rectangles (not squares)
•   Sum the area
•   Divide by N - 1
•   This gives us the variance of x with y
– The Covariance
– Shortened to Cov(x, y)

95
96
[Figure: each point gets a rectangle of sides (x − x̄) and (y − ȳ); e.g. (55 − 88.9) × (1 − 3) = (−33.9) × (−2) = 67.8, and (138 − 88.9) × (4 − 3) = 49.1 × 1 = 49.1]
97
• More formally (and easily)
• We can state what we are doing as an
equation
– Where Cov(x, y) is the covariance

Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)

• Cov(x, y) = 44.2
• What do points in different sectors do
to the covariance?
98
• Problem with the covariance
– Tells us about two things
– The variance of X and Y
– The covariance
• Need to standardise it
– Like the slope
• Two ways to standardise the covariance
– Standardise the variables first
• Subtract from mean and divide by SD
– Standardise the covariance afterwards

99
• First approach
– Much more computationally expensive
• Too much like hard work to do by hand
– Need to standardise every value
• Second approach
– Much easier
– Standardise the final value only
• Need the combined variance
– Multiply two variances
– Find square root (were multiplied in first
place)

100
• Standardised covariance

r = Cov(x, y) / √[Var(x) × Var(y)]
  = 44.2 / √(2.9 × 1311)
  = 0.706

101
• The correlation coefficient
– A standardised covariance is a correlation
coefficient

r = Covariance / √(variance × variance)

102
• Expanded …

r = [Σ(x − x̄)(y − ȳ) / (N − 1)] /
    √{ [Σ(x − x̄)² / (N − 1)] × [Σ(y − ȳ)² / (N − 1)] }
103
• This means …
– We now have a closed form equation to
calculate the correlation
– Which is the standardised slope
– Which we can use to calculate the
unstandardised slope

104
We know that:

r = (b1 × SDx1) / SDy

So:

b1 = (r × SDy) / SDx1
105
b1 = (r × SDy) / SDx1

b1 = (0.706 × 36.21) / 1.72
b1 = 14.79

• So value of b1 is the same as the iterative
approach
106
• The intercept
– Just while we are at it
• The variables are centred at zero
– We subtracted the mean from both
variables
– Intercept is zero, because the axes cross at
the mean

107
• Add mean of y to the constant
• Subtract mean of x
– But not the whole mean of x
– Need to correct it for the slope
c  y  b1 x1
c  88.9  14.8  3
c  46.00
• Naturally, the same
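All of these closed-form quantities can be computed from the raw house data in a few lines; a sketch (sample SDs and the covariance both divide by N − 1):

```python
import statistics as st

beds  = [1, 2, 1, 3, 5, 5, 2, 5, 4, 1]
price = [77, 74, 88, 62, 90, 136, 35, 134, 138, 55]  # £000s

n = len(beds)
mx, my = st.mean(beds), st.mean(price)

# Covariance: the sum of the rectangle areas, divided by N - 1.
cov = sum((x - mx) * (y - my) for x, y in zip(beds, price)) / (n - 1)

# Standardise the covariance to get the correlation.
r = cov / (st.stdev(beds) * st.stdev(price))

# Unstandardise back to the slope and intercept.
b1 = r * st.stdev(price) / st.stdev(beds)
b0 = my - b1 * mx

print(round(cov, 1), round(r, 3), round(b1, 2), round(b0, 1))
# 44.2 0.706 14.79 46.0
```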
108
Accuracy of Prediction

109
One More (Last One)
• We have one more way to calculate the
correlation
– Looking at the accuracy of the prediction
• Use the parameters
– b0 and b1
– To calculate a predicted value for each
case

110
• Plot actual price against predicted price
– From the model

Beds   Actual £ (000s)   Predicted £ (000s)
1       77                60.80
2       74                75.59
1       88                60.80
3       62                90.38
5       90               119.96
5      136               119.96
2       35                75.59
5      134               119.96
4      138               105.17
1       55                60.80

111
[Scatterplot: predicted value against actual value]

112
• r = 0.706
– The correlation
• Seems a futile thing to do
– And at this stage, it is
– But later on, we will see why
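The "accuracy of prediction" route, sketched: correlate actual Y with predicted Ŷ and you recover the same r (intercept and slope are the values from the earlier slides).

```python
import statistics as st

beds  = [1, 2, 1, 3, 5, 5, 2, 5, 4, 1]
price = [77, 74, 88, 62, 90, 136, 35, 134, 138, 55]

b0, b1 = 46.0, 14.79  # intercept and slope from the earlier slides
predicted = [b0 + b1 * x for x in beds]

# Correlation of actual with predicted price.
mp, my = st.mean(predicted), st.mean(price)
cov = sum((p - mp) * (y - my) for p, y in zip(predicted, price)) / (len(price) - 1)
r = cov / (st.stdev(predicted) * st.stdev(price))
print(round(r, 3))  # 0.706 – the same correlation
```

The result is inevitable here: the predictions are a linear function of one IV, so correlating them with Y is the same as correlating the IV with Y.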

113
Some More Formulae
• For hand calculation (x and y as deviations
from their means)

r = Σxy / √(Σx² × Σy²)

• Point biserial

r = [(My1 − My0) × √(PQ)] / sdy
114
• Phi (φ)
– Used for 2 dichotomous variables

                   Vote P    Vote Q
Homeowner          A: 19     B: 54
Not homeowner      C: 60     D: 53

r = (AD − BC) / √[(A + B)(C + D)(A + C)(B + D)]
115
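A sketch of the phi calculation for the 2×2 table above, using the standard formula φ = (AD − BC) / √[(A + B)(C + D)(A + C)(B + D)]:

```python
import math

# Cell counts from the homeowner-by-vote table.
A, B, C, D = 19, 54, 60, 53

phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
print(round(phi, 3))  # -0.267: homeowners lean toward voting Q
```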
• Problem with the phi correlation
– Unless Px = Py (or Px = 1 − Py)
• Maximum (absolute) value is < 1.00
• Tetrachoric can be used
• Rank (Spearman) correlation
– Used where data are ranked

r = 1 − 6Σd² / [n(n² − 1)]

116
Summary
• Mean is an OLS estimate
– OLS estimates are BLUE
• Regression line
– Best prediction of DV from IV
– OLS estimate (like mean)
• Standardised regression line
– A correlation

117
• Four ways to think about a correlation
– 1.   Standardised regression line
– 2.   Proportional Reduction in Error (PRE)
– 3.   Standardised covariance
– 4.   Accuracy of prediction

118
119
120
Lesson 3: Why Regression?

A little aside, where we look at
why regression has such a curious
name.

121
Regression
The act of regressing; reversion; a return to an
earlier stage of development, as in
a child
• So why name a statistical technique after
this?

122
• Francis Galton
– Charles Darwin‟s cousin
– Studying heritability
• Tall fathers have shorter sons
• Short fathers have taller sons
– 'Filial regression toward mediocrity'
– Regression to the mean

123
• Galton thought this was biological fact
– Evolutionary basis?
• Then did the analysis backward
– Tall sons have shorter fathers
– Short sons have taller fathers
• Regression to the mean
– Not biological fact, statistical artefact

124
Other Examples
• Secrist (1933): The Triumph of Mediocrity in Business
• Second albums often tend to not be as good
as first
• Sequel to a film is not as good as the first
one
• 'Curse of Athletics Weekly'
• Parents think that punishing bad behaviour
works, but rewarding good behaviour doesn‟t

125

• An alternative to a scatterplot
– Plot x and y on two parallel axes, with a line
joining each case's x value to its y value

[Figure: r = 1.00]

[Figure: r = 0.00]
128
From Regression to
Correlation
• Where do we predict an individual's
score on y will be, based on their score
on x?
– Depends on the correlation
• r = 1.00 – we know exactly where they
will be
• r = 0.00 – we have no idea
• r = 0.50 – we have some idea
129
[Figure: r = 1.00 – starts here on x; will end up exactly here on y]

[Figure: r = 0.00 – starts here on x; could end anywhere on y]

[Figure: r = 0.50 – starts here on x; probably ends somewhere near here on y]

132
Galton Squeeze Diagram
• Don‟t show individuals
– Show groups of individuals, from the same
(or similar) starting point
– Shows regression to the mean

133
[Galton squeeze diagrams: at r = 0.00, every group ends at the mean of y; at r = 0.50, groups move halfway back toward the mean; at r = 1.00, groups stay where they started]
136
[Figure: a group starting 1 unit from the mean on x ends r units from the mean on y]

• Correlation is amount of regression that
doesn‟t occur

137
[Figures:
• No regression – r = 1.00
• Some regression – r = 0.50
• Lots (maximum) regression – r = 0.00]

140
Formula

ẑy = rxy × zx

141
Conclusion
• Regression towards the mean is a statistical necessity
– regression = perfection – correlation
• Very non-intuitive
• Interest in regression and correlation
– Came from examining the extent of regression
towards the mean
– By Pearson – who worked with Galton
– Stuck with the curious name
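Regression to the mean as a statistical artefact can be seen in simulation; a sketch (the sample size, seed, and r = 0.5 are illustrative choices): generate correlated father/son z-scores, and the sons of tall fathers sit, on average, only about r times as far above the mean; running it "backward" shows the same for the fathers of tall sons.

```python
import random

random.seed(1)
r = 0.5

pairs = []
for _ in range(100_000):
    father = random.gauss(0, 1)
    # Son's z-score correlates r = 0.5 with father's; both standard normal.
    son = r * father + (1 - r ** 2) ** 0.5 * random.gauss(0, 1)
    pairs.append((father, son))

def mean(xs):
    return sum(xs) / len(xs)

# Tall fathers (z > 1): their sons are, on average, closer to the mean.
tall_f = [(f, s) for f, s in pairs if f > 1]
mf1 = mean([f for f, s in tall_f])
ms1 = mean([s for f, s in tall_f])
print(round(mf1, 2), round(ms1, 2))  # sons regress toward the mean

# Backward: tall sons (z > 1) also have fathers closer to the mean.
tall_s = [(f, s) for f, s in pairs if s > 1]
ms2 = mean([s for f, s in tall_s])
mf2 = mean([f for f, s in tall_s])
print(round(ms2, 2), round(mf2, 2))  # fathers regress too – no biology needed
```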

142
143
144
Lesson 4: Samples to
Populations – Standard Errors
and Statistical Significance

145
The Problem
• In Social Sciences
– We investigate samples
• Theoretically
– Randomly taken from a specified
population
– Every member has an equal chance of
being sampled
– Sampling one member does not alter the
chances of sampling another
• Not the case in (say) physics, biology,
etc.                                       146
Population
• But it's the population that we are
interested in
– Not the sample
– Population parameter represented with a Greek
letter
– A hat means 'estimate'

b = β̂
x̄ = μ̂
147
• Sample statistics (e.g. mean) estimate
population parameters
• Want to know
– Likely size of the parameter
– If it is > 0

148
Sampling Distribution
• We need to know the sampling
distribution of a parameter estimate
– How much does it vary from sample to
sample
• If we make some assumptions
– We can know the sampling distribution of
many statistics
149
Sampling Distribution of the
Mean
• Given
– Normal distribution
– Random sample
– Continuous data
• Mean has a known sampling distribution
– Repeatedly sampling will give a known
distribution of means
– Centred around the true (population) mean
(μ)
150
Analysis Example: Memory
• Difference in memory for different
words
– 10 participants given a list of 30 words to
learn, and then tested
– Two types of word
• Abstract: e.g. love, justice
• Concrete: e.g. carrot, table

151
Concrete   Abstract   Diff (x)
12          4           8
11          7           4
 4          6          -2
 9         12          -3
 8          6           2
12         10           2
 9          8           1
 8          5           3
12         10           2
 8          4           4

x̄ = 2.1,  sx = 3.11,  N = 10
152
Confidence Intervals
• This means
– If we know the mean in our sample
– We can estimate where the mean in the
population () is likely to be
• Using
– The standard error (se) of the mean
– Represents the standard deviation of the
sampling distribution of the mean

153
1 SD contains
68%

Almost 2 SDs
contain 95%

154
• We know the sampling distribution of
the mean
– t distributed
– Normal with large N (>30)
• Know the range within which means from
other samples will fall
– Therefore the likely range of μ

se(x̄) = sx / √n
155
• Two implications of equation
– Increasing N decreases SE
• But only a bit
– Decreasing SD decreases SE
• Calculate Confidence Intervals
– From standard errors
• 95% is a standard level of CI
– 95% of samples the true mean will lie within
the 95% CIs
– In large samples: 95% CI = 1.96  SE
– In smaller samples: depends on t
distribution (df=N-1=9)
156
x  2.1,
 x  3.11,
N  10
 x 3.11
se( x )           0.98
n   10
157
95% CI  2.26  0.98  2.22

x  CI    x  CI
-0.12    4.32

158
What is a CI?
• (For 95% CI):
• 95% chance that the true (population)
value lies within the confidence
interval?
• 95% of samples, true mean will land
within the confidence interval?

159
Significance Test
• Probability that  is a certain value
– Almost always 0
• Doesn‟t have to be though
• We want to test the hypothesis that the
difference is equal to 0
– i.e. find the probability of this difference
occurring in our sample IF =0
– (Not the same as the probability that =0)
160
• Calculate SE, and then t
– t has a known sampling distribution
– Can test probability that a certain value is
included

t = x̄ / se(x̄) = 2.1 / 0.98 = 2.14

p = 0.061

161
Other Parameter Estimates
• Same approach
– Prediction, slope, intercept, predicted
values
– At this point, prediction and slope are the
same
• Won‟t be later on
• We will look at one predictor only
– More complicated with > 1
162
Testing the Degree of
Prediction
• Prediction is correlation of Y with Ŷ
– The correlation – when we have one IV
• Use F, rather than t
• Started with SSE for the mean only
– This is SStotal
– Divide this into SSresidual
– SSregression
• SStot = SSreg + SSres
163
F = (SSreg / df1) / (SSres / df2)

df1 = k
df2 = N − k − 1
164
• Back to the house prices
– Original SSE (SStotal) = 11806
– SSresidual = 5921
• What is left after our model
– SSregression = 11806 – 5921 = 5885
• What our model explains
• Slope = 14.79
• Intercept = 46.0
• r = 0.706

165
F = (SSreg / df1) / (SSres / df2)

F = (5885 / 1) / (5921 / (10 − 1 − 1)) = 7.95

df1 = k = 1
df2 = N − k − 1 = 8
166
• F = 7.95, df = 1, 8, p = 0.02
– Can reject H0
• H0: Prediction is not better than chance
– A significant effect
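The F calculation for the house-price model, sketched with the sums of squares from the slides:

```python
ss_total = 11806      # SSE for the mean-only model (SStotal)
ss_residual = 5921    # SSE left after adding bedrooms (SSresidual)
ss_regression = ss_total - ss_residual  # 5885: what the model explains

k, n = 1, 10          # one predictor, ten houses
df1 = k
df2 = n - k - 1       # 8

F = (ss_regression / df1) / (ss_residual / df2)
print(round(F, 2))  # 7.95
```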

167
Statistical Significance:
What does a p-value (really)
mean?

168
A Quiz

• Six questions, each true or false

• An experiment has been done. Carried out
perfectly. All assumptions perfectly satisfied.
Absolutely no problems.
• P = 0.01
– Which of the following can we say?
169
1. You have absolutely disproved the null
hypothesis (that is, there is no
difference between the population
means).

170
2. You have found the probability of the
null hypothesis being true.

171
3. You have absolutely proved your
experimental hypothesis (that there is
a difference between the population
means).

172
4. You can deduce the probability of the
experimental hypothesis being true.

173
5. You know, if you decide to reject the
null hypothesis, the probability that
you are making the wrong decision.

174
6. You have a reliable experimental
finding in the sense that if,
hypothetically, the experiment were
repeated a great number of times, you
would obtain a significant result on
99% of occasions.

175
OK, What is a p-value
• Cohen (1994)
“[a p-value] does not tell us what we
want to know, and we so much want to
know what we want to know that, out
of desperation, we nevertheless believe
it does” (p 997).

176
OK, What is a p-value
• Sorry, didn't answer the question
• It's the probability of obtaining a result
as or more extreme than the result we
have in the study, given that the null
hypothesis is true
• Not the probability the null hypothesis is
true

177
A Bit of Notation
• Not because we like notation
– But we have to say a lot less

•   Probability – P
•   Null hypothesis is true – H
•   Result (data) – D
•   Given - |
178
What‟s a P Value
• P(D|H)
– Probability of the data occurring if the null
hypothesis is true
• Not
• P(H|D)
– Probability that the null hypothesis is true,
given that we have the data
• P(H|D) ≠ P(D|H)
179
• What is the probability you are Prime Minister
– Given that you are British?
– P(M|B)
– Very low
• What is the probability you are British
– Given you are Prime Minister?
– P(B|M)
– Very high
• P(M|B) ≠ P(B|M)

180
• There‟s been a murder
– Someone bumped off a statto for talking too
much
• The police have DNA
• The police have your DNA
– They match(!)
• DNA matches 1 in 1,000,000 people
• What's the probability you didn't do the
murder, given the DNA match – P(H|D)?

181
• Police say:
– P(D|H) = 1/1,000,000
• Luckily, you have Jeremy on your defence
team
• We say:
– P(D|H) ≠ P(H|D)
• Probability that someone matches the
DNA, who didn't do the murder
– Incredibly high

182
Back to the Questions
• Haller and Kraus (2002)
– Asked those questions of groups in
Germany
– Psychology students
– Psychology lecturers and professors (who
didn't teach stats)
– Psychology lecturers and professors (who
did teach stats)
183
1. You have absolutely disproved the null
hypothesis (that is, there is no difference
between the population means).
•   True
•   34% of students
•   15% of professors/lecturers,
•   10% of professors/lecturers teaching statistics
•   False
•   We have found evidence against the null
hypothesis

184
2. You have found the probability of the
null hypothesis being true.
– 32% of students
– 26% of professors/lecturers
– 17% of professors/lecturers teaching
statistics
•   False
•   We don't know

185
3. You have absolutely proved your
experimental hypothesis (that there is a
difference between the population means).
–   20% of students
–   13% of professors/lecturers
–   10% of professors/lecturers teaching statistics
•   False

186
4. You can deduce the probability of the
experimental hypothesis being true.
– 59% of students
– 33% of professors/lecturers
– 33% of professors/lecturers teaching
statistics
•   False

187
5. You know, if you decide to reject the null
hypothesis, the probability that you are
making the wrong decision.
•   68% of students
•   67% of professors/lecturers
•   73% of professors/lecturers teaching
statistics
•   False
•   Can be worked out
– P(replication)

188
6. You have a reliable experimental finding
in the sense that if, hypothetically, the
experiment were repeated a great
number of times, you would obtain a
significant result on 99% of occasions.
– 41% of students
– 49% of professors/lecturers
– 37% of professors/lecturers teaching
statistics
•   False
•   Another tricky one
– It can be worked out
189
One Last Quiz
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.05
• You replicate the study exactly
– What is probability you find p < 0.05?

190
• I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.01
• You replicate the study exactly
– What is probability you find p < 0.05?

191
• Significance testing creates boundaries
and gaps where none exist.
• Significance testing means that we find
it hard to build upon knowledge
– we don't get an accumulation of
knowledge

192
• Yates (1951)
"the emphasis given to formal tests of significance
... has resulted in ... an undue concentration of
effort by mathematical statisticians on
investigations of tests of significance applicable
to problems which are of little or no practical
importance ... and ... it has caused scientific
research workers to pay undue attention to the
results of the tests of significance ... and too
little to the estimates of the magnitude of the
effects they are investigating."

193
Testing the Slope
• Same idea as with the mean
– Estimate 95% CI of slope
– Estimate significance of difference from a
value (usually 0)
• Need to know the sd of the slope
– Similar to SD of the mean

194
s_y.x = √[ Σ(Y − Ŷ)² / (N − k − 1) ]

s_y.x = √[ SS_res / (N − k − 1) ]

s_y.x = √(5921 / 8) = 27.2
195
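As a quick check, the same calculation in Python (used here only because it runs anywhere; the function name is mine, not from the course):

```python
import math

def std_error_of_estimate(ss_res, n, k):
    """Standard error of the estimate: sqrt(SS_res / (N - k - 1))."""
    return math.sqrt(ss_res / (n - k - 1))

# The slide's numbers: SS_res = 5921, N = 10 cases, k = 1 predictor
s_yx = std_error_of_estimate(5921, 10, 1)   # about 27.2
```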
• Similar to equation for SD of mean
• Then we need standard error
- Similar (ish)
• When we have standard error
– Can go on to 95% CI
– Significance of difference

196
se(b_y.x) = s_y.x / √[ Σ(x − x̄)² ]

se(b_y.x) = 27.2 / √26.9 = 5.24

197
• Confidence Limits
• 95% CI
– t dist with N − k − 1 df is 2.31
– CI = 5.24 × 2.31 = 12.1
• 95% confidence limits

14.8 − 12.1 = 2.7        14.8 + 12.1 = 26.9

198
• Significance of difference from zero
– i.e. probability of getting result if b = 0
• Not probability that b = 0

t = b / se(b) = 14.7 / 5.2 = 2.81
df = N − k − 1 = 8
p = 0.02
• This probability is (of course) the same
as the value for the prediction
199
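Putting these steps together, a short Python sketch (Python rather than SPSS, purely so it can be run anywhere; the numbers are the slides' own):

```python
import math

# Numbers from the slides: s_yx = 27.2, sum of squared deviations of x = 26.9,
# slope b = 14.8, critical t for df = 8 is 2.31
s_yx = 27.2
ss_x = 26.9          # sum of (x - mean(x))^2
b = 14.8

se_b = s_yx / math.sqrt(ss_x)           # standard error of the slope
half_width = 2.31 * se_b                # 95% CI half-width
ci = (b - half_width, b + half_width)   # roughly (2.7, 26.9)
t = b / se_b                            # t-test of difference from zero
```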
Testing the Standardised
Slope (Correlation)
• Correlation is bounded between –1 and +1
– Does not have symmetrical distribution, except
around 0
• Need to transform it
– Fisher z' transformation – approximately
normal

z' = 0.5[ln(1 + r) − ln(1 − r)]

SE_z = 1 / √(n − 3)                          200
z  0.5[ln(1  0.706)  ln(1  0.706)]
z  0.879
1     1
SEz               0.38
n3   10  3
• 95% CIs
– 0.879 – 1.96 * 0.38 = 0.13
– 0.879 + 1.96 * 0.38 = 1.62

201
• Transform back to correlation

e 1
2y
r  2y
e 1

• 95% CIs = 0.13 to 0.92
• Very wide
– Small sample size
– Maybe that‟s why CIs are not reported?

202
Using Excel
• Functions in Excel
– FISHER() – to carry out the Fisher
transformation
– FISHERINV() – to transform back to the
correlation

203
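The same transformation pair in Python, mirroring Excel's FISHER and FISHERINV (function names are mine), run on the slides' example (r = 0.706, n = 10):

```python
import math

def fisher_z(r):
    """Fisher z' transformation (what Excel's FISHER computes)."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def fisher_inv(z):
    """Back-transformation to r (what Excel's FISHERINV computes)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

r, n = 0.706, 10
z = fisher_z(r)                   # 0.879
se = 1 / math.sqrt(n - 3)         # 0.38
lo, hi = z - 1.96 * se, z + 1.96 * se
ci = (fisher_inv(lo), fisher_inv(hi))   # roughly (0.14, 0.92)
```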
The Others
• Same ideas for calculation of CIs and
SEs for
– Predicted score
– Gives expected range of values given X
• Same for intercept
– But we have probably had enough

204
Lesson 5: Introducing Multiple
Regression

205
Residuals
• We said
Y = b0 + b1x1
• We could have said
Yi = b0 + b1xi1 + ei
• We ignored the i on the Y
• And we ignored the ei
– It's called error, after all
• But it isn‟t just error
– Trying to tell us something
206
What Error Tells Us
• Error tells us that a case has a different
score for Y than we predict
– There is something about that case
• Called the residual
– What is left over, after the model
• Contains information
– Something is making the residual  0
– But what?

207
[Scatterplot: actual and predicted price (£000s) against number of
bedrooms. A point well above the predicted line is labelled
"swimming pool"; one well below is labelled "unpleasant neighbours".]

208
• The residual (+ the mean) is the value
of Y if all cases were equal on X
• It is the value of Y, controlling for X
• Other words:
– Holding constant
– Partialling
– Residualising
– Conditioned on

209
Beds   Price (£000s)   Predicted   Residual   Residual + mean
1           77             61         -16          105
2           74             76           2           90
1           88             61         -27           62
3           62             90          28          117
5           90            120          30          119
5          136            120         -16           73
2           35             76          41          129
5          134            120         -14           75
4          138            105         -33           56
1           55             61           6           95
210
• Sometimes adjustment is enough on its own
– Measure performance against criteria
• Teenage pregnancy rate
– Measure pregnancy and abortion rate in areas
– Control for socio-economic deprivation, and
anything else important
– See which areas have lower teenage pregnancy
and abortion rate, given same level of deprivation
– Measure school performance
– Control for initial intake

211
Control?
• In experimental research
– Use experimental control
– e.g. same conditions, materials, time of
day, accurate measures, random
assignment to conditions
• In non-experimental research
– Can‟t use experimental control

212
Analysis of Residuals
• What predicts differences in crime rate
– After controlling for socio-economic
deprivation
– Number of police?
– Crime prevention schemes?
– Rural/Urban proportions?
– Something else
• This is what regression is about

213
• Exam performance
– Consider number of books a student read
(books)
– Number of lectures (max 20) a student
attended (attend)
• Books and attend as IV, grade as DV

214
Books   Attend   Grade
  0        9       45
  1       15       57
  0       10       45      First 10 cases
  2       16       51
  4       10       65
  4       20       88
  1       11       44
  4       20       87
  3       15       89
  0       15       59

215
• Use books as IV
– R=0.492, F=12.1, df=1, 38, p=0.001
– b0=52.1, b1=5.7
– (Intercept makes sense)
• Use attend as IV
– R=0.482, F=11.5, df=1, 38, p=0.002
– b0=37.0, b1=1.9
– (Intercept makes less sense)

216
100

90

80

70

60

50

40

30
-1      0   1   2   3   4   5

Books
217
100

90

80

70

60

50

40

30
5        7   9   11   13   15   17   19   21

Attend
218
Problem
• Use R2 to give proportion of shared
variance
– Books = 24%
– Attend = 23%
• So we have explained 24% + 23% =
47% of the variance
– NO!!!!!

219
• Look at the correlation matrix

         BOOKS   ATTEND
BOOKS     1
ATTEND    0.44      1

• Correlation of books and attend is
(unsurprisingly) not zero
– Some of the variance that books shares
with grade, is also shared by attend
220
– (If I have 2 cars and she has 2 cars, do
we have 4 different cars between us?)
– No. We need to know how many of my 2
cars are the same cars as her 2 cars
• Similarly with regression
– But we can do this with the residuals
– Residuals are what is left after (say) books
– See if residual variance is explained by
attend
– Can use this new residual variance to
calculate SSres, SStotal and SSreg
221
• Well. Almost.
– This would give us correct values for SS
– Would not be correct for slopes, etc
• Assumes that the variables have a
causal priority
– Why should attend have to take what is
left from books?
– Why should books have to take what is left
by attend?
• Use OLS again

222
• Simultaneously estimate 2 parameters
– b1 and b2
– Y = b0 + b1x1 + b2x2
– x1 and x2 are IVs
• Not trying to fit a line any more
– Trying to fit a plane
• Can solve iteratively
– Closed form equations better
– But they are unwieldy

223
[3D scatterplot: y against x1 and x2 (2 points only)]
224
[3D diagram: the regression plane over x1 and x2, with intercept b0
and slopes b1 and b2]
225
(Really) Ridiculous Equations

b1 = [ Σ(y − ȳ)(x1 − x̄1) · Σ(x2 − x̄2)²  −  Σ(y − ȳ)(x2 − x̄2) · Σ(x1 − x̄1)(x2 − x̄2) ]
     / [ Σ(x1 − x̄1)² · Σ(x2 − x̄2)²  −  ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b2 = [ Σ(y − ȳ)(x2 − x̄2) · Σ(x1 − x̄1)²  −  Σ(y − ȳ)(x1 − x̄1) · Σ(x1 − x̄1)(x2 − x̄2) ]
     / [ Σ(x1 − x̄1)² · Σ(x2 − x̄2)²  −  ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b0 = ȳ − b1x̄1 − b2x̄2
226
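Unwieldy, but mechanical. As a check, here is a Python sketch applying the closed-form equations to the ten books/attend/grade cases listed earlier (helper names are mine; Python only because it runs anywhere):

```python
books  = [0, 1, 0, 2, 4, 4, 1, 4, 3, 0]
attend = [9, 15, 10, 16, 10, 20, 11, 20, 15, 15]
grade  = [45, 57, 45, 51, 65, 88, 44, 87, 89, 59]

def mean(v):
    return sum(v) / len(v)

def sp(a, b):
    """Sum of products of deviations: sum((a - mean(a)) * (b - mean(b)))."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

y, x1, x2 = grade, books, attend
den = sp(x1, x1) * sp(x2, x2) - sp(x1, x2) ** 2
b1 = (sp(y, x1) * sp(x2, x2) - sp(y, x2) * sp(x1, x2)) / den
b2 = (sp(y, x2) * sp(x1, x1) - sp(y, x1) * sp(x1, x2)) / den
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
# For these ten cases: b0 is about 23.9, b1 about 6.1, b2 about 1.9
```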
• The good news
– There is an easier way
– It involves matrix algebra
• The good news
– We don't really need to know how to do it
– We need to know it exists

227
A Quick Guide to Matrix
Algebra
(I will never make you do it again)

228
Very Quick Guide to Matrix
Algebra
• Why?
– Matrices make life much easier in
multivariate statistics
– Some things simply cannot be done
without them
– Some things are much easier with them
• If you can manipulate matrices
– you can specify calculations v. easily
– e.g. A'A = sum of squares of a column
• Doesn't matter how long the column
229
• A scalar is a number
A scalar: 4
• A vector is a row or column of numbers

A row vector:  [2 4 8 7]

                  [ 5]
A column vector:  [11]
230
• A vector is described as rows × columns

[2 4 8 7]
– Is a 1 × 4 vector

[ 5]
[11]
– Is a 2 × 1 vector
– A number (scalar) is a 1 × 1 vector

231
• A matrix is a rectangle, described as
rows × columns

[2 6 5 7 8]
[4 5 7 5 3]
[1 5 2 7 8]

• Is a 3 × 5 matrix
• Matrices are referred to with bold capitals
- A is a matrix                           232
• Correlation matrices and covariance
matrices are special
– They are square and symmetrical
– Correlation matrix of books, attend and
grade:

[1.00  0.44  0.49]
[0.44  1.00  0.48]
[0.49  0.48  1.00]
233
• Another special matrix is the identity
matrix I
– A square matrix, with 1 in the diagonal and
0 in the off-diagonal

1      0 0 0
            
0      1 0 0
I
0     0 1 0

0           
       0 0 1

– Note that this is a correlation matrix, with
correlations all = 0
234
Matrix Operations
• Transposition
– A matrix is transposed by putting it on its
side
– Transpose of A is A'

A  = [7 5 6]

      [7]
A' =  [5]
      [6]
235
• Matrix multiplication
– A matrix can be multiplied by a scalar, a
vector or a matrix
– Not commutative
– AB  BA
– To multiply AB
• Number of columns in A must equal number of
rows in B

236
• Matrix by vector

[a d g] [j]   [aj + dk + gl]
[b e h] [k] = [bj + ek + hl]
[c f i] [l]   [cj + fk + il]

[ 2  3  5] [2]   [ 4 +  9 + 20]   [ 33]
[ 7 11 13] [3] = [14 + 33 + 52] = [ 99]
[17 19 23] [4]   [34 + 57 + 92]   [183]
237
• Matrix by matrix

[a b] [e f]   [ae + bg   af + bh]
[c d] [g h] = [ce + dg   cf + dh]

[2 3] [2 3]   [ 4 + 12    6 + 15]   [16 21]
[5 7] [4 5] = [10 + 28   15 + 35] = [38 50]
238
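The two worked examples can be checked with a tiny list-of-lists multiplier (a sketch, no libraries; the function name is mine):

```python
def matmul(A, B):
    """Multiply A (m x n) by B (n x p); columns of A must equal rows of B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Matrix by vector (a vector is just an n x 1 matrix)
Av = matmul([[2, 3, 5], [7, 11, 13], [17, 19, 23]], [[2], [3], [4]])
# -> [[33], [99], [183]]

# Matrix by matrix
AB = matmul([[2, 3], [5, 7]], [[2, 3], [4, 5]])
# -> [[16, 21], [38, 50]]
```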
• Multiplying by the identity matrix
– Has no effect
– Like multiplying by 1

AI = A

[2 3] [1 0]   [2 3]
[5 7] [0 1] = [5 7]

239
• For a scalar J, the inverse of J is: 1/J
• J × 1/J = 1
• Same with matrices
– Matrices have an inverse
– Inverse of A is A-1
– AA-1=I
• Inverting matrices is dull
– We will do it once
– But first, we must calculate the
determinant
240
• The determinant of A is |A|
• Determinants are important in statistics
– (more so than the other matrix algebra)
• We will do a 2x2
– Much more difficult for larger matrices

241
    [a  b]
A = [c  d]

|A| = ad − bc

    [1.0  0.3]
A = [0.3  1.0]

|A| = 1 × 1 − 0.3 × 0.3
|A| = 0.91
242
• Determinants are important because
– Needs to be above zero for regression to
work
– Zero or negative determinant of a
correlation/covariance matrix means
something wrong with the data
• Linear redundancy
• Described as:
– Not positive definite
– Singular (if determinant is zero)
• In different error messages

243
    [a  b]
A = [c  d]

• Now

        1  [ d  −b]
A⁻¹ = ───  [−c   a]
       |A|
244
• Find A⁻¹

    [1.0  0.3]
A = [0.3  1.0]

|A| = 0.91

        1   [ 1.0  −0.3]
A⁻¹ = ────  [−0.3   1.0]
      0.91

      [ 1.10  −0.33]
A⁻¹ = [−0.33   1.10]
245
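A sketch of both operations for the 2 × 2 case (function names are mine), checked against the example's numbers:

```python
def det2(A):
    """Determinant of a 2 x 2 matrix: ad - bc."""
    (a, b), (c, d) = A
    return a * d - b * c

def inv2(A):
    """Inverse of a 2 x 2 matrix: (1/|A|) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = A
    det = det2(A)
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1.0, 0.3], [0.3, 1.0]]
d = det2(A)        # 0.91
Ainv = inv2(A)     # approximately [[1.10, -0.33], [-0.33, 1.10]]
```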
Matrix Algebra with
Correlation Matrices

246
Determinants
• Determinant of a correlation matrix
– The volume of 'space' taken up by the
(hyper) sphere that contains all of the
points

    [1.0  0.0]
A = [0.0  1.0]

|A| = 1.0
247
[Scatterplot: uncorrelated points scattered across the whole space]

    [1.0  0.0]
A = [0.0  1.0]

|A| = 1.0          248
[Scatterplot: perfectly correlated points lying on a single straight line]

    [1.0  1.0]
A = [1.0  1.0]

|A| = 0.0          249
Negative Determinant
• Points take up less than no space
– Correlation matrix cannot exist
– Non-positive definite matrix

250
Sometimes Obvious

1.0 1.2 
A        
1.2 1.0 
        
A  0.44

251
Sometimes Obvious (If You
Think)

    [ 1.0   0.9  −0.9]
A = [ 0.9   1.0   0.9]
    [−0.9   0.9   1.0]

|A| = −2.88
252
Sometimes No Idea

    [ 1.00   0.76  −0.40]
A = [ 0.76   1.00   0.30]
    [−0.40   0.30   1.00]

|A| = −0.01

    [ 1.00   0.75  −0.40]
A = [ 0.75   1.00   0.30]
    [−0.40   0.30   1.00]

|A| = 0.0075           253
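Determinants like these can be checked numerically. A sketch with a cofactor-expansion helper (function name mine), on the two near-singular correlation matrices above:

```python
def det3(A):
    """Determinant of a 3 x 3 matrix by cofactor expansion along row 1."""
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

d1 = det3([[1.00, 0.76, -0.40],
           [0.76, 1.00, 0.30],
           [-0.40, 0.30, 1.00]])   # about -0.01 (not positive definite)

d2 = det3([[1.00, 0.75, -0.40],
           [0.75, 1.00, 0.30],
           [-0.40, 0.30, 1.00]])   # about 0.0075 (barely positive)
```

A change of 0.01 in one correlation flips the determinant's sign, which is the slide's point.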
Multiple R for Each Variable
• Diagonal of inverse of correlation matrix
– Used to calculate multiple R
– Call elements aij

R²_i.123...k = 1 − 1/a_ii
254
Regression Weights
• Where i is DV
• j is IV

b_i.j = −a_ij / a_ii
255
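These identities (squared multiple correlation and standardised weights from the inverse of the correlation matrix) can be checked numerically on the books/attend/grade correlations from earlier. The Gauss-Jordan `invert` helper is mine, not from the course:

```python
def invert(M):
    """Invert a square matrix by Gauss-Jordan elimination (no pivoting;
    fine for well-behaved correlation matrices)."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [row[n:] for row in A]

# Correlation matrix of books, attend, grade (from the earlier slide)
R = [[1.00, 0.44, 0.49],
     [0.44, 1.00, 0.48],
     [0.49, 0.48, 1.00]]
Rinv = invert(R)

i = 2                                  # grade as the DV
R2 = 1 - 1 / Rinv[i][i]                # squared multiple correlation
betas = [-Rinv[i][j] / Rinv[i][i]      # standardised weights b_i.j
         for j in range(3) if j != i]
```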
Back to the Good News
• We can calculate the standardised
parameters as
B = Rxx⁻¹ Rxy
• Where
– B is the vector of regression weights
– Rxx⁻¹ is the inverse of the correlation matrix
of the independent (x) variables
– Rxy is the vector of correlations of the
x and y variables
– Now do exercise 3.2

256
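For the two-IV case (books and attend predicting grade, correlations from earlier) B = Rxx⁻¹ Rxy can be sketched in a few lines; the 2 × 2 inverse is written out directly:

```python
Rxx = [[1.00, 0.44], [0.44, 1.00]]   # correlations among the IVs
Rxy = [0.49, 0.48]                    # correlations of each IV with Y

det = Rxx[0][0] * Rxx[1][1] - Rxx[0][1] * Rxx[1][0]
Rxx_inv = [[ Rxx[1][1] / det, -Rxx[0][1] / det],
           [-Rxx[1][0] / det,  Rxx[0][0] / det]]

# B = Rxx_inv x Rxy: the standardised regression weights
B = [sum(Rxx_inv[i][j] * Rxy[j] for j in range(2)) for i in range(2)]
# B is roughly [0.35, 0.33]
```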
One More Thing

• The whole regression equation can be
described with matrices
– very simply

Y = XB + E
257
• Where
– Y = vector of DV
– X = matrix of IVs
– B = vector of coefficients
• Go all the way back to our example

258
1   0    9            e1   45 
1   1    5           e   57 
                       2  
1   0   10            e3   45 
                        
1   2   16 
 b0     e4   51 
1   4   10     e5   65 
             b1       
1   4   20     e6   88 
  2   e7   44 
b
1   1   11
                        
1   4   20            e8   87 
1   3   15            e   89 

1                     9  
    0   15 

 e   59 
 10   

259
• The first column of X is the constant –
literally a constant. Could be any number,
but it is most convenient to make it 1.
Used to 'capture' the intercept.
260
1   0    9           e1   45 
                       
1   1    5           e2   57 
1   0   10           e   45 
                     3  
1   2   16  The matrix 51 values for
 e4   of
1   4       b0   (books65  attend)
10     IVs e5   and
            b1       
1   4   20    e6   88 
            b2   e   
1   1   11           7   44 
1   4   20           e8   87 
                       
1   3   15           e9   89 
1   0   15           e   59 
                     10   

261
1 0 9             e1   45 
                    
1 1 5             e2   57 
1 0 10            e   45 
                  3  
1 2 16            e4   51 
1 4 10  b0   e   65 
 
The parameter
         b1    5    
estimates. We are 20    e6   88 
1 4

trying to find 1 1 11 
the best  b2   e   
 7   44 
values of these. 20 
1 4                e8   87 
                    
1 3 15            e9   89 
1 0 15            e   59 
                  10   

262
• e1 … e10 are the error. We are trying to
minimise this.
263
1 0      9           e1   45 
                       
1 1      5           e2   57 
1 0     10           e   45 
                     3  
1 2     16           e4   51 
1 4         b0   e   65 
10  
            b1    5    
1 4     20    e6   88 
            b2   e   
1 1     11           7   44 
1 4     20           e8   87 
                       
1 3     15           e9   89 
1 0
The DV     grade  e10   59 
- 15 
                       

264
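The matrix form can be turned into estimates via the normal equations, (X'X)B = X'Y. A sketch on the ten cases above (Gaussian elimination; the helper layout is mine, and Python stands in for the matrix algebra):

```python
books  = [0, 1, 0, 2, 4, 4, 1, 4, 3, 0]
attend = [9, 15, 10, 16, 10, 20, 11, 20, 15, 15]
grade  = [45, 57, 45, 51, 65, 88, 44, 87, 89, 59]

# Design matrix X: a column of 1s (to capture the intercept), then the IVs
X = [[1, b, a] for b, a in zip(books, attend)]
Y = grade

# Normal equations: (X'X) B = X'Y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
XtY = [sum(r[i] * y for r, y in zip(X, Y)) for i in range(3)]

# Solve by Gauss-Jordan elimination on the augmented matrix
M = [row[:] + [t] for row, t in zip(XtX, XtY)]
for col in range(3):
    M[col] = [v / M[col][col] for v in M[col]]
    for r in range(3):
        if r != col:
            M[r] = [v - M[r][col] * p for v, p in zip(M[r], M[col])]
B = [row[3] for row in M]

# E = Y - XB, the residuals the OLS fit minimises
E = [y - sum(b * x for b, x in zip(B, row)) for row, y in zip(X, Y)]
```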
• Y = XB + E
• Simple way of representing as many IVs as
you like
Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e
                            [b0]
                            [b1]
[x01 x11 x21 x31 x41 x51]   [b2]   [e1]
                            [b3] + [e2] = Y
[x02 x12 x22 x32 x42 x52]   [b4]
                            [b5]
265
                            [b0]
                            [b1]
[x01 x11 x21 x31 x41 x51]   [b2]   [e1]
                            [b3] + [e2] = Y
[x02 x12 x22 x32 x42 x52]   [b4]
                            [b5]

Y = b0x0 + b1x1 + … + bkxk + e
266
Generalises to Multivariate
Case
• Y = XB + E
• Y, B and E
– Matrices, not vectors
• Goes beyond this course
– (Do Jacques Tacq's course for more)

267
268
269
270
Lesson 6: More on Multiple
Regression

271
Parameter Estimates
• Parameter estimates (b1, b2 … bk) were
standardised
– Because we analysed a correlation matrix
• Represent the correlation of each IV
with the DV
– When all other IVs are held constant

272
• Can also be unstandardised
• Unstandardised represent the unit
change in the DV associated with a 1
unit change in the IV
– When all the other variables are held
constant
• Parameters have standard errors
associated with them
– As with one IV
– Hence t-test, and associated probability
can be calculated
• Trickier than with one IV
273
Standard Error of Regression
Coefficient
• Standardised is easier

SE_i = √[ (1 − R²_Y) / (n − k − 1)  ×  1 / (1 − R²_i) ]

– R²_i is the value of R² when all other predictors are
used as predictors of that variable
• Note that if R²_i = 0, the equation is the same as for
previous

274
Multiple R

• The degree of prediction
– R (or Multiple R)
– No longer equal to b
• R² might be equal to the sum of squares
of B
– Only if all x's are uncorrelated

275
In Terms of Variance
• Can also think of this in terms of
variance explained.
– Each IV explains some variance in the DV
– The IVs share some of their variance
• Can't share the same variance twice

276
[Venn diagram: the total variance of Y (= 1), with the variance in Y
accounted for by x1 (r²x1y = 0.36) and the variance in Y accounted
for by x2 (r²x2y = 0.36) as non-overlapping regions]
277
• In this model
– R² = r²yx1 + r²yx2
– R² = 0.36 + 0.36 = 0.72
– R = √0.72 = 0.85
• But
– If x1 and x2 are correlated
– No longer the case

278
[Venn diagram: the total variance of Y (= 1), with the x1 and x2
regions (r²x1y = r²x2y = 0.36) now overlapping – variance shared
between x1 and x2 (not equal to rx1x2)]
279
• So
– We can no longer sum the r2
– Need to sum them, and subtract the
shared variance – i.e. the correlation
• But
– It's not the correlation between them
– It's the correlation between them as a
proportion of the variance of Y
• Two different ways

280
• Based on estimates

R² = b1·ryx1 + b2·ryx2

• If rx1x2 = 0
– b1 = ryx1
– Equivalent to r²yx1 + r²yx2

281
• Based on correlations

R² = (r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2) / (1 − r²x1x2)

• If rx1x2 = 0
– Equivalent to r²yx1 + r²yx2
282
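A sketch of this correlation-based formula (function name mine), checked against the books/attend/grade correlations from earlier, and against the uncorrelated case where it must reduce to the sum of the squared correlations:

```python
def r_squared(r_yx1, r_yx2, r_x1x2):
    """R^2 for two predictors from the three correlations."""
    return ((r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2)
            / (1 - r_x1x2 ** 2))

R2 = r_squared(0.49, 0.48, 0.44)     # about 0.33, not 0.24 + 0.23
R2_uncorr = r_squared(0.6, 0.6, 0.0) # 0.72 = 0.36 + 0.36
```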
• Can also be calculated using methods
we have seen
– Based on PRE
– Based on correlation with prediction
• Same procedure with >2 IVs

283
• R2 is an overestimate of population
value of R2
– Any x will not correlate 0 with Y
– Any variation away from 0 increases R
– Variation from 0 more pronounced with
lower N
• Need to correct R2

284

Adj. R² = 1 − (1 − R²) · (N − 1) / (N − k − 1)

• 1 – R²
– Proportion of unexplained variance
– We multiply this by an adjustment
• More variables – greater adjustment
• More people – less adjustment
285
Shrunken R2
• Some authors treat shrunken and
adjusted R2 as the same thing
– Others don't

286
N 1           N  20, k  3
20  1     19
     1.1875
N  k 1        20  3  1 16

N  10, k  8     N  10, k  3
10  1    9       10  1    9
 9                1.5
10  8  1 1      10  3  1 6

287
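A sketch of the adjustment (function name mine). With k = 0 the factor is (N − 1)/(N − 1) = 1, so adjusted R² equals R², which is a handy sanity check:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.50, 20, 3)   # adjustment factor 19/16 = 1.1875
```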
Extra Bits

• Some stranger things that can
happen
– Counter-intuitive

288
Suppressor variables
• Can be hard to understand
– Very counter-intuitive
• Definition
– An independent variable which increases
the size of the parameters associated with
other independent variables above the size
of their correlations

289
• An example (based on Horst, 1941)
– Success of trainee pilots
– Mechanical ability (x1), verbal ability (x2),
success (y)
• Correlation matrix

          Mech    Verb   Success
Mech       1      0.5      0.3
Verb       0.5    1        0
Success    0.3    0        1

290
– Mechanical ability correlates 0.3 with
success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
(Have a guess)

291
• Mechanical ability
– b = 0.4
– Larger than r!
• Verbal ability
– b = -0.2
– Smaller than r!!
• So what is happening?
– You need verbal ability to do the test
– Not related to mechanical ability
• Measure of mechanical ability is contaminated
by verbal ability

292
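For two IVs the standardised weights follow directly from the three correlations, so the example can be checked in a few lines (function name mine):

```python
def betas_two_ivs(r_y1, r_y2, r_12):
    """Standardised regression weights for two IVs from the correlations."""
    b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
    b2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
    return b1, b2

# Horst's pilot example: mech-success r = 0.3, verbal-success r = 0.0,
# mech-verbal r = 0.5
b_mech, b_verb = betas_two_ivs(0.3, 0.0, 0.5)   # 0.4 and -0.2
```

Despite a zero correlation with success, verbal ability earns a non-zero (negative) weight, and the mech weight is pushed above its correlation – the definition of suppression.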
• High mech, low verbal
– High mech
• This is positive
– Low verbal
• Negative, because we are talking about
standardised scores
• Your mech is really high – you did well on the
mechanical test, without being good at the
words
• High mech, high verbal
because of verbal, and need to be brought
down a bit
293
Another suppressor?
x1    x2     y
x1       1    0.5   0.3
x2      0.5    1    0.2
y       0.3   0.2    1

b1 =
b2 =
294
Another suppressor?
x1    x2     y
x1    1    0.5   0.3
x2   0.5    1    0.2
y    0.3   0.2    1

b1 =0.26
b2 = 0.06
295
And another?
x1     x2     y
x1      1     0.5   0.3
x2     0.5     1    -0.2
y     0.3   -0.2    1

b1 =
b2 =
296
And another?
x1     x2     y
x1    1     0.5   0.3
x2   0.5     1    -0.2
y   0.3   -0.2    1

b1 = 0.53
b2 = -0.47
297
One more?
x1     x2     y
x1     1    -0.5   0.3
x2   -0.5     1    0.2
y     0.3    0.2    1

b1 =
b2 =
298
One more?
x1     x2     y
x1     1    -0.5   0.3
x2   -0.5     1    0.2
y     0.3    0.2    1

b1 = 0.53
b2 = 0.47
299
• Suppression happens when two opposing
forces are happening together
– And have opposite effects
• Don't throw away your IVs,
– Just because they are uncorrelated with the DV
• Be careful in interpretation of regression
estimates
– Really need the correlations too, to interpret what
is going on
– Cannot compare between studies with different
IVs

300
Standardised Estimates > 1
• Correlations are bounded
-1.00 ≤ r ≤ +1.00
– We think of standardised regression
estimates as being similarly bounded
• But they are not
– Can go >1.00, <-1.00
– R cannot, because that is a proportion of
variance
301
• Three measures of ability
– Mechanical ability, verbal ability 1, verbal
ability 2
– Score on science exam
Mech         Verbal1   Verbal2   Scores
Mech             1       0.1       0.1      0.6
Verbal1          0.1         1       0.9      0.6
Verbal2          0.1       0.9         1      0.3
Scores          0.6       0.6       0.3        1

– Before reading on, what are the parameter
estimates?
302
Mech                0.56
Verbal1               1.71
Verbal2              -1.29
• Mechanical
• Verbal 1
– Very high
• Verbal 2
– Very low
303
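These estimates can be reproduced by solving Rxx B = Rxy directly. A sketch using Gauss-Jordan elimination on the correlation matrix above (the layout of the solver is mine):

```python
Rxx = [[1.0, 0.1, 0.1],
       [0.1, 1.0, 0.9],
       [0.1, 0.9, 1.0]]
Rxy = [0.6, 0.6, 0.3]

# Solve Rxx B = Rxy on the augmented matrix
M = [row[:] + [t] for row, t in zip(Rxx, Rxy)]
for col in range(3):
    M[col] = [v / M[col][col] for v in M[col]]
    for r in range(3):
        if r != col:
            M[r] = [v - M[r][col] * p for v, p in zip(M[r], M[col])]
B = [row[3] for row in M]   # roughly [0.56, 1.71, -1.29]
```

The 0.9 correlation between the two verbal tests is what pushes the standardised estimates well outside the ±1 range.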
• What is going on
– It's a suppressor again
– An independent variable which increases
the size of the parameters associated with
other independent variables above the size
of their correlations
• Verbal 1 and verbal 2 are correlated so
highly
– They need to cancel each other out

304
Variable Selection
• What are the appropriate independent
variables to use in a model?
– Depends what you are trying to do
• Multiple regression has two separate
uses
– Prediction
– Explanation

305
• Prediction
– What will happen in the future?
– Emphasis on practical application
– Variables selected (more) empirically
– Value free
• Explanation
– Why did something happen?
– Emphasis on understanding phenomena
– Variables selected theoretically
– Not value free

306
• Visiting the doctor
– Precedes suicide attempts
– Predicts suicide
• Does not explain suicide
• More on causality later on …
• Which are appropriate variables
– To collect data on?
– To include in analysis?
– Decision needs to be based on theoretical knowledge
of the behaviour of those variables
– Statistical analysis of those variables (later)
• Unless you didn't collect the data
– Common sense (not a useful thing to say)
307
Variable Entry Techniques
• Entry-wise
– All variables entered simultaneously
• Hierarchical
– Variables entered in a predetermined order
• Stepwise
– Variables entered according to change in
R2
– Actually a family of techniques

308
• Entrywise
– All variables entered simultaneously
– All treated equally
• Hierarchical
– Entered in a theoretically determined order
– Change in R2 is assessed, and tested for
significance
– e.g. sex and age
• Should not be treated equally with other
variables
• Sex and age MUST be first
– Not to be confused with hierarchical linear
modelling
309
• Stepwise
– Variables entered empirically
– Variable which increases R2 the most goes
first
• Then the next …
– Variables which have no effect can be
removed from the equation
• Example
– IVs: Sex, age, extroversion,
– DV: Car – how long someone spends
looking after their car

310
• Correlation Matrix

SEX           AGE           EXTRO CAR
SEX            1.00         -0.05       0.40  0.66
AGE           -0.05          1.00       0.40  0.23
EXTRO          0.40          0.40       1.00  0.67
CAR            0.66          0.23       0.67  1.00

311
• Entrywise analysis
– r2 = 0.64

b       p
SEX          0.49   <0.01
AGE          0.08    0.46
EXTRO        0.44   <0.01

312
• Stepwise Analysis
– Data determines the order
– Model 1: Extroversion, R2 = 0.450
– Model 2: Extroversion + Sex, R2 = 0.633

b             p
EXTRO          0.48         <0.01
SEX           0.47         <0.01

313
• Hierarchical analysis
– Theory determines the order
– Model 1: Sex + Age, R2 = 0.510
– Model 2: S, A + E, R2 = 0.638
– Change in R2 = 0.128, p = 0.001

         b       p
SEX     0.49   <0.01
AGE     0.08    0.46
EXTRO   0.44   <0.01

314
• Which is the best model?
– Entrywise – OK
– Stepwise – excluded age
• Did have a (small) effect
– Hierarchical
• The change in R2 gives the best estimate of the
importance of extroversion
• Other problems with stepwise
– F and df are wrong (cheats with df)
– Unstable results
• Small changes (sampling variance) – large
differences in models

315
– Uses a lot of paper
– Don't use a stepwise procedure to pack

316
Is Stepwise Always Evil?
• Yes
• All right, no
• Research goal is predictive (technological)
– Not explanatory (scientific)
– What happens, not why
• N is large
– 40 people per predictor, Cohen, Cohen, Aiken,
West (2003)
• Cross validation takes place
317
A quick note on R2
• R² is sometimes regarded as the 'fit' of a
regression model
• If good fit is required – maximise R2
– Leads to entering variables which do not
make theoretical sense

318
Critique of Multiple Regression
• Goertzel (2002)
– “Myths of murder and multiple regression”
– Skeptical Inquirer (Paper B1)
• Econometrics and regression are 'junk
science'
– Multiple regression models (in US)
– Used to guide social policy

319
More Guns, Less Crime
– (controlling for other factors)
• Lott and Mustard: A 1% increase in gun
ownership
– 3.3% decrease in murder rates
• But:
– More guns in rural Southern US
– More crime in urban North (crack cocaine
epidemic at time of data)

320
Executions Cut Crime
• No difference between crimes in states
in US with or without death penalty
• Ehrlich (1975) controlled all variables
that affect crime rates
– Death penalty had effect in reducing crime
rate
• No statistical way to decide who's right

321
Legalised Abortion
• Donohue and Levitt (1999)
– Legalised abortion in the 1970s cut crime in the 1990s
• Lott and Whitley (2001)
– “Legalising abortion decreased murder rates by …
0.5 to 7 per cent.”
• It‟s impossible to model these data
– Controlling for other historical events
– Crack cocaine (again)

322
Another Critique
• Berk (2003)
– Regression analysis: a constructive critique (Sage)
• Three cheers for regression
– As a descriptive technique
• Two cheers for regression
– As an inferential technique
• One cheer for regression
– As a causal analysis

323
Is Regression Useless?
• Do regression carefully
– Don‟t go beyond data which you have a
strong theoretical understanding of
• Validate models
– Where possible, validate predictive power
of models in other areas, times, groups
• Particularly important with stepwise

324
Lesson 7: Categorical
Independent Variables

325
Introduction

326
Introduction
• So far, just looked at continuous
independent variables
• Also possible to use categorical
(nominal, qualitative) independent
variables
– e.g. Sex; Job; Religion; Region; Type (of
anything)
• Usually analysed with t-test/ANOVA

327
Historical Note
• But these (t-test/ANOVA) are special
cases of regression analysis
– Aspects of General Linear Models (GLMs)
• So why treat them differently?
– Fisher's fault
– Computers' fault
• Regression, as we have seen, is
computationally difficult
– Matrix inversion and multiplication
– Unfeasible, without a computer
328
• In the special cases where:
• You have one categorical IV
– It is much easier to do it by partitioning of
sums of squares
• These cases
– Very rare in 'applied' research
– Very common in 'experimental' research
• Fisher worked at Rothamsted agricultural
research station
• Never have problems manipulating wheat, pigs,
cabbages, etc

329
• In psychology
– Led to a split between 'experimental'
psychologists and 'correlational'
psychologists
– Experimental psychologists (until recently)
would not think in terms of continuous
variables
• Still (too) common to dichotomise a
variable
– Too difficult to analyse it properly
330
The Approach

331
The Approach
• Recode the nominal variable
– Into one, or more, variables to represent that
variable
• Names are slightly confusing
– Some texts use 'dummy coding' to refer to all
of these techniques
– Some (most) use 'dummy coding' to refer to
one of them
– Most have more than one name

332
• If a variable has g possible categories it
is represented by g-1 variables
• Simplest case:
– Smokes: Yes or No
– Variable 1 represents „Yes‟
– Variable 2 is redundant
• If it isn't yes, it's no

333
The Techniques

334
• We will examine two coding schemes
– Dummy coding
• For two groups
• For >2 groups
– Effect coding
• For >2 groups
• Look at analysis of change
– Equivalent to ANCOVA
– Pretest-posttest designs

335
Dummy Coding – 2 Groups
• Also called simple coding by SPSS
• A categorical variable with two groups
• One group chosen as a reference group
– The other group is represented in a variable
• e.g. 2 groups: Experimental (Group 1) and
Control (Group 0)
– Control is the reference group
– Dummy variable represents experimental group
• Call this variable 'group1'

336
• For variable 'group1'
– 1 = 'Yes', 0 = 'No'

Original         New
Category        Variable
Exp              1
Con              0

337
• Some data
• Group is x, score is y

Control      Experimental
Group           Group
Experiment 1                10           10
Experiment 2                10           20
Experiment 3                10           30

338
• Control Group = 0
– Intercept = Score on Y when x = 0
– Intercept = mean of control group
• Experimental Group = 1
– b = change in Y when x increases 1 unit
– b = difference between experimental
group and control group

339
[Bar chart: mean scores (0–35) for the control and experimental
groups in Experiments 1, 2 and 3; the gap between the bars
represents b – the difference between means.]
340
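The interpretation can be confirmed with a tiny dummy-coded regression. The scores below are made up for illustration (not the course's data): control mean 10, experimental mean 20:

```python
# Hypothetical scores: control-group mean is 10, experimental mean is 20
control      = [8, 10, 12]
experimental = [18, 20, 22]

x = [0] * len(control) + [1] * len(experimental)   # dummy variable
y = control + experimental

mx = sum(x) / len(x)
my = sum(y) / len(y)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
# b0 = 10.0 (control-group mean), b1 = 10.0 (difference between means)
```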
Dummy Coding – 3+ Groups
• With three groups the approach is the
similar
• g = 3, therefore g-1 = 2 variables
needed
• 3 Groups
– Control
– Experimental Group 1
– Experimental Group 2

341
Original
Category      Gp1      Gp2
Con            0        0
Gp1            1        0
Gp2            0        1

• Recoded into two variables
– Note – do not need a 3rd variable
• If we are not in group 1 or group 2 MUST be in
control group
• 3rd variable would add no information
• (What would happen to determinant?)
342
• F and associated p
– Tests H0 that

μ(g1) = μ(g2) = μ(g3)
• b1 and b2 and associated p-values
– Test difference between each experimental
group and the control group
• To test difference between
experimental groups
– Need to rerun analysis

343
• One more complication
– Have now run multiple comparisons
– Increases α – i.e. probability of type I error
• Need to correct for this
– Bonferroni correction
– Multiply given p-values by two/three
(depending on how many comparisons were made)

344
Effect Coding
• Usually used for 3+ groups
• Compares each group (except the reference
group) to the mean of all groups
– Dummy coding compares each group to the
reference group.
• Example with 5 groups
– 1 group selected as reference group
• Group 5

345
• Each group (except reference) has a
variable
– 1 if the individual is in that group
– 0 if not
– -1 if in reference group

group   group_1 group_2 group_3 group_4
1         1       0       0       0
2         0       1       0       0
3         0       0       1       0
4         0       0       0       1
5        -1      -1      -1      -1
346
Examples
• Dummy coding and Effect Coding
• Group 1 chosen as reference group
each time
• Data:
Group   Mean    SD
1       52.40   4.60
2       56.30   5.70
3       60.10   5.00
Total   56.27   5.88
347
• Dummy

Group   dummy2   dummy3
1       0        0
2       1        0
3       0        1

• Effect

Group   effect2   effect3
1       -1        -1
2       1         0
3       0         1
348
Dummy                       Effect
R=0.543, F=5.7,             R=0.543, F=5.7,
df=2, 27, p=0.009           df=2, 27, p=0.009
b0 = 52.4                   b0 = 56.27
b1 = 3.9, p=0.100           b1 = 0.03, p=0.980
b2 = 7.7, p=0.002           b2 = 3.8, p=0.007

b0 = μ(g1)                  b0 = μ(G) (grand mean)
b1 = μ(g2) − μ(g1)          b1 = μ(g2) − μ(G)
b2 = μ(g3) − μ(g1)          b2 = μ(g3) − μ(G)
349
In SPSS
• SPSS provides two equivalent procedures for
regression
– Regression (which we have been using)
– GLM (which we haven‟t)
• GLM will:
– Automatically code categorical variables
– Automatically calculate interaction terms
• GLM won‟t:
– Give standardised effects
– Give hierarchical R2 p-values
– Allow you to not understand

350
ANCOVA and Regression

351
• Test
– (Which is a trick; but it's designed to make a point)
• Use employee data.sav
– Compare the pay rise (difference between
salbegin and salary)
– For ethnic minority and non-minority staff
• What do you find?
352
ANCOVA and Regression
• Dummy coding approach has one special use
– In ANCOVA, for the analysis of change
• Pre-test post-test experimental design
– Control group and (one or more) experimental
groups
– Tempting to use difference score + t-test / mixed
design ANOVA
– Inappropriate

353
• Salivary cortisol levels
– Used as a measure of stress
– Not absolute level, but change in level over
day may be interesting
• Test at: 9.00am, 9.00pm
• Two groups
– High stress group (cancer biopsy)
• Group 1
– Low stress group (no biopsy)
• Group 0

354
AM         PM       Diff
High Stress   20.1        6.8     13.3
Low Stress    22.3       11.8     10.5

• Correlation of AM and PM = 0.493
(p=0.008)
• Has there been a significant difference
in the rate of change of salivary
cortisol?
– 3 different approaches
355
• Approach 1 – find the differences, do a
t-test
– t = 1.31, df=26, p=0.203
• Approach 2 – mixed ANOVA, look for
interaction effect
– F = 1.71, df = 1, 26, p = 0.203
– F = t2
• Approach 3 – regression (ANCOVA)
based approach

356
– IVs: AM and group
– DV: PM
– b1 (group) = 3.59, standardised b1=0.432,
p = 0.01
• Why is the regression approach better?
– The other two approaches took the
difference
– Assumes that r = 1.00
– Any difference from r = 1.00 and you add
error variance
• Subtracting error is the same as adding error

357
• Using regression
– Ensures that all the variance that is
subtracted is true
– Reduces the error variance
• Two effects
• Compensates for differences between groups
– Removes error variance

358
In SPSS
• SPSS automates all of this
– But you have to understand it, to know
what it is doing
• Use Analyse, GLM, Univariate ANOVA

359
[Screenshot: GLM dialog – outcome goes in Dependent Variable; categorical predictors in Fixed Factors; continuous predictors in Covariates; click Options]
360
[Screenshot: Options dialog – select Parameter estimates]

361
More on Change
• If difference score is correlated with
either pre-test or post-test
– Subtraction fails to remove the difference
between the scores
– If two scores are uncorrelated
• Difference will be correlated with both
• Failure to control
– Equal SDs, r = 0
• Correlation of change and pre-score = −0.707

362
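The −0.707 figure can be derived rather than simulated. Assuming equal SDs and correlation r between pre- and post-test, corr(post − pre, pre) = −√((1 − r)/2); a sketch:

```python
import math

# Sketch: correlation between a difference score (post - pre) and the
# pre-test, for equal SDs s and pre/post correlation r:
#   cov(post - pre, pre) = r*s^2 - s^2
#   var(post - pre)      = 2*s^2*(1 - r)
# so corr = (r - 1) / sqrt(2*(1 - r)) = -sqrt((1 - r) / 2).
def corr_change_with_pre(r):
    return (r - 1) / math.sqrt(2 * (1 - r))

print(round(corr_change_with_pre(0.0), 3))  # -0.707 when r = 0
```

Only when r = 1 would subtraction remove the pre-test cleanly; any smaller r leaves the change score contaminated by the pre-test.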
Even More on Change
• A topic of surprising complexity
– What I said about difference scores isn‟t
always true
• Lord's paradox – it depends on the precise question you ask
– Collins and Horn (1993). Best methods for
the analysis of change
– Collins and Sayer (2001). New methods for
the analysis of change.

363
Lesson 8: Assumptions in
Regression Analysis

364
The Assumptions
1. The distribution of residuals is normal (at
each value of the dependent variable).
2. The variance of the residuals for every set
of values for the independent variable is
equal.
• violation is called heteroscedasticity.
3. The error term is additive
•   no interactions.
4. At every value of the dependent variable
the expected (mean) value of the residuals
is zero
•   No non-linear relationships
365
5. The expected correlation between residuals,
for any two cases, is 0.
•   The independence assumption (lack of
autocorrelation)
6. All independent variables are uncorrelated
with the error term.
7. No independent variables are a perfect
linear function of other independent
variables (no perfect multicollinearity)
8. The mean of the error term is zero.

366
What are we going to do …
• Deal with some of these assumptions in
some detail
• Deal with others in passing only
– look at them again later on

367
Assumption 1: The
Distribution of Residuals is
Normal at Every Value of the
Dependent Variable

368
Look at Normal Distributions
• A normal distribution
– symmetrical, bell-shaped (so they say)

369
What can go wrong?
• Skew
– non-symmetricality
– one tail longer than the other
• Kurtosis
– too flat or too peaked
– kurtosed
• Outliers
– Individual cases which are far from the
distribution
370
Effects on the Mean
• Skew
– biases the mean, in direction of skew
• Kurtosis
– mean not biased
– standard deviation is
– and hence standard errors, and
significance tests

371
Examining Univariate
Distributions
•   Histograms
•   Boxplots
•   P-P plots
•   Calculation based methods

372
Histograms
• A and B
[Histograms of distributions A and B]
373
• C and D
[Histograms of distributions C and D]

374
• E and F
[Histograms of distributions E and F]

375
Histograms can be tricky ….
[Six small histograms]

376
Boxplots

377
P-P Plots
• A and B
[P-P plots for distributions A and B]

378
• C and D
[P-P plots for distributions C and D]

379
• E and F
[P-P plots for distributions E and F]

380
Calculation Based
• Skew and Kurtosis statistics
• Outlier detection statistics

381
Skew and Kurtosis Statistics
• Normal distribution
– skew = 0
– kurtosis = 0
• Two methods for calculation
– Fisher‟s and Pearson‟s
• Associated standard error
– can be used for significance of departure from
normality
– not actually very useful
• with N above ~400, the tests almost always report significant non-normality
382
Skewness SE Skew Kurtosis SE Kurt

A       -0.12   0.172   -0.084   0.342
B       0.271   0.172    0.265   0.342
C       0.454   0.172    1.885   0.342
D       0.117   0.172   -1.081   0.342
E       2.106   0.172     5.75   0.342
F       0.171   0.172    -0.21   0.342

383
Outlier Detection
• Calculate distance from mean
– z-score (number of standard deviations)
– deleted z-score
• that case biased the mean, so remove it
– Look up expected distance from mean
• e.g. under normality very few cases should be 3+ SDs away
• Calculate influence
– how much effect did that case have on the mean?

384
Non-Normality in Regression

385
Effects on OLS Estimates
• The mean is an OLS estimate
• The regression line is an OLS estimate
• Lack of normality
– biases the position of the regression slope
– makes the standard errors wrong
• probability values attached to statistical
significance wrong

386
Checks on Normality
• Check residuals are normally distributed
– SPSS will draw histogram and p-p plot of
residuals
• Use regression diagnostics
– Lots of them
– Most aren‟t very interesting

387
Regression Diagnostics
• Residuals
– standardised, unstandardised, studentised,
deleted, studentised-deleted
– look for cases > |3| (?)
• Influence statistics
– Look for the effect a case has
– If we remove that case, do we get a different
model?
– DFBeta, Standardised DFBeta
• changes in b

388
– DfFit, Standardised DfFit
• change in predicted value
– Covariance ratio
• Ratio of the determinants of the covariance
matrices, with and without the case
• Distances
– measures of „distance‟ from the centroid
– some include IV, some don‟t

389
More on Residuals
• Residuals are trickier than you might
have imagined
• Raw residuals
– OK
• Standardised residuals
– Residuals divided by SD

se = √( Σe² / (n − k − 1) )
390
Leverage
• But
– That SD is wrong
– Variance of the residuals is not equal
• Those further from the centroid on the
predictors have higher variance
• Need a measure of this
• Distance from the centroid is leverage,
or h (or sometimes hii)
• One predictor
– Easy
391
hi = 1/n + (xi − x̄)² / Σ(x − x̄)²
• Minimum hi is 1/n, the maximum is 1
• Except
– SPSS uses standardised leverage - h*
• It doesn‟t tell you this, it just uses it

392
h*i = hi − 1/n

h*i = (xi − x̄)² / Σ(x − x̄)²

• Minimum 0, maximum (N−1)/N

393
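For a single predictor both leverage formulas are easy to compute directly. This sketch assumes the x values in the hat-matrix example that follows run from 15 to 65 in steps of 5 (filling in the slide's "…"):

```python
# Sketch: leverage for one predictor, h_i = 1/n + (x_i - xbar)^2 / SSx.
# x is assumed to run 15..65 in steps of 5 (the slide's "..." filled in).
x = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
n = len(x)
xbar = sum(x) / n
ssx = sum((xi - xbar) ** 2 for xi in x)

h = [1 / n + (xi - xbar) ** 2 / ssx for xi in x]
h_star = [hi - 1 / n for hi in h]  # SPSS's "standardised" leverage

print(round(h[0], 3), round(h[1], 3))  # 0.318, 0.236
```

The extreme x values get the most leverage; the minimum, 1/n, occurs at x = x̄. The first values reproduce the 0.318 and 0.236 diagonal entries of the hat matrix shown on the next slide.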
• Multiple predictors
– Calculate the hat matrix (H)
– Leverage values are the diagonals of this
matrix
1
H  X(X' X) X'
– Where X is the augmented matrix of
predictors (i.e. matrix that includes the
constant)
– Hence leverage hii – element ii of H
394
• Example of calculation of hat matrix
– X has a column of 1s and a column of x values (15, 20, …, 65)
– H = X(X′X)⁻¹X′; e.g. H[1,1] = 0.318, H[1,2] = 0.273, H[2,2] = 0.236, …, and the final diagonal element is again 0.318
395
Standardised / Studentised
• Now we can calculate the standardised
residuals
– SPSS calls them studentised residuals
– Also called internally studentised residuals

e′i = ei / ( se √(1 − hi) )
396
Deleted Studentised Residuals
• Studentised residuals do not have a
known distribution
– Cannot use them for inference
• Deleted studentised residuals
– Externally studentised residuals
– Jackknifed residuals
• Distributed as t
• With df = N – k – 1
397
Testing Significance
• We can calculate the probability of a
residual
– Is it sampled from the same population
• BUT
– Massive type I error rate
– Bonferroni correct it
• Multiply p value by N

398
Bivariate Normality
• We didn‟t just say “residuals normally
distributed”
• We said “at every value of the
dependent variables”
• Two variables can be normally
distributed – univariate,
– but not bivariate

399
• Couple‟s IQs
– male and female
[Histograms of FEMALE and MALE IQ scores]

– Seem reasonably normal
400
• But wait!!
[Scatterplot of MALE against FEMALE IQ – one couple is far from the rest]

401
• When we look at bivariate normality
– not normal – there is an outlier
• So plot X against Y
• OK for bivariate
– but – may be a multivariate outlier
– Need to draw graph in 3+ dimensions
– can't (easily) draw a graph in 3+ dimensions
• But we can look at the residuals instead
…

402
• IQ histogram of residuals
[Histogram of the regression residuals]
403
Multivariate Outliers …
• Will be explored later in the exercises

• So we move on …

404
Normality
• Skew and Kurtosis
– Skew – much easier to deal with
– Kurtosis – less serious anyway
• Transform data
– removes skew
– positive skew – log transform
– negative skew - square

405
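A quick check that a log transform reduces positive skew, using made-up income-like data (Fisher-style skew computed by hand):

```python
import math

# Sketch: a log transform pulls in the long right tail of a positively
# skewed variable. Data are made up ("income-like", positively skewed).
def skew(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

income = [12, 14, 15, 16, 18, 20, 24, 30, 45, 90]
logged = [math.log(x) for x in income]

print(round(skew(income), 2), round(skew(logged), 2))  # skew shrinks
```

The logged variable is noticeably less skewed, though (as the slide warns) its unstandardised coefficients now refer to log units.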
Transformation
• May need to transform IV and/or DV
– More often DV
• time, income, symptoms (e.g. depression) all positively
skewed
– can cause non-linear effects (more later) if only
one is transformed
– alters interpretation of unstandardised parameter
– May alter meaning of variable
– May add / remove non-linear and moderator
effects
406
• Change measures
– increase sensitivity at ranges
• avoiding floor and ceiling effects
• Outliers
– Can be tricky
– Why did the outlier occur?
• Error? Delete them.
• Weird person? Probably delete them
• Normal person? Tricky.

407
– You are trying to model a process
• is the data point „outside‟ the process
• e.g. lottery winners, when looking at salary
• yawn, when looking at reaction time

– Which is better?
• A good model, which explains 99% of your
data?
• A poor model, which explains all of it
• Pedhazur and Schmelkin (1991)
– analyse the data twice

408
• We will spend much less time on the
other 6 assumptions
• Can do exercise 8.1.

409
Assumption 2: The variance of
the residuals for every set of
values for the independent
variable is equal.

410
Heteroscedasticity
• This assumption is a about
heteroscedasticity of the residuals
– Hetero=different
– Scedastic = scattered
• We don‟t want heteroscedasticity
– we want our data to be homoscedastic
• Draw a scatterplot to investigate

411
[Scatterplot of MALE against FEMALE IQ]
412
• Only works with one IV
– need every combination of IVs
• Easy to get – use predicted values
– use residuals there
• Plot predicted values against residuals
– or   standardised residuals
– or   deleted residuals
– or   standardised deleted residuals
– or   studentised residuals
• A bit like turning the scatterplot on its
side
413
Good – no heteroscedasticity
[Residual plot: residuals against predicted value, even spread]
414

[Residual plot: residuals against predicted value]
415
Testing Heteroscedasticity
•    White‟s test
–    Not automatic in SPSS (is in SAS)
–    Luckily, not hard to do
1.   Do regression, save residuals.
2.   Square residuals
3.   Square IVs
4.   Calculate interactions of IVs
– e.g. x1•x2, x1•x3, x2 • x3
416
5. Run regression using
– squared residuals as DV
– IVs, squared IVs, and interactions as IVs
6. Test statistic = N x R2
– Distributed as c2
– Df = k (for second regression)
•   Use education and salbegin to predict
salary (employee data.sav)
–   R2 = 0.113, N=474, c2 = 53.5, df=5, p <
0.0001

417
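The six steps can be sketched end-to-end in Python (made-up data, pure-Python OLS; in practice you would use SPSS or Stata as above):

```python
# Sketch of White's test: regress squared residuals on the IVs, their
# squares, and their cross-product; test statistic = N * R^2, chi-square
# with df = number of auxiliary regressors (here 5). Data are made up.
import random

def ols_resid_and_r2(y, cols):
    """OLS via normal equations; returns (residuals, R^2).
    cols: predictor columns (a constant is added here)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    k = len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]                       # X'X
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]  # X'y
    for c in range(k):                            # forward elimination
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        v[c], v[p] = v[p], v[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for cc in range(c, k):
                A[r][cc] -= f * A[c][cc]
            v[r] -= f * v[c]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):                # back substitution
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    pred = [sum(X[i][c] * b[c] for c in range(k)) for i in range(n)]
    e = [y[i] - pred[i] for i in range(n)]
    my = sum(y) / n
    r2 = 1 - sum(ei ** 2 for ei in e) / sum((yi - my) ** 2 for yi in y)
    return e, r2

random.seed(1)
n = 200
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
# Heteroscedastic DV: the error SD grows with x1.
y = [2 + 0.5 * x1[i] + 0.3 * x2[i] + random.gauss(0, 0.2 + 0.4 * x1[i])
     for i in range(n)]

e, _ = ols_resid_and_r2(y, [x1, x2])                 # steps 1-2
e2 = [ei ** 2 for ei in e]
aux = [x1, x2,
       [v ** 2 for v in x1], [v ** 2 for v in x2],   # step 3
       [x1[i] * x2[i] for i in range(n)]]            # step 4
_, r2 = ols_resid_and_r2(e2, aux)                    # step 5
white_stat = n * r2                                  # step 6: chi2, df = 5
print(round(white_stat, 1))
```

With error variance deliberately tied to x1, the statistic comes out large relative to the chi-square(5) distribution, so the test flags heteroscedasticity.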
Plot of Pred and Res
[Scatterplot: residuals against regression standardized predicted value]

418
Magnitude of
Heteroscedasticity
• Chop data into “slices”
– 5 slices, based on X (or predicted score)
• Done in SPSS
– Calculate variance of each slice
– Check ratio of smallest to largest
– Less than 10:1
• OK

419
The Visual Bander
• New in SPSS 12

420
• Variances of the 5 groups
1                           .219
2                           .336
3                           .757
4                           .751
5                          3.119

• We have a problem
– 3 / 0.2 ~= 15
421
Dealing with
Heteroscedasticity
•   Use Huber-White estimates
– Very easy in Stata
– Fiddly in SPSS – bit of a hack
•   Use Complex samples
1. Create a new variable where all cases are
equal to 1, call it const
2. Use Complex Samples, Prepare for
Analysis
3. Create a plan file

422
4.   Sample weight is const
5.   Finish
6.   Use Complex Samples, GLM
7.   Use plan file created, and set up
model as in GLM
(More on complex samples later)

In Stata, do regression as normal, and
click “robust”.

423
Heteroscedasticity –
Implications and Meanings
Implications
• What happens as a result of
heteroscedasticity?
– Parameter estimates are correct
• not biased
– Standard errors (hence p-values) are
incorrect

424
However …
• If there is no skew in predicted scores
– P-values a tiny bit wrong
• If skewed,
– P-values very wrong
• Can do exercise

425
Meaning
• What is heteroscedasticity trying to tell
us?
– Our model is wrong – it is misspecified
– Something important is happening that we
have not accounted for
• e.g. amount of money given to charity
(given)
– depends on:
• earnings
• degree of importance person assigns to the
charity (import)
426
• Do the regression analysis
– R2 = 0.60, F=31.4, df=2, 37, p < 0.001
• seems quite good
– b0 = 0.24, p=0.97
– b1 = 0.71, p < 0.001
– b2 = 0.23, p = 0.031
• White‟s test
– c2 = 18.6, df=5, p=0.002
• The plot of predicted values against
residuals …
427
• Plot shows heteroscedastic relationship
428
• Which means …
– the effects of the variables are not additive
– If you think that what a charity does is
important
• you might give more money
• how much more depends on how much money
you have

429
[Plot: GIVEN against IMPORT, separate lines for High and Low Earnings – the slope is steeper for high earners]
430
• A note on heteroscedasticity
– it is the equivalent of homogeneity of
variance in ANOVA/t-tests

431
Assumption 3: The Error Term is Additive

432
• What heteroscedasticity shows you
– effects of variables need to be additive
• Heteroscedasticity doesn‟t always show it to
you
– can test for it, but hard work
– (same as homogeneity of covariance assumption
in ANCOVA)
• Have to know it from your theory
• A specification error

433
• Two IVs
– Alcohol has sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– Some painkillers have sedative effect
• A bit makes you a bit tired
• A lot makes you very tired
– A bit of alcohol and a bit of painkiller
doesn‟t make you very tired
– Effects multiply together, don‟t add
together
434
• If you don‟t test for it
– It‟s very hard to know that it will happen
• So many possible non-additive effects
– Cannot test for all of them
– Can test for obvious
• In medicine
– Choose to test for salient non-additive
effects
– e.g. sex, race

435
Assumption 4: At every value of
the dependent variable the
expected (mean) value of the
residuals is zero

436
Linearity
• Relationships between variables should be
linear
– best represented by a straight line
• Not a very common problem in social
sciences
– except economics
– measures are not sufficiently accurate to make a
difference
• R2 too low
• unlike, say, physics

437
• Relationship between speed of travel
and fuel used
[Chart: fuel used against speed – a curved relationship]
438
• R2 = 0.938
– looks pretty good
– know speed, make a good prediction of
fuel
• BUT
– look at the chart
– if we know speed we can make a perfect
prediction of fuel used
– R2 should be 1.00

439
Detecting Non-Linearity
• Residual plot
– just like heteroscedasticity
• Using this example
– very, very obvious
– usually pretty obvious

440
Residual plot
[Residual plot showing a clear curve]
441
• Linearity = additivity along the range of the
IV
• Jeremy rides his bicycle harder
– Increase in speed depends on current speed
– MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in
multiple regression. Psychological Bulletin.

442
Assumption 5: The expected
correlation between residuals, for
any two cases, is 0.

The independence assumption (lack of
autocorrelation)

443
Independence Assumption
• Also: lack of autocorrelation
• Tricky one
– often ignored
– exists for almost all tests
• All cases should be independent of one
another
– knowing the value of one case should not tell you
anything about the value of other cases

444
How is it Detected?
• Can be difficult
– need some clever statistics (multilevel
models)
• Better off avoiding situations where it
arises
• Residual Plots
• Durbin-Watson Test

445
Residual Plots
• Were data collected in time order?
– If so plot ID number against the residuals
– Look for any pattern
• Test for linear relationship
• Non-linear relationship
• Heteroscedasticity

446
[Plot: residuals against participant number]

447
How does it arise?
Two main ways
• time-series analyses
– When cases are time periods
• weather on Tuesday and weather on Wednesday
correlated
• inflation 1972, inflation 1973 are correlated
• clusters of cases
– patients treated by three doctors
– children from different classes
– people assessed in groups

448
Why does it matter?
• Standard errors can be wrong
– therefore significance tests can be wrong
• Parameter estimates can be wrong
– really, really wrong
– from positive to negative
• An example
– students do an exam (on statistics)
– choose one of three questions
• IV: time

449
• Result, with line of best fit
[Scatterplot: grade against time, with the line of best fit sloping upward]
450
• Result shows that
– people who spent longer in the exam got
better grades
• BUT …
– we haven‟t considered which question
– we might have violated the independence
assumption
• DV will be autocorrelated
• Look again
– with questions marked
451
• Now somewhat different

[Scatterplot: grade against time, points marked by question 1, 2, 3]
452
• Now, people that spent longer got lower
grades within each question
– do a hard one, get better grade
– if you can do it, you can do it quickly
• Very difficult to analyse well
– need multilevel models

453
Durbin Watson Test
• Not well implemented in SPSS
• Depends on the order of the data
– Reorder the data, get a different result
• Doesn‟t give statistical significance of
the test

454
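For reference, the statistic itself (which SPSS reports without a p-value) is straightforward:

```python
# Sketch: the Durbin-Watson statistic,
#   d = sum((e_t - e_{t-1})^2) / sum(e_t^2),
# is about 2 when residuals are independent, below 2 with positive
# autocorrelation, above 2 with negative autocorrelation.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(et ** 2 for et in e)

alternating = [1, -1, 1, -1, 1, -1]  # sign flips: d above 2
clustered = [1, 1, 1, -1, -1, -1]    # runs of same sign: d below 2
print(round(durbin_watson(alternating), 2), round(durbin_watson(clustered), 2))
```

Because d depends on the case ordering, reordering the data file changes the result, which is the reorder problem the slide describes.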
Assumption 6: All independent
variables are uncorrelated
with the error term.

455
Uncorrelated with the Error
Term

• A curious assumption
– by definition, the residuals are uncorrelated
with the independent variables (try it and
see, if you like)
• It is about the DV
– the DV must have no effect back on the
IVs (reverse causation)
456
• Problem in economics
– Demand increases supply
– Supply increases wages
– Higher wages increase demand
• OLS estimates will be (badly) biased in
this case
– need a different estimation procedure
– two-stage least squares
• simultaneous equation modelling

457
Assumption 7: No independent
variables are a perfect linear
function of other independent
variables

no perfect multicollinearity

458
No Perfect Multicollinearity
• IVs must not be linear functions of one
another
– matrix of correlations of IVs is not positive definite
– cannot be inverted
– analysis cannot proceed
• Have seen this with
– age, age start, time working
– also occurs with subscale and total

459
• Large amounts of collinearity
– a problem (as we shall see) sometimes
– not an assumption

460
Assumption 8: The mean of the
error term is zero.

You will like this one.

461
Mean of the Error Term = 0
• Mean of the residuals = 0
• That is what the constant is for
– if the mean of the error term deviates from
zero, the constant soaks it up

Y = β0 + β1x1 + ε
Y = (β0 + 3) + β1x1 + (ε − 3)
– note: Greek letters because we are talking about the population
462
• Can do regression without the constant
– E.g R2 = 0.995, p < 0.001
• Looks good

463
[Scatterplot of y against x1 with the regression line forced through the origin]
464
465
Lesson 9: Issues in
Regression Analysis

Things that alter the
interpretation of the regression
equation

466
The Four Issues
•   Causality
•   Sample sizes
•   Collinearity
•   Measurement error

467
Causality

468
What is a Cause?
• Debate about definition of cause
– some statistics (and philosophy) books try
to avoid it completely
– We are not going into depth
• just going to show why it is hard
• Two dimensions of cause
– Ultimate versus proximal cause
– Determinate versus probabilistic
469
Proximal versus Ultimate
• Why am I here?
– I walked here because
– This is the location of the class because
– Eric Tanenbaum asked me because
– (I don‟t know)
– because I was in my office when he rang
because
– I am a lecturer at York because
– I saw an advert in the paper because

470
– I exist because
– My parents met because
– My father had a job …

• Proximal cause
– the direct and immediate cause of
something
• Ultimate cause
– the thing that started the process off
– I fell off my bicycle because of the bump
– I fell off because I was going too fast

471
Determinate versus Probabilistic
Cause
• Why did I fall off my bicycle?
– I was going too fast
– But every time I ride too fast, I don‟t fall
off
– Probabilistic cause
• Why did my tyre go flat?
– A nail was stuck in my tyre
– Every time a nail sticks in my tyre, the tyre
goes flat
– Deterministic cause
472
• Can get into trouble by mixing them
together
– Eating deep fried Mars Bars and doing no
exercise are causes of heart disease
– “My Grandad ate three deep fried Mars
Bars every day, and the most exercise he
ever got was when he walked to the shop”
– (Deliberately?) confusing deterministic and
probabilistic causes

473
Criteria for Causation
• Association
• Direction of Influence
• Isolation

474
Association
• Correlation does not mean causation
– we all know
• But
– Causation does mean correlation
• Need to show that two things are related
– may be correlation
– my be regression when controlling for third (or
more) factor

475
• Relationship between price and sales
– suppliers may be cunning
– when people want it more
• stick the price up

         Price   Demand   Sales
Price    1       0.6      0
Demand   0.6     1        0.6
Sales    0       0.6      1

– So – no relationship between price
and sales
476
– Until (or course) we control for demand
– b1 (Price) = -0.56
– b2 (Demand) = 0.94
• But which variables do we enter?

477
Direction of Influence
• Relationship between A and B
– three possible processes

A               B       A causes B

A               B      B causes A

A               B     C causes A & B

C                            478
• How do we establish the direction of
influence?
– Longitudinally?

Barometer drops  →  Storm

– Now if we could just get that barometer
needle to stay where it is …

• Where the role of theory comes in
(more on this later)
479
Isolation
• Isolate the dependent variable from all
other influences
– as experimenters try to do
• Cannot do this
– can statistically isolate the effect
– using multiple regression

480
Role of Theory
• Strong theory is crucial to making
causal statements
• Fisher said: to make causal statements
– don‟t rely purely on statistical analysis
• Need strong theory to guide analyses
– what critics of non-experimental research
don‟t understand

481
• S.J. Gould – a critic
– says correlate price of petrol and his age,
for the last 10 years
– find a correlation
– Ha! (He says) that doesn't mean there is a
causal relationship
– Of course not! (We say).
• No social scientist would do that analysis
without first thinking (very hard) about the
possible causal relations between the variables
of interest
• Would control for time, prices, etc …

482
• Atkinson, et al. (1996)
– relationship between college grades and
number of hours worked
– negative correlation
– Need to control for other variables –
ability, intelligence
• Gould says “Most correlations are non-
causal” (1982, p243)
– Of course!!!!

483
[Diagram: "I drink a lot of beer" causes laugh, toilet, vomit, karaoke, curtains closed, sleeping, headache, equations (beermat), thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys – 16 causal relations, 120 non-causal correlations]
484
• Abelson (1995) elaborates on this
– „method of signatures‟
• A collection of correlations relating to
the process
– the „signature‟ of the process
• e.g. tobacco smoking and lung cancer
– can we account for all of these findings
with any other theory?

485
1.   The longer a person has smoked cigarettes, the
greater the risk of cancer.
2.   The more cigarettes a person smokes over a given
time period, the greater the risk of cancer.
3.   People who stop smoking have lower cancer rates
than do those who keep smoking.
4.   Smoker‟s cancers tend to occur in the lungs, and be of
a particular type.
5.   Smokers have elevated rates of other diseases.
6.   People who smoke cigars or pipes, and do not usually
inhale, have abnormally high rates of lip cancer.
7.   Smokers of filter-tipped cigarettes have lower cancer
rates than other cigarette smokers.
8.   Non-smokers who live with smokers have elevated
cancer rates.
(Abelson, 1995: 183-184)
486
– In addition, should be no anomalous
correlations
• If smokers had more fallen arches than non-
smokers, not consistent with theory
• Failure to use theory to select
appropriate variables
– specification error
– e.g. in previous example
– Predict wealth from price and sales
• increase price, price increases
• Increase sales, price increases

487
• Sometimes these are indicators of the
process
– e.g. barometer – stopping the needle won‟t
help
– e.g. inflation? Indicator or cause?

488
No Causation without
Experimentation
• Blatantly untrue
– I don‟t doubt that the sun shining makes
us warm
• Why the aversion?
– Pearl (2000) says problem is no
mathematical operator
– No one realised that you needed one
– Until you build a robot

489
AI and Causality
• A robot needs to make judgements
• Needs to have a mathematical
representation of causality
– Suddenly, a problem!
– Doesn‟t exist
• Most operators are non-directional
• Causality is directional
490
Sample Sizes

“How many subjects does it take
to run a regression analysis?”

491
Introduction
• Social scientists don‟t worry enough about the
sample size required
– “Why didn‟t you get a significant result?”
– “I didn‟t have a large enough sample”
• More recently awareness of sample size is
increasing
– use too few – no point doing the research
– use too many – waste participants' time
492
• Research funding bodies
• Ethical review panels
– both become more interested in sample
size calculations
• We will look at two approaches
– Rules of thumb (quite quickly)
– Power Analysis (more slowly)

493
Rules of Thumb
• Lots of simple rules of thumb exist
– 10 cases per IV
– >100 cases
– Green (1991) more sophisticated
• To test significance of R2 – N = 50 + 8k
• To test sig of slopes, N = 104 + k
• Rules of thumb don‟t take into account
all the information that we have
– Power analysis does

494
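Green's (1991) rules of thumb are easy to compute. A minimal Python sketch (the function names are mine):

```python
def green_n_for_r2(k):
    """Minimum N to test the significance of R^2 with k predictors (Green, 1991)."""
    return 50 + 8 * k

def green_n_for_slopes(k):
    """Minimum N to test the significance of individual slopes (Green, 1991)."""
    return 104 + k

# With 6 predictors:
print(green_n_for_r2(6))      # 98
print(green_n_for_slopes(6))  # 110
```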
Power Analysis
Introducing Power Analysis
• Hypothesis test
– tells us the probability of a result of that
magnitude occurring, if the null hypothesis
is correct (i.e. there is no effect in the
population)
• Doesn‟t tell us
– the probability of that result, if the null
hypothesis is false
495
• According to Cohen (1982) all null
hypotheses are false
– everything that might have an effect, does
have an effect
• it is just that the effect is often very tiny

496
Type I Errors
• Type I error is false rejection of H0
• Probability of making a type I error
– α (alpha) – the significance value cut-off
• usually 0.05 (by convention)
• Always this value
• Not affected by
– sample size
– type of test

497
Type II errors
• Type II error is false acceptance of the
null hypothesis
– Much, much trickier
• We think we have some idea
– we almost certainly don‟t
• Example
– I do an experiment (random sampling, all
assumptions perfectly satisfied)
– I find p = 0.05

498
– You repeat the experiment exactly
• different random sample from same population
– What is probability you will find p < 0.05?
– ………………
– Another experiment, I find p = 0.01
– Probability you find p < 0.05?
– ………………
• Very hard to work out
– not intuitive
– need to understand non-central sampling
distributions (more in a minute)
499
• Probability of type II error = beta (β)
– same symbol as the population regression
parameter (to be confusing)
• Power = 1 – β
– Probability of getting a significant result

500
State of the World

                                    H0 true            H0 false
                                    (no effect to      (effect to
                                    be found)          be found)

Research    H0 true (we find       correct            Type II error
Findings    no effect – p > 0.05)  decision           p = β

            H0 false (we find      Type I error       correct decision
            an effect – p < 0.05)  p = α              power = 1 – β
501
• Four parameters in power analysis
– α – prob. of Type I error
– β – prob. of Type II error (power = 1 – β)
– Effect size – size of effect in population
–N
• Know any three, can calculate the
fourth
– Look at them one at a time

502
•   α – Probability of Type I error
– Usually set to 0.05
– Somewhat arbitrary
• sometimes adjusted because of circumstances
– rarely because of power analysis
– May want to adjust it, based on power
analysis

503
• β – Probability of type II error
– Power (probability of finding a result)
= 1 – β
– Standard is 80%
• Some argue for 90%
– Implication that Type I error is 4 times
more serious than type II error
• adjust ratio with compromise power analysis

504
•   Effect size in the population
– Most problematic to determine
– Three ways
1. What effect size would be useful to find?
•   R2 = 0.01 - no use (probably)
2. Base it on previous research
– what have other people found?
3. Use Cohen‟s conventions
– small R2 = 0.02
– medium R2 = 0.13
– large R2 = 0.26

505
– Effect size usually measured as f²
– For R²:

f² = R² / (1 - R²)

506
– For (standardised) slopes:

f² = sr²ᵢ / (1 - R²)

– Where sr² is the contribution to the
variance accounted for by the variable of
interest
– i.e. sr² = R² (with variable) – R² (without)
• change in R² in hierarchical regression

507
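As a check on these formulas, a minimal Python sketch (the function names are mine):

```python
def f2_from_r2(r2):
    """Cohen's f^2 effect size for the overall R^2: f^2 = R^2 / (1 - R^2)."""
    return r2 / (1 - r2)

def f2_from_sr2(sr2, r2):
    """f^2 for one variable's contribution, where sr^2 is the change in R^2."""
    return sr2 / (1 - r2)

# Cohen's "medium" R^2 of 0.13 corresponds to roughly f^2 = 0.15:
print(round(f2_from_r2(0.13), 3))        # 0.149
print(round(f2_from_sr2(0.05, 0.30), 3)) # 0.071
```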
• N – the sample size
– usually use other three parameters to
determine this
– sometimes adjust other parameters (α)
based on this
– e.g. You can have 50 participants. No
more.

508
Doing power analysis
• With power analysis program
– SamplePower, GPower, Nquery

• With SPSS MANOVA
– using non-central distribution functions
– Uses MANOVA syntax
• Relies on the fact you can do anything with
MANOVA
• Paper B4

509
Underpowered Studies
• Research in the social sciences is often
underpowered
– Why?
– See Paper B11 – “the persistence of
underpowered studies”

510
• Power traditionally focuses on p values
– Paper B8 – “Obtaining regression
coefficients that are accurate, not simply
significant”

511
Collinearity

512
Collinearity as Issue and
Assumption
• Collinearity (multicollinearity)
– the extent to which the independent
variables are (multiply) correlated
• If R2 for any IV, using other IVs = 1.00
– perfect collinearity
– variable is linear sum of other variables
– regression will not proceed
– (SPSS will arbitrarily throw out a variable)
513
• R2 < 1.00, but high
– other problems may arise
• Four things to look at in collinearity
– meaning
– implications
– detection
– actions

514
Meaning of Collinearity
• Literally „co-linearity‟
– lying along the same line
• Perfect collinearity
– when some IVs predict another
– Total = S1 + S2 + S3 + S4
– S1 = Total – (S2 + S3 + S4)
– rare

515
• Less than perfect
– when some IVs are close to predicting
– correlations between IVs are high (usually,
but not always)

516
Implications
• Affects the stability of the parameter
estimates
– and so the standard errors of the
parameter estimates
– and so the significance
• Because
– shared variance, which the regression
procedure doesn‟t know where to put
517
• Red cars have more accidents than
other coloured cars
– because of the effect of being in a red car?
– because of the kind of person that drives a
red car?
• we don‟t know
– No way to distinguish between these three:
Accidents = 1 x colour + 0 x person
Accidents = 0 x colour + 1 x person
Accidents = 0.5 x colour + 0.5 x person

518
• Sex differences
– due to genetics?
– due to upbringing?
– (almost) perfect collinearity
• statistically impossible to tell

519
• When collinearity is less than perfect
– increases variability of estimates between
samples
– estimates are unstable
– reflected in the variances, and hence
standard errors

520
Detecting Collinearity
• Look at the parameter estimates
– large standardised parameter estimates
(>0.3?), which are not significant
• be suspicious
• Run a series of regressions
– each IV as DV
– all other IVs as IVs
• for each IV
521
• Sounds like hard work?
– SPSS does it for us!
– Tolerance – calculated for every IV

Tolerance = 1 - R²

– Variance Inflation Factor
• √VIF is the factor by which the s.e. has been increased

VIF = 1 / Tolerance
522
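For the special case of two IVs, the R² from regressing one on the other is just r², so tolerance and VIF can be computed by hand. A minimal sketch (function names are mine):

```python
def tolerance(r_squared):
    """Tolerance of an IV: 1 - R^2 from regressing it on the other IVs."""
    return 1 - r_squared

def vif(r_squared):
    """Variance inflation factor = 1 / tolerance."""
    return 1 / tolerance(r_squared)

# Two IVs correlated at r = 0.9:
r = 0.9
print(round(tolerance(r ** 2), 2))  # 0.19
print(round(vif(r ** 2), 2))        # 5.26
```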
Actions
What you can do about collinearity
“no quick fix” (Fox, 1991)
1. Get new data
•   avoids the problem
•   address the question in a different way
•   e.g. find people who have been raised as
the „wrong‟ gender
•   exist, but rare
•   Not a very useful suggestion
523
2. Collect more data
•   not different data, more data
•   collinearity increases standard error (se)
•   se decreases as N increases
•   get a bigger N
3. Remove / Combine variables
•   If an IV correlates highly with other IVs
•   Not telling us much new
•   If you have two (or more) IVs which are
very similar
•   e.g. 2 measures of depression, socio-
economic status, achievement, etc
524
•   sum them, average them, remove one
•   Many measures
•   use principal components analysis to reduce
them
4. Use stepwise regression (or some
flavour of)
•   Can be useful in theoretical vacuum
5. Ridge regression
•   not very useful
•   behaves weirdly

525
Measurement Error

526
What is Measurement Error
• In social science, it is unlikely that we
measure any variable perfectly
– measurement error represents this
imperfection
• We assume that we have a true score
– T
• A measure of that score
–x

527
x T e
• just like a regression equation
– standardise the parameters
– T is the reliability
• the amount of variance in x which comes from T
• but, like a regression equation
– assume that e is random and has mean of zero
– more on that later

528
Simple Effects of
Measurement Error
• Lowers the measured correlation
– between two variables
• Real correlation
– true scores (x* and y*)
• Measured correlation
– measured scores (x and y)

529
[Path diagram: the true scores x* and y* are linked by the true correlation rx*y*; x* is measured by x (reliability rxx) and y* by y (reliability ryy), each with an error term e; the measured scores x and y are linked by the measured correlation rxy]

530
• Attenuation of correlation

rxy = rx*y* × √(rxx × ryy)

• Attenuation corrected correlation

rx*y* = rxy / √(rxx × ryy)
531
• Example

rxx = 0.7, ryy = 0.8, rxy = 0.3

rx*y* = 0.3 / √(0.7 × 0.8) = 0.40

532
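The same example in Python (a minimal sketch; the function name is mine):

```python
import math

def attenuation_corrected(r_xy, r_xx, r_yy):
    """Estimate the true-score correlation from the measured correlation
    and the two reliabilities: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# rxx = 0.7, ryy = 0.8, measured rxy = 0.3:
print(round(attenuation_corrected(0.3, 0.7, 0.8), 2))  # 0.4
```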
Complex Effects of
Measurement Error
• Really horribly complex
• Measurement error reduces correlations
– reduces the estimate of β
– reducing one estimate
• increases others
– because of effects of control
– combined with effects of suppressor
variables
– exercise to examine this
533
Dealing with Measurement
Error

• Attenuation correction
– very dangerous
– not recommended
• Avoid in the first place
– use reliable measures
• don‟t categorise
• Age: 10-20, 21-30, 31-40 …

534
Complications
• We assume measurement error is linear
– but it may not be – e.g. weight: people may under-report /
over-report at the extremes
• Non-linear error
– particularly likely when using proxy variables

535
• e.g. proxy measures
– Want to know effort on childcare, count
number of children
• 1st child is more effort than last
– Want to know financial status, count
income
• 1st £10 much greater effect on financial status
than the 1000th.

536
Lesson 10: Non-Linear
Analysis in Regression

537
Introduction
• Non-linear effect occurs
– when the effect of one independent
variable
– is not consistent across the range of the IV
• Assumption is violated
– expected value of residuals = 0
– no longer the case

538
Some Examples

539
A Learning Curve
Skill

Experience    540
Performance
Yerkes-Dodson Law of Arousal

Arousal              541
Enthusiasm Levels over a
Lesson on Regression
Enthusiastic
Suicidal

0                              3.5
Time                   542
• Learning
– line changed direction once
• Yerkes-Dodson
– line changed direction once
• Enthusiasm
– line changed direction twice

543
Everything is Non-Linear
• Every relationship we look at is non-
linear, for two reasons
– Outcomes are bounded – exam results cannot
keep increasing indefinitely
• Linear in the range we examine
– For small departures from linearity
• Cannot detect the difference
• Non-parsimonious solution

544
Non-Linear Transformations

545
Bending the Line
• Non-linear regression is hard
– We cheat, and linearise the data
• Do linear regression
Transformations
• We need to transform the data
– rather than estimating a curved line
• which would be very difficult
• may not work with OLS
– we can take a straight line, and bend it
– or take a curved line, and straighten it
• back to linear (OLS) regression

546
• We still do linear regression
– Linear in the parameters
– y = b1x + b2x² + …
• Can also do genuinely non-linear regression
– Non-linear in the parameters
– e.g. y = b1x^(b2)
• Much trickier
– Statistical theory either breaks down OR
becomes harder

547
• Linear transformations
– multiply by a constant
– change the slope and the intercept

548
[Plot of linear transformations: the line y = x, with y = 2x (changed slope) and y = x + 3 (changed intercept)]
549
• Linear transformations are no use
– alter the slope and intercept
– don‟t alter the standardised parameter
estimate
• Non-linear transformation
– will bend the slope
y = x²
– one change of direction

550
– Cubic transformation
y = x² + x³
– two changes of direction

551

y = 0 + 0.1x + 1x²

552
Square Root Transformation

y = 20 - 3x + 5√x

553
Cubic Transformation

y = 3 - 4x + 2x² - 0.2x³

[Plot of the cubic curve, x from 0 to 6]

554
Logarithmic Transformation

y = 1 + 0.1x + 10log(x)

555
Inverse Transformation

y = 20 -10x + 8(1/x)

556
• To estimate a non-linear regression
– we don‟t actually estimate anything non-
linear
– we transform the x-variable to a non-linear
version
– can estimate that straight line
– represents the curve
– we don‟t bend the line, we stretch the
space around the line, and make it flat

557
Detecting Non-linearity

558
Draw a Scatterplot
• Draw a scatterplot of y plotted against x
– see if it looks a bit non-linear
– e.g. Anscombe‟s data
– e.g. Education and beginning salary
• from bank data
• drawn in SPSS
• with line of best fit

559
• Anscombe (1973)
– constructed a set of datasets
– show the importance of graphs in
regression/correlation
• For each dataset:

N                                       11
Mean of x                                9
Mean of y                              7.5
Equation of regression line   y = 3 + 0.5x
Sum of squares (x - mean)              110
Correlation coefficient               0.82
R²                                    0.67
560
561
562
563
564
A Real Example
• Starting salary and years of education
– From employee data.sav

565
[Scatterplot of beginning salary against Educational Level (years), with line of best fit: in parts of the range the expected value of the error (residual) is > 0, in others it is < 0]
566
Use Residual Plot
• Scatterplot is only good for one variable
– use the residual plot (that we used for
heteroscedasticity)
• Good for many variables

567
• We want
– points to lie in a nice straight sausage

568
• We don‟t want
– a nasty bent sausage

569
• Educational level and starting salary
[Residual plot: standardised residuals (-2 to 10) against standardised predicted values (-2 to 3), forming a curved band]

570
Carrying Out Non-Linear
Regression

571
Linear Transformation
• Linear transformation doesn‟t change
– interpretation of slope
– standardised slope
– se, t, or p of slope
– R2
• Can change
– effect of a transformation

572
• Actually more complex
– with some transformations can add a
constant with no effect (e.g. quadratic)
• With others does have an effect
– inverse, log
• Sometimes it is necessary to add a
constant
– negative numbers have no square root
– 0 has no log

573
Education and Salary
Linear Regression
• Saw previously that the assumption of
expected errors = 0 was violated
• Anyway …
– R2 = 0.401, F=315, df = 1, 472, p < 0.001
– salbegin = -6290 + 1727  educ
– Standardised
• b1 (educ) = 0.633
– Both parameters make sense
574
Non-linear Effect
• Compute new variable
– educ2 = educ²
• Add this variable to the equation
– R2 = 0.585, p < 0.001
– salbegin = 46263 - 6542 × educ + 310 × educ²
• slightly curious
– Standardised
• b1 (educ) = -2.4
• b2 (educ2) = 3.1
– What is going on?

575
• Collinearity
– is what is going on
– Correlation of educ and educ2
• r = 0.990
– Regression equation becomes difficult
(impossible?) to interpret
• Need hierarchical regression
– what is the change in R2
– is that change significant?
– R2 (change) = 0.184, p < 0.001

576
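The near-perfect correlation between a variable and its square is easy to verify. A small pure-Python sketch with a hypothetical education range (8–21 years, not the actual bank data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

educ = list(range(8, 22))       # hypothetical years of education
educ2 = [e ** 2 for e in educ]  # the squared term added to the model
print(pearson_r(educ, educ2))   # close to 1: educ and educ^2 nearly collinear
```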
Cubic Effect
• While we are at it, let‟s look at the cubic
effect
– R2 (change) = 0.004, p = 0.045
– salbegin = 19138 + 103 × e - 206 × e² + 12 × e³
– Standardised:
b1(e) = 0.04
b2(e2) = -2.04
b3(e3) = 2.71

577
Fourth Power
• Keep going while we are ahead
– won‟t run
• ???
• Collinearity is the culprit
– Tolerance (educ4) = 0.000005
– VIF = 215555
• Matrix of correlations of IVs is not
positive definite
– cannot be inverted

578
Interpretation
• Tricky, given that parameter estimates
are a bit nonsensical
• Two methods
• 1: Use R2 change
– Save predicted values
• or calculate predicted values to plot line of best
fit
– Save them from equation
– Plot against IV

579
[Plot of predicted beginning salary (0–50000) against Education (Years, 8–22), showing the linear and cubic lines of best fit]

580
• Differentiate with respect to e
• We said:
s = 19138 + 103 × e - 206 × e² + 12 × e³
– but first we will simplify it to the quadratic:
s = 46263 - 6542 × e + 310 × e²

• dy/dx = -6542 + 310 × 2 × e

581
Education   Slope
 9           -962
10           -342
11            278
12            898
13           1518
14           2138
15           2758
16           3378
17           3998
18           4618
19           5238
20           5858

1 year of education at the higher end of the scale is better than 1 year at the lower end of the scale. MBA versus GCSE.
582
• Differentiate the cubic:
19138 + 103 × e - 206 × e² + 12 × e³

dy/dx = 103 - 206 × 2 × e + 12 × 3 × e²

• Can calculate slopes for quadratic and
cubic at different values

583
Education   Quadratic slope   Cubic slope
 9           -962              -689
10           -342              -417
11            278               -73
12            898               343
13           1518               831
14           2138              1391
15           2758              2023
16           3378              2727
17           3998              3503
18           4618              4351
19           5238              5271
20           5858              6263
584
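These slopes can be checked in a couple of lines of Python (function names are mine; the coefficients are the quadratic and cubic estimates above):

```python
def quadratic_slope(e):
    """dy/dx of salbegin = 46263 - 6542*e + 310*e^2."""
    return -6542 + 310 * 2 * e

def cubic_slope(e):
    """dy/dx of salbegin = 19138 + 103*e - 206*e^2 + 12*e^3."""
    return 103 - 206 * 2 * e + 12 * 3 * e ** 2

print(quadratic_slope(9))   # -962
print(cubic_slope(9))       # -689
print(quadratic_slope(20))  # 5858
print(cubic_slope(20))      # 6263
```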
A Quick Note on
Differentiation
• For y = xᵖ
– dy/dx = pxᵖ⁻¹
• For equations such as
y = b1x + b2xᵖ
dy/dx = b1 + b2pxᵖ⁻¹

• y = 3x + 4x²
– dy/dx = 3 + 4 × 2 × x
585
• y = b1x + b2x² + b3x³
– dy/dx = b1 + b2 × 2x + b3 × 3x²

• y = 4x + 5x² + 6x³
• dy/dx = 4 + 5 × 2 × x + 6 × 3 × x²

• Many functions are simple to
differentiate
– Not all though

586
Automatic Differentiation
• If you
– Don‟t know how to differentiate
– Can‟t be bothered to look up the function
• Can use automatic differentiation
software

587
588
Lesson 11: Logistic Regression

Dichotomous/Nominal Dependent
Variables

589
Introduction
• Often in social sciences, we have a
dichotomous/nominal DV
– we will look at dichotomous first, then a quick look
at multinomial
• Dichotomous DV
• e.g.
–   guilty/not guilty
–   pass/fail
–   won/lost
590
Why Won‟t OLS Do?

591
Example: Passing a Test
• Test for bus drivers
– pass/fail
– we might be interested in degrees of pass/fail
• a company which trains them will not be
• fail means 'pay for them to take it again'
• Develop a selection procedure
– Two predictor variables
– Score – Score on an aptitude test
– Exp – Relevant prior experience (months)

592
• 1st ten cases
Score      Exp   Pass
5         6     0
1         15    0
1         12    0
4         6     0
1         15    1
1         6     0
4         16    1
1         10    1
3         12    0
4         26    1
593
• DV
– pass (1 = Yes, 0 = No)
• Just consider score first
– Carry out regression
– Score as IV, Pass as DV
– R2 = 0.097, F = 4.1, df = 1, 48, p = 0.028.
– b0 = 0.190
– b1 = 0.110, p=0.028
• Seems OK

594
• Or does it? …
• 1st Problem – P-P plot of residuals

[Normal P-P plot of the residuals: expected against observed cumulative probability, clearly deviating from the diagonal]

595
• 2nd problem - residual plot

596
• Problems 1 and 2
– strange distributions of residuals
– parameter estimates may be wrong
– standard errors will certainly be wrong

597
• 3rd problem – interpretation
– I score 2 on the aptitude test
– Pass = 0.190 + 0.110 × 2 = 0.41
– I score 8 on the test
– Pass = 0.190 + 0.110 × 8 = 1.07
• Seems OK, but
– What does it mean?
– Cannot score 0.41 or 1.07
• can only score 0 or 1
• Cannot be interpreted
– need a different approach
598
A Different Approach
Logistic Regression

599
Logit Transformation
• In lesson 10, transformed IVs
– now transform the DV
• Need a transformation which gives us
– graduated scores (between 0 and 1)
– No upper limit
• we can‟t predict someone will pass twice
– No lower limit
• you can‟t do worse than fail
600
Step 1: Convert to Probability
• First, stop talking about values
– for each value of score, calculate
probability of pass
• Solves the problem of graduated scales

601
Score          1     2     3     4     5
Fail   N       7     5     6     4     2
       P     0.7   0.5   0.6   0.4   0.2
Pass   N       3     5     4     6     8
       P     0.3   0.5   0.4   0.6   0.8

The probability of failure given a score of 1 is 0.7; the probability of passing given a score of 5 is 0.8.
602
This is better
• Now a score of 0.41 has a meaning
– a 0.41 probability of pass
• But a score of 1.07 has no meaning
– cannot have a probability > 1 (or < 0)
– Need another transformation

603
Step 2: Convert to Odds-Ratio
Need to remove upper limit
• Convert to odds
• Odds, as used by betting shops
– 5:1, 1:2
• Slightly different from odds in speech
– a 1 in 2 chance
– odds are 1:1 (evens)
– 50%
604
• Odds ratio = (number of times it
happened) / (number of times it didn't
happen)

odds ratio = p(event) / p(not event) = p(event) / (1 - p(event))

605
• p = 0.8: odds = 0.8/0.2 = 4
– equivalent to 4:1 (odds on)
– 4 times out of five
• p = 0.2: odds = 0.2/0.8 = 0.25
– equivalent to 1:4 (4:1 against)
– 1 time out of five

606
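Steps 1–3 (probability → odds → log-odds, and back) can be sketched as small Python functions (the names are mine):

```python
import math

def odds(p):
    """Odds ratio: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the natural log of the odds ratio."""
    return math.log(odds(p))

def inv_logit(log_odds):
    """Back from log-odds to a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(round(odds(0.8), 2))       # 4.0
print(round(logit(0.7), 2))      # 0.85  (cf. log(odds) of failing, score 1)
print(round(inv_logit(0.0), 2))  # 0.5   (even odds)
```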
• Now we have solved the upper bound
problem
– we can interpret 1.07, 2.07, 1000000.07
• But we still have the zero problem
– we cannot interpret predicted scores less
than zero

607
Step 3: The Log
• Log10 of a number(x)
log( x )
10                x
• log(10) = 1
• log(100) = 2
• log(1000) = 3
608
• log(1) = 0
• log(0.1) = -1
• log(0.00001) = -5

609
Natural Logs and e
• Don‟t use log10
– Use loge
• Natural log, ln
• Has some desirable properties, that log10
doesn‟t
–   For us
–   If y = ln(x) + c
–    dy/dx = 1/x
–   Not true for any other logarithm

610
• Be careful – calculators and stats
packages are not consistent when they
use log
– Sometimes log10, sometimes loge
– Can prove embarrassing (a friend told me)

611
Take the natural log of the odds ratio
• Goes from -∞ to +∞
– can interpret any predicted value

612
Putting them all together
• Logit transformation
– log-odds ratio
– not bounded at zero or one

613
Score            1     2     3     4     5
Fail    N        7     5     6     4     2
        P      0.7   0.5   0.6   0.4   0.2
Pass    N        3     5     4     6     8
        P      0.3   0.5   0.4   0.6   0.8
Odds (Fail)   2.33  1.00  1.50  0.67  0.25
log(odds)fail 0.85  0.00  0.41 -0.41 -1.39

614
[Plot of probability (0 to 1) against logit (-3.5 to +3.5): an S-shaped curve. Probability gets closer to zero, but never reaches it, as the logit goes down; and closer to one, but never reaches it, as the logit goes up]

615
• Hooray! Problem solved, lesson over
– errrmmm… almost
• Because we are now using log-odds
ratio, we can‟t use OLS
– we need a new technique, called Maximum
Likelihood (ML) to estimate the parameters

616
Parameter Estimation using
ML
ML tries to find estimates of model
parameters that are most likely to give
rise to the pattern of observations in
the sample data
• All gets a bit complicated
– OLS is a special case of ML
– the mean is an ML estimator

617
• Don‟t have closed form equations
– must be solved iteratively
– estimates parameters that are most likely
to give rise to the patterns observed in the
data
– by maximising the likelihood function (LF)
– except to note that sometimes, the
estimates do not converge
• ML cannot find a solution

618
Interpreting Output
Using SPSS
• Overall fit for:
– step (only used for stepwise)
– block (for hierarchical)
– model (always)
– in our model, all are the same
– χ² = 4.99, df = 1, p = 0.025
• analogous to the F test in OLS

619
Omnibus Tests of Model Coefficients

                  Chi-square    df    Sig.
Step 1   Step          4.990     1    .025
         Block         4.990     1    .025
         Model         4.990     1    .025

620
• Model summary
– -2LL (-2 log-likelihood; changes in -2LL are distributed as χ²)
– Cox & Snell R2
– Nagelkerke R2
– Different versions of R2
• No real R2 in logistic regression
• should be considered „pseudo R2‟

621
Model Summary

        -2 Log        Cox & Snell    Nagelkerke
Step    likelihood    R Square       R Square
1       64.245        .095           .127

622
• Classification Table
– predictions of model
– based on cut-off of 0.5 (by default)
– predicted values x actual values

623
Classification Table(a)

                                     Predicted
                                     PASS             Percentage
Observed                             0        1       Correct
Step 1   PASS                 0      18       8       69.2
                              1      12       12      50.0
         Overall Percentage                           60.0

a. The cut value is .500

624
Model parameters
•B
– Change in the logged odds associated with
a change of 1 unit in IV
– just like OLS regression
– difficult to interpret
• SE (B)
– Standard error
– Multiply by 1.96 to get 95% CIs

625
Variables in the Equation

                       B        S.E.    Wald
Step 1(a)   SCORE      -.467    .219    4.566
            Constant   1.314    .714    3.390
a. Variable(s) entered on step 1: SCORE.

Variables in the Equation

                                     95.0% C.I. for EXP(B)
                     Sig.   Exp(B)   Lower    Upper
Step 1(a)   score    .386   1.263    .744     2.143
            Constant .199   .323
a. Variable(s) entered on step 1: score.

626
• Constant
– i.e. score = 0
– B = 1.314
– Exp(B) = eB = e1.314 = 3.720
– OR = 3.720, p = 1 – (1 / (OR + 1))
= 1 – (1 / (3.720 + 1))
– p = 0.788

627
• Score 1
– Constant b = 1.314
– Score B = -0.467
– Exp(1.314 – 0.467) = Exp(0.847)
= 2.332
– OR = 2.332
– p = 1 – (1 / (2.332 + 1))
= 0.699

628
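The logit-to-probability arithmetic on these two slides can be packaged as a small Python function (the name is mine; the coefficients are the SPSS estimates above):

```python
import math

def predicted_probability(b0, b1, x):
    """Probability of passing for a given score: exponentiate the predicted
    log-odds to get the odds ratio, then convert to a probability."""
    odds_ratio = math.exp(b0 + b1 * x)
    return odds_ratio / (odds_ratio + 1)

# Constant = 1.314, slope for score = -0.467
print(round(predicted_probability(1.314, -0.467, 0), 3))  # 0.788
print(round(predicted_probability(1.314, -0.467, 1), 3))  # 0.7
```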
Standard Errors and CIs
• SPSS gives
– B, SE B, exp(B) by default
– Can work out 95% CI from standard error
– B ± 1.96 x SE(B)
– Or ask for it in options
• Symmetrical in B
– Non-symmetrical (sometimes very) in
exp(B)

629
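The CI arithmetic is just exp(B ± 1.96 × SE), computed on B and then exponentiated. A minimal sketch (the function name is mine; values are from the SCORE output):

```python
import math

def or_with_ci(b, se):
    """Odds ratio exp(B) with a 95% CI: symmetrical in B,
    non-symmetrical once exponentiated."""
    lower = math.exp(b - 1.96 * se)
    upper = math.exp(b + 1.96 * se)
    return math.exp(b), lower, upper

est, lo, hi = or_with_ci(-0.467, 0.219)
print(round(est, 2), round(lo, 2), round(hi, 2))  # 0.63 0.41 0.96
```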
Variables in the Equation

                                 95.0% C.I. for EXP(B)
            B       S.E.  Exp(B)   Lower    Upper
SCORE       -.467   .219  .627     .408     .962
Constant    1.314   .714  3.720
a. Variable(s) entered on step 1: SCORE.

630
• The odds of passing the test are
multiplied by 0.63 (CIs = 0.408, 0.962;
p = 0.033), for every additional point
on the aptitude test.

631
More on Standard Errors
• In OLS regression
– If a variable is added in a hierarchical fashion
– The p-value associated with the change in R2 is
the same as the p-value of the variable
– Not the case in logistic regression
• In our data 0.025 and 0.033
• Wald standard errors
– mean the p-values for the estimates are wrong – too high
– (CIs still correct)

632
• Two estimates use slightly different
information
– P-value says “what if no effect”
– CI says “what if this effect”
• Variance depends on the hypothesised ratio of the
number of people in the two groups
• Can calculate likelihood ratio based p-
values
– If you can be bothered
– Some packages provide them automatically
633
Probit Regression
• Very similar to logistic
– much more complex initial transformation
(to normal distribution)
– Very similar results to logistic (multiplied by
1.7)
• In SPSS:
– A bit weird
• Probit regression available through menus

634
– But requires data structured differently
• However
– Ordinal logistic regression is equivalent to
binary logistic
• If outcome is binary
– SPSS gives option of probit

635
Results

                        Estimate    SE      P
Logistic    Score       0.288       0.301   0.339
(binary)    Exp         0.147       0.073   0.043
Logistic    Score       0.288       0.301   0.339
(ordinal)   Exp         0.147       0.073   0.043
Probit      Score       0.191       0.178   0.282
            Exp         0.090       0.042   0.033

636
Differentiating Between Probit
and Logistic
• Depends on shape of the error term
– Normal or logistic
– Graphs are very similar to each other
• Could distinguish quality of fit
– Given enormous sample size
• Logistic = probit x 1.7
– Actually 1.6998
– Understand the distribution
– Much simpler to get back to the probability

637
[Plot comparing the logistic and normal (probit) cumulative curves, x from -3 to +3: the two curves are almost indistinguishable]

638
Infinite Parameters
• Non-convergence can happen because
of infinite parameters
– Insoluble model
• Three kinds:
• Complete separation
– The groups are completely distinct
• Pass group all score more than 10
• Fail group all score less than 10

639
• Quasi-complete separation
– Separation with some overlap
• Pass group all score 10 or more
• Fail group all score 10 or less
• Both cases:
– No convergence
• Close to this
– Curious estimates
– Curious standard errors

640
• Categorical Predictors
– Can cause separation
– Esp. if correlated
• Need people in every cell

              Male                      Female
              White     Non-White      White     Non-White
Below
Poverty
Line
Above
Poverty
Line
                                                            641
Logistic Regression and
Diagnosis
• Logistic regression can be used for diagnostic
tests
– For every score
• Calculate probability that result is positive
• Calculate proportion of people with that score (or lower)
who have a positive result
• Calculate c statistic
– Measure of discriminative power
– %age of all possible cases, where the model gives
a higher probability to a correct case than to an
incorrect case
642
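The pairwise definition of the c-statistic can be computed directly. A small sketch (the function name and example data are mine, not the bank data):

```python
def c_statistic(probs, outcomes):
    """Proportion of (positive, negative) pairs in which the positive case
    receives the higher predicted probability; ties count as half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

probs = [0.9, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0]
print(c_statistic(probs, outcomes))  # 5/6 of pairs correctly ordered
```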
– Perfect c-statistic = 1.0
– Random c-statistic = 0.5
• SPSS doesn‟t do it automatically
– But easy to do
• Save probabilities
– Use Graphs, ROC Curve
– Test variable: predicted probability
– State variable: outcome

643
Sensitivity and Specificity
• Sensitivity:
– Probability of saying someone has a
positive result –
• If they do: p(pos)|pos
• Specificity
– Probability of saying someone has a
negative result
• If they do: p(neg)|neg

644
Calculating Sens and Spec
• For each value
– Calculate
• proportion of minority earning less – p(m)
• proportion of non-minority earning less – p(w)
– Sensitivity (value)
• p(m)
– 1 - Specificity (value)
• p(w)

645
Salary   P(minority)
10         .39
20         .31
30         .23
40         .17
50         .12
60         .09
70         .06
80         .04
90         .03
646
Using Bank Data
• Predict minority group, using salary
(000s)
– Logit(minority) = -0.044 - 0.039 × salary
• Find actual proportions

647
ROC Curve

[ROC curve: Sensitivity (0.0–1.0) plotted against 1 - Specificity (0.0–1.0); the area under the curve is the c-statistic. Diagonal segments are produced by ties.]
648
• Multinomial Logistic Regression – more
than two categories in DV
– same procedure
– one category chosen as reference group
• odds of being in category other than reference
• Polytomous Logit Universal Models
(PLUM)
– Ordinal multinomial logistic regression
– For ordinal outcome variables
649
Final Thoughts
• Logistic Regression can be extended
– dummy variables
– non-linear effects
– interactions (even though we don‟t cover
them until the next lesson)
• Same issues as OLS
– collinearity
– outliers

650
651
652
Lesson 12: Mediation and Path
Analysis

653
Introduction
• Moderator
– Level of one variable influences effect of another
variable
• Mediator
– One variable influences another via a third
variable
• All relationships are really mediated
– are we interested in the mediators?
– can we make the process more explicit
654
• In examples with bank

education → beginning salary

• Why?
– What is the process?
– Are we making assumptions about the
process?
– Should we test those assumptions?
655
[Path diagram: education → beginning salary, mediated by job skills, expectations, negotiating skills, and kudos for bank]

656
656
Direct and Indirect Influences
X may affect Y in two ways
• Directly – X has a direct (causal)
influence on Y
– (or maybe mediated by other variables)
• Indirectly – X affects Y via a mediating
variable - M

657
• e.g. how does going to the pub effect
comprehension on a Summer school
course
– on, say, regression
[Path diagram: Having fun in pub in evening → books on regression → less knowledge; "Anything here?" asks whether a direct path remains]
658
[Same diagram with a second mediator added: Having fun in pub in evening → fatigue → less knowledge; "Still needed?" asks whether the books-on-regression path is still needed]
659
• Mediators needed
– to cope with more sophisticated theories in
the social sciences
– to examine processes
– to examine direct and indirect influences

660
Detecting Mediation

661
4 Steps
From Baron and Kenny (1986)
• To establish that the effect of X on Y is
mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If effect of X controlling for M is zero, M
is complete mediator of the relationship
•   (3 and 4 in same analysis)
662
Example: Book habits

[Diagram: Enjoy → Buy → Read]

663
Three Variables
• Enjoy
– How much an individual enjoys books
• Buy
– How many books an individual buys (in a
year)
• Read
– How many books an individual reads (in a
year)
664
            ENJOY    BUY    READ
ENJOY        1.00   0.64    0.73

665
• The Theory

666
• Step 1
1. Show that X (enjoy) predicts Y (read)
– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK

667
2. Show that X (enjoy) predicts M (buy)
– b1 = 0.974, p < 0.001
– standardised b1 = 0.643
– OK

668
3. Show that M (buy) predicts Y (read),
controlling for X (enjoy)
– b1 = 0.469, p < 0.001
– standardised b1 = 0.206
– OK

669
4. If effect of X controlling for M is zero,
M is complete mediator of the
relationship
– (Same as analysis for step 3.)
– b2 = 0.287, p = 0.001
– standardised b2 = 0.431
– Hmmmm…
•   Significant, therefore not a complete mediator

670
[Path diagram: enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3); direct path enjoy → read = 0.287 (step 4)]
671
The Mediation Coefficient
• Amount of mediation =
Step 1 – Step 4
=0.487 – 0.287
= 0.200
• OR
Step 2 x Step 3
=0.974 x 0.206
= 0.200
672
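The two ways of computing the mediated effect give the same answer (to rounding). The course does this arithmetic in SPSS/Excel; the following Python sketch is purely illustrative, using the coefficients from the slides:

```python
# Indirect (mediated) effect, two equivalent ways (values from the slides).
c = 0.487        # step 1: total effect of enjoy on read
c_prime = 0.287  # step 4: direct effect, controlling for buy
a = 0.974        # step 2: enjoy -> buy
b = 0.206        # step 3: buy -> read, controlling for enjoy

difference_method = c - c_prime   # difference in coefficients
product_method = a * b            # product of paths

# The two methods agree to about 0.001 with these rounded inputs.
print(round(difference_method, 3), round(product_method, 3))
```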
SE of Mediator
[Path diagram: a = enjoy → buy (from step 2); b = buy → read (from step 3)]

• sa = se(a)
• sb = se(b)

673
• Sobel test
– Standard error of mediation coefficient can
be calculated

se  b s + a s - s s
2 2
a
2 2
b
2 2
a b
a = 0.974              b = 0.206
sa = 0.189             sb = 0.054

674
• Indirect effect = 0.200
– se = 0.056
– t =3.52, p = 0.001
• Online Sobel test:
http://www.unc.edu/~preacher/sobel/sobel.htm
– (Won‟t be there for long; probably will be
somewhere else)

675
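The Sobel standard error can be checked by hand. A Python sketch using the rounded values from the slides (because a, b and their standard errors are rounded, the result comes out near 0.065, a little different from the slide's 0.056, which was presumably computed from unrounded estimates):

```python
import math

# Sobel (Goodman variant) standard error of the indirect effect a*b.
a, se_a = 0.974, 0.189   # enjoy -> buy (step 2) and its SE
b, se_b = 0.206, 0.054   # buy -> read (step 3) and its SE

indirect = a * b
se = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2 - se_a**2 * se_b**2)
z = indirect / se
print(round(indirect, 3), round(se, 3), round(z, 2))
```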
A Note on Power
• Recently
– Move in methodological literature away from this
conventional approach
– Problems of power:
– Several tests, all of which must be significant
• Effective Type I error rate = 0.05 × 0.05 = 0.0025
• Such a conservative test must reduce power
– Bootstrapping suggested as alternative
• See Paper B7, A4, B9
• B21 for SPSS syntax
676
677
678
Lesson 13: Moderators in
Regression
“different slopes for different
folks”

679
Introduction
• Moderator relationships have many
different names
– interactions (from ANOVA)
– multiplicative
– non-linear (just confusing)
• All talking about the same thing

680
A moderated relationship occurs
• when the effect of one variable
depends upon the level of another
variable

681
• Hang on …
– That seems very like a nonlinear relationship
– Moderator
• Effect of one variable depends on level of another
– Non-linear
• Effect of one variable depends on level of itself
• Where there is collinearity
– Can be hard to distinguish between them
– Paper in handbook (B5)
– Should (usually) compare effect sizes

682
• e.g. How much it hurts when I drop a
computer on my foot depends on
– x1: how much alcohol I have drunk
– x2: how high the computer was dropped
from
– but if x1 is high enough
– x2 will have no effect

683
• e.g. Likelihood of injury in a car
accident
– depends on
– x1: speed of car
– x2: if I was wearing a seatbelt
– but if x1 is low enough
– x2 will have no effect

684
30

25

20
Injury

15

10

5

0

5   15         25      35         45
Speed (mph)

Seatbelt        No Seatbelt

685
• e.g. number of words (from a list) I can
remember
– depends on
– x1: type of words (abstract, e.g. „justice‟, or
concrete, e.g. „carrot‟)
– x2: Method of testing (recognition – i.e.
multiple choice, or free recall)
– but if using recognition
– x1: will not make a difference

686
• We looked at three kinds of moderator
• alcohol x height = pain
– continuous x continuous
• speed x seatbelt = injury
– continuous x categorical
• word type x test type
– categorical x categorical
• We will look at them in reverse order

687
How do we know to look for
moderators?
• Theoretical rationale
– Often the most powerful source of moderator effects
– Few theories predict moderator effects
• Presence of heteroscedasticity
– Clue there may be a moderated
relationship missing                    688
Two Categorical Predictors

689
Data
• 2 IVs
– word type (concrete [1], abstract [2])
– test method (recog [1], recall [2])
• 20 Participants in one of four groups
–   1,   1
–   1,   2
–   2,   1
–   2,   2
• 5 per group
• lesson12.1.sav

690
                Concrete   Abstract   Total
Recog   Mean      15.40      15.20    15.30
        SD         2.19       2.59     2.26
Recall  Mean      15.60       6.60    11.10
        SD         1.67       7.44     6.95
Total   Mean      15.50      10.90    13.20
        SD         1.84       6.94     5.47

691
• Graph of means
[Line plot: mean score by TEST (1.00, 2.00) for WORDS = 1.00 and 2.00; scores range from about 6 to 16, with the abstract line dropping sharply for recall]
692
ANOVA Results
• Standard way to analyse these data
would be to use ANOVA
– Words: F=6.1, df=1, 16, p=0.025
– Test: F=5.1, df=1, 16, p=0.039
– Words x Test: F=5.6, df=1, 16, p=0.031

693
Procedure for Testing
1: Convert to effect coding
• can use dummy coding, collinearity is
less of an issue
• doesn‟t make any difference to
substantive interpretation
2: Calculate interaction term
• In ANOVA interaction is automatic
• In regression we create an interaction
variable                                 694
• Interaction term (wxt)
– multiply effect coded variables together

word           test            wxt
-1             -1              1
1             -1             -1
-1             1              -1
1             1               1

695
3: Carry out regression
• Hierarchical
– linear effects first
– interaction effect in next block

696
• b0=13.2
• b1 (words) = -2.3, p=0.025
• b2 (test) = -2.1, p=0.039
• b3 (words x test) = -2.2, p=0.031
• Might need to use change in R2 to test
sig of interaction, because of collinearity
What do these mean?
• b0 (intercept) = predicted value of Y
(score) when all X = 0
– i.e. the central point

697
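Because the design is balanced (5 per cell) and the ±1 effect codes are orthogonal, the four coefficients can be recovered directly from the cell means. The course does this in SPSS; a pure-Python sketch for illustration:

```python
# Effect codes for the four cells: (word, test, cell mean).
# word: concrete = -1, abstract = 1; test: recog = -1, recall = 1.
cells = [(-1, -1, 15.4), (-1, 1, 15.6), (1, -1, 15.2), (1, 1, 6.6)]

# With balanced data and orthogonal +/-1 codes, each coefficient is just
# the average of code * cell mean.
b0 = sum(y for _, _, y in cells) / 4             # grand mean
b1 = sum(w * y for w, _, y in cells) / 4         # words
b2 = sum(t * y for _, t, y in cells) / 4         # test
b3 = sum(w * t * y for w, t, y in cells) / 4     # words x test

print(round(b0, 1), round(b1, 1), round(b2, 1), round(b3, 1))  # 13.2 -2.3 -2.1 -2.2
```

These match the regression output on the slide: b0 = 13.2, b1 = −2.3, b2 = −2.1, b3 = −2.2.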
• b0 = 13.2
– grand mean
• b1 = -2.3
– distance from grand to mean for two word
types
– 13.2 – (-2.3) = 15.5
– 13.2 + (-2.3) = 10.9

Concrete Abstract Total
Recog       15.40     15.20    15.30
Recall      15.60      6.60    11.10
Total      15.50     10.90    13.20
698
• b2 = -2.1
– distance from grand mean to recog and
recall means
• b3 = -2.2
– to understand b3 we need to look at
predictions from the equation without this
term
Score = 13.2 + (-2.3)×w + (-2.1)×t

699
Score = 13.2 + (-2.3)×w + (-2.1)×t
• So for each group we can calculate an
expected value

700
b1 = -2.3, b2 = -2.1

W    T     Word   Test            Expected Value

C   Cog     -1     -1    13.2 + (-2.3) x (-1) + (-2.1) x -1

C   Call    -1     1       13.2 + (-2.3) x (-1) + (-2.1) x 1

A   Cog     1      -1      13.2 + (-2.3) x 1 + (-2.1) x (-1)

A   Call    1      1        13.2 + (-2.3) x 1 + (-2.1) x 1

701
W   T     Word   Test   Exp    Actual Value
C   Cog    -1     -1    17.6       15.4
C   Call   -1      1    13.4       15.6
A   Cog     1     -1    13.0       15.2
A   Call    1      1     8.8        6.6

• The exciting part comes when we look
at the differences between the actual
value and the value in the 2 IV model
702
• Each difference = 2.2 (or –2.2)
• The value of b3 was –2.2
– the interaction term is the correction
required to the slope when the second IV
is included

703
• Examine the slope for word type

[Plot: mean score against Test Type, Recog (-1) to Recall (1); slope = (11.1 - 15.3) / 2 = -2.1]

704
• Add the slopes for the two word groups

[Plot: both word groups' average slope -2.1; concrete slope = (15.6 - 15.4)/2 = 0.1; abstract slope = (6.6 - 15.2)/2 = -4.3; x-axis from Recog (-1) to Recall (1)]
705
b associated with interaction
• the change in slope, away from the
average, associated with a 1 unit
change in the moderating variable
OR
• Half the difference in the slopes

706
• Another way to look at it
Y = 13.2 + -2.3w + -2.1t + -2.2wt
• Examine concrete words group (w = -1)
– substitute values into the equation

Y(concrete) = 13.2 + (-2.3)×(-1) + (-2.1)×t + (-2.2)×(-1)×t
Y(concrete) = 13.2 + 2.3 + (-2.1)×t + 2.2×t
Y(concrete) = 15.5 + 0.1×t
• The effect of changing test type for concrete
words (the slope, which is half the actual
difference)
707
Why go to all that effort? Why not do
ANOVA in the first place?
1. That is what ANOVA actually does
•   and regression can handle an unbalanced
design (i.e. different numbers of people in
each group)
•   Helps to understand what can be done
with ANOVA
•   SPSS uses regression to do ANOVA
2. Helps to clarify more complex cases
•   as we shall see

708
Categorical x Continuous

709
Note on Dichotomisation
• Very common to see people dichotomise
a variable
– Makes the analysis easier, but throws away information
• Paper B6

710
Data
A chain of 60 supermarkets
• examining the relationship between
profitability, shop size, and local
competition
• 2 IVs
– shop size
– comp (local competition, 0=no, 1=yes)
• DV
– profit
711
• Data, „lesson 12.2.sav‟
Shopsize   Comp   Profit
    4       1       23
   10       1       25
    7       0       19
   10       0        9
   10       1       18
   29       1       33
   12       0       17
    6       1       20
   14       0       21
   62       0        8
712
1st Analysis
Two IVs
• R2=0.367, df=2, 57, p < 0.001
• Unstandardised estimates
– b1 (shopsize) = 0.083 (p=0.001)
– b2 (comp) = 5.883 (p<0.001)
• Standardised estimates
– b1 (shopsize) = 0.356
– b2 (comp) = 0.448
713
• Suspicions
– Presence of competition is likely to have an
effect
– Residual plot shows a little
heteroscedasticity
[Scatterplot: standardised residuals (-3 to 3) against standardised predicted values (-2.0 to 2.0)]
714
Procedure for Testing
• Very similar to last time
– convert „comp‟ to effect coding
– -1 = No competition
– 1 = competition
– Compute interaction term
• comp (effect coded) x size
– Hierarchical regression

715
Result
• Unstandardised estimates
– b1 (shopsize) = 0.071 (p=0.006)
– b2 (comp) = -1.67 (p = 0.506)
– b3 (sxc) = -0.050 (p=0.050)
• Standardised estimates
– b1 (shopsize) = 0.306
– b2 (comp) = -0.127
– b3 (sxc) = -0.389
716
• comp now non-significant
– shows importance of hierarchical
– it obviously is important

717
Interpretation
• Draw graph with lines of best fit
– drawn automatically by SPSS
• Interpret equation by substitution of
values
– evaluate effects of
• size
• competition

718
[Scatterplot: Profit (0-40) against Shopsize (0-100), with fitted lines for Competition, No competition, and All Shops]

719
• Effects of size
– in presence and absence of competition
– (can ignore the constant)
Y=x10.071 + x2(-1.67) + x1x2 (-0.050)
– Competition present (x2 = 1)
Y=x10.071 + 1(-1.67) + x11 (-0.050)
Y=x10.071 + -1.67 + x1(-0.050)
Y=x1 0.021                 + (–1.67)

720
Y=x10.071 + x2(-1.67) + x1x2 (-0.050)
– Competition absent (x2 = -1)
Y=x10.071 + -1(-1.67) + x1-1 (-0.050)
Y=x1 0.071 + x1-1 (-0.050) + -1(-1.67)
Y= x1 0.121 (+ 1.67)

721
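The substitution above can be wrapped in a tiny function. A Python sketch with the coefficients from the slides (the function name `profit_slope` is made up for illustration):

```python
# Simple slope of profit on shop size at a given effect-coded value of
# competition (1 = competition, -1 = no competition). Coefficients from slides.
def profit_slope(comp):
    b_size, b_interaction = 0.071, -0.050
    return b_size + b_interaction * comp

print(round(profit_slope(1), 3))    # with competition: 0.021
print(round(profit_slope(-1), 3))   # without competition: 0.121
```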
Two Continuous Variables

722
Data
• Bank Employees
– only using clerical staff
– 363 cases
– predicting starting salary
– previous experience
– age
– age x experience

723
• Correlation matrix
– only one significant

LOGSB AGESTART PREVEXP
LOGSB       1.00 -0.09     0.08
AGESTART   -0.09  1.00     0.77
PREVEXP     0.08  0.77     1.00

724
Initial Estimates (no moderator)
• (standardised)
– R2 = 0.061, p<0.001
– Age at start = -0.37, p<0.001
– Previous experience = 0.36, p<0.001
• Suppressing each other
– Age and experience compensate for one
another
– Older, with no experience, bad
– Younger, with experience, good

725
The Procedure
• Very similar to previous
– create multiplicative interaction term
– BUT
• Need to eliminate effects of means
– cause massive collinearity
• and SDs
– cause one variable to dominate the
interaction term
• By standardising
726
• To standardise x,
– subtract mean, and divide by SD
– re-expresses x in terms of distance from
the mean, in SDs
– ie z-scores
• Hint: automatic in SPSS in Descriptives
• Create interaction term of age and exp
– axe = z(age) × z(exp)

727
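Standardising and forming the product term takes only a few lines. A Python sketch (illustrative; SPSS's Descriptives does the z-scoring automatically, and the data values below are made up):

```python
import statistics

def z_scores(xs):
    """Convert a list of values to z-scores: subtract mean, divide by SD."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

age = [25, 32, 41, 28, 55, 37]   # made-up values for illustration
exp = [2, 6, 10, 3, 20, 8]

z_age, z_exp = z_scores(age), z_scores(exp)
axe = [a * e for a, e in zip(z_age, z_exp)]   # the interaction term
```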
• Hierarchical regression
– two linear effects first
– moderator effect in second
– hint: it is often easier to interpret if
standardised versions of all variables are
used

728
• Change in R2
– 0.085, p<0.001
• Estimates (standardised)
– b1 (exp) = 0.104
– b2 (agestart) = -0.54
– b3 (age x exp) = -0.54

729
Interpretation 1: Pick-a-Point
• Graph is tricky
– can‟t have two continuous variables
– Choose specific points (pick-a-point)
• Graph the line of best fit of one variable at
chosen values of the other
– Two ways to pick a point
• 1: Choose high (z = +1), medium (z = 0) and
low (z = -1)
• 2: Choose 'sensible' values – age 20, 50, 80?

730
• We know:
– Y = e×0.10 + a×(-0.54) + a×e×(-0.54)
– Where a = agestart, and e = experience
• We can rewrite this as:
– Y = (e×0.10) + (a×(-0.54)) + (a×e×(-0.54))
– Take a out of the brackets
– Y = (e×0.10) + (-0.54 + e×(-0.54))×a
• Bracketed terms are simple intercept and simple
slope
– β0 = (e×0.10)
– β1 = (-0.54 + e×(-0.54))
– Y = β0 + β1×a

731
• Pick any value of e, and we know the slope
for a
– Standardised, so it's easy
• e = -1
– β0 = (-1×0.10) = -0.10
– β1 = (-0.54 + (-1)×(-0.54)) = 0.00
• e = 0
– β0 = (0×0.10) = 0
– β1 = (-0.54 + 0×(-0.54)) = -0.54
• e = 1
– β0 = (1×0.10) = 0.10
– β1 = (-0.54 + 1×(-0.54)) = -1.08

732
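The simple intercepts and slopes follow directly from the equation; a Python sketch using the coefficients from the slides (`simple_line` is a made-up helper name):

```python
# Simple intercept and slope of log(salary) on standardised age, at a
# chosen value of standardised experience e. Coefficients from the slides.
def simple_line(e):
    b_exp, b_age, b_axe = 0.10, -0.54, -0.54
    intercept = b_exp * e
    slope = b_age + b_axe * e
    return round(intercept, 2), round(slope, 2)

for e in (-1, 0, 1):
    print(e, simple_line(e))
```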
Graph the Three Lines
[Plot: three fitted lines of log(salary) against standardised age, for e = -1, e = 0 and e = 1; the slope becomes more negative as experience increases]
733
Interpretation 2: P-Values and CIs

• Second way
• Calculate CIs of the slope
– At any point
• Calculate p-value
– At any point
• Give ranges of significance

734
What do you need?
• The variance and covariance of the
estimates
– SPSS doesn‟t provide estimates for
intercept
– Need to do it manually
• In options, exclude intercept
– Create intercept – c = 1
– Use it in the regression

735
• Enter information into web page:
– www.unc.edu/~preacher/interact/acov.htm
– (Again, may not be around for long)
• Get results
• Calculations in Bauer and Curran (in
press: Multivariate Behavioral Research)
– Paper B13

736
MLR 2-Way Interaction Plot
[Plot: Y (about 4.0 to 4.5) against X (-1.0 to 1.0), with separate lines for three conditional values of z1]
737
Areas of Significance
[Plot: simple slope against Experience (-4 to 4) with confidence bands; regions where the bands exclude zero are the areas of significance]
738
• 2 complications
– 1: Constant differed
– 2: DV was logged, hence non-linear
• effect of 1 unit depends on where the unit is
– Can use SPSS to do graphs showing lines
of best fit for different groups
– See paper A2

739
Finally …

740
Unlimited Moderators
• Moderator effects are not limited to
– 2 variables
– linear effects

741
Three Interacting Variables
• Age, Sex, Exp
• Block 1
– Age, Sex, Exp
• Block 2
– Age x Sex, Age x Exp, Sex x Exp
• Block 3
– Age x Sex x Exp

742
• Results
– All two way interactions significant
– Three way not significant
– Effect of Age depends on sex
– Effect of experience depends on sex
– Size of the age x experience interaction
does not depend on sex (phew!)

743
Moderated Non-Linear
Relationships

• Enter non-linear effect
• Enter non-linear effect x moderator
– if significant indicates degree of non-
linearity differs by moderator

744
745
Modelling Counts: Poisson
Regression
Lesson 14

746
Counts and the Poisson
Distribution
• Von Bortkiewicz (1898)
– Numbers of Prussian soldiers kicked to
death by horses

Deaths   Frequency
  0         109
  1          65
  2          22
  3           3
  4           1
  5           0

[Bar chart of these frequencies]

747
• The data fitted a Poisson probability distribution
– When counts of events occur, a poisson distribution is
common
• e.g. number of murders, ship accidents
• Common approach
– Log transform and treat as normal
• Problems
– Censored at 0
– Integers only allowed
– Heteroscedasticity

748
The Poisson Distribution
[Plot: Poisson probability distributions over counts 0-17 for means 0.5, 1, 4 and 8; larger means give flatter, more symmetric distributions]
749
exp(   )    y
p ( y | x) 
y!

750
p(y | x) = exp(-μ) μ^y / y!

• Where:
– y is the count
– μ is the mean of the poisson distribution
• In a poisson distribution
– The mean = the variance (hence the
heteroscedasticity issue)
– μ = σ²
751
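Applying the formula to von Bortkiewicz's horse-kick data shows why it became the classic Poisson example: 122 deaths over 200 observations gives a mean of 0.61, and the pmf reproduces the observed frequencies closely. A Python sketch:

```python
import math

def poisson_pmf(y, mu):
    """p(y) = exp(-mu) * mu**y / y!"""
    return math.exp(-mu) * mu**y / math.factorial(y)

observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1, 5: 0}
n = sum(observed.values())                         # 200 observations
mu = sum(y * f for y, f in observed.items()) / n   # 0.61 deaths on average

for y, f in observed.items():
    # Observed frequency vs Poisson-expected frequency (n * p(y)).
    print(y, f, round(n * poisson_pmf(y, mu), 1))
```

Expected frequencies come out at roughly 108.7, 66.3, 20.2, 4.1, 0.6 and 0.1, close to the observed column.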
Poisson Regression in SPSS
• Not directly available
– SPSS can be tweaked to do it in three ways:
– General loglinear model (genlog)
– Non-linear regression (CNLR)
• Bootstrapped p-values only

– Both are quite tricky
• SPSS 15,

752
Example Using Genlog
• Number of shark bites on different
colour surfboards
– 100 surfboards, 50 red, 50 blue
• Weight cases by bites
• Analyse, Loglinear, General
– Colour is factor
[Bar chart: frequency of boards by number of bites (0-4), for blue and red]
753
Results
Correspondence Between Parameters and
Terms of the Design
Parameter   Aliased Term

1    Constant
2    [COLOUR = 1]
3 x [COLOUR = 2]
Note: 'x' indicates an aliased (or a
redundant) parameter. These parameters
are set to zero.

754
                                       Asymptotic 95% CI
Param      Est.        SE      Z-value   Lower    Upper
1         4.1190     .1275     32.30      3.87     4.37
2         -.5495     .2108     -2.61      -.96     -.14
3          .0000     .           .         .        .

• Note: Intercept (param 1) is curious
• Param 2 is the difference in the means
755
SPSS: Continuous Predictors
• Bleedin‟ nightmare
ils.cfm?tech_tan_id=100006204

756
Poisson Regression in Stata
• SPSS will save a Stata file
• Open it in Stata
• Statistics, Count outcomes, Poisson
regression

757
Poisson Regression in R
• R is a freeware program
– Similar to SPlus
– www.r-project.org
• Much nicer to do Poisson (and other) regression
analysis
http://www.stat.lsa.umich.edu/~faraway/book
/
http://www.jeremymiles.co.uk/regressionbook
/extras/appendix2/R/
758
• Commands in R
• Stage 1: enter data
– colour <- c(1, 0, 1, 0, 1, 0 … 1)
– bites <- c(3, 1, 0, 0, … )
• Run analysis
– p1 <- glm(bites ~ colour, family
= poisson)
• Get results
– summary.glm(p1)

759
R Results
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3567      0.1686 -2.115 0.03441 *
colour        0.5555     0.2116   2.625 0.00866 **

• Results for colour
– Same as SPSS
– For intercept different (weird SPSS)

760
Predicted Values
• Need to get exponential of parameter
estimates
– Like logistic regression
• Exp(0.555) = 1.74
– You are likely to be bitten by a shark 1.74
times more often with a red surfboard

761
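Exponentiating the slope to get a rate ratio is a one-liner; a quick Python check of the slide's number:

```python
import math

b_colour = 0.5555                 # Poisson regression slope for surfboard colour
rate_ratio = math.exp(b_colour)   # multiplicative change in expected bite count
print(round(rate_ratio, 2))       # 1.74 times as many bites
```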
Checking Assumptions
• Was it really poisson distributed?
– For Poisson, μ = σ²
• As mean increases, variance should also
increase
– Residuals should be random
• Overdispersion is a common problem
• Too many zeroes
• For blue: μ = σ² = exp(-0.3567) = 0.70
• For red: μ = σ² = exp(-0.3567 + 0.5555) = 1.22
762
exp(   )    y
p ( y | x) 
y!

• Strictly:
exp(   ) 
ˆ ˆ     y
p( yi | xi ) 
y!

763
Compare Predicted with Actual
Distributions

[Bar charts comparing expected and actual probabilities of 0-4 bites, separately for blue and red surfboards]
764
Overdispersion
• Problem in poisson regression
– Too many zeroes
• Causes
– χ² inflation
– Standard error deflation
• Hence p-values too low
– Higher type I error rate
• Solution
– Negative binomial regression

765
Using R
• R can read an SPSS file
– But you have to ask it nicely
– Load the package: Packages, Load package,
choose "Foreign"
• Click File, Change Dir
– Change to the folder that contains your
data

766
More on R
• R uses objects
– To place something into an object use <-
– X <- Y
• Puts Y into X
• Variables are then referred to as
mydata$VAR1
– Note 1: R is case sensitive
– Note 2: SPSS variable name in capitals
767
GLM in R
• Command
– glm(outcome ~ pred1 + pred2 + … +
predk [,family = familyname])
– If no familyname, default is OLS
• Use binomial for logistic, poisson for poisson
• Output is a GLM object
– You need to give this a name
– my1stglm <- glm(outcome ~ pred1 +
pred2 + … + predk [,family =
familyname])                                       768
• Then need to explore the result
– summary(my1stglm)
• To explore what it means
– Need to plot regressions
• Easiest is to use Excel

769
770
Introducing Structural
Equation Modelling
Lesson 15

771
Introduction
• Related to regression analysis
– All (OLS) regression can be considered as a
special case of SEM
• Power comes from adding restrictions
to the model
• SEM is a system of equations
– Estimate those equations

772
Regression as SEM
– Grade = constant + books + attend +
error
• Looks like a regression equation
– Also
– Books correlated with attend
– Explicit modelling of error

773
Path Diagram
• System of equations are usefully
represented in a path diagram

[Path diagram conventions:]
x (rectangle) — measured variable
e (circle) — unmeasured variable
single-headed arrow — regression
double-headed arrow — correlation
774
Path Diagram for Regression
[Path diagram: Books and Attend → Grade; the error term must usually be modelled explicitly, and the Books–Attend correlation must be modelled explicitly]
775
Results
• Unstandardised
[Path diagram with unstandardised estimates: paths to GRADE from BOOKS 4.04, ATTEND 1.28 and error e 13.52; BOOKS–ATTEND covariance 17.84; variances 2.00 and 1.00]
776
Standardised
[Path diagram with standardised estimates: BOOKS → GRADE .35; ATTEND → GRADE .33; error → GRADE .82]

777
Table

SEM estimates:
                      Estimate    S.E.    C.R.     P     St. Est.
GRADE  <-- BOOKS        4.04      1.71    2.36    0.02     0.35
GRADE  <-- ATTEND       1.28      0.57    2.25    0.03     0.33
GRADE  <-- e           13.52      1.53    8.83    0.00     0.82

SPSS Coefficients (for comparison):
                  B      Std. Error    Beta    Sig.
(Constant)      37.38       7.74                .00
BOOKS            4.04       1.75        .35     .03
ATTEND           1.28        .59        .33     .04
So What Was the Point?
• Regression is a special case
• Lots of other cases
• Power of SEM
– Power to add restrictions to the model
• Restrict parameters
– To zero
– To the value of other parameters
– To 1
779
Restrictions
• Questions
– Is a parameter really necessary?
– Are a set of parameters necessary?
– Are parameters equal
• Each restriction adds 1 df
– Test of model with χ²

780
The χ² Test
• Can the model proposed have
generated the data?
– Test of significance of the difference between
model and data
– A statistically significant result means the model does not fit
– Restrictions should be theoretically driven

781
Regression Again
[Path diagram: BOOKS and ATTEND → GRADE, with error e fixed at (0, 1)]

• Both estimates restricted to zero

782
• Two restrictions
– 2 df for χ² test
– χ² = 15.9, p = 0.0003
• This test is (asymptotically) equivalent
to the F test in regression
– We still haven‟t got any further

783
Multivariate Regression

[Path diagram: x1 and x2 each predicting y1, y2 and y3]

784
Test of all x’s on all y’s
(6 restrictions = 6 df)

[Path diagram: all six x → y paths restricted to zero]
785
Test of all x1 on all y’s
(3 restrictions)

[Path diagram: the three paths from x1 restricted to zero]
786
Test of all x1 on all y1
(3 restrictions)

[Path diagram for this set of restrictions]

787
Test of all 3 partial correlations between
y’s, controlling for x’s
(3 restrictions)

[Path diagram: residual correlations among y1, y2 and y3 restricted to zero]

788
Path Analysis and SEM
• More complex models mean more restrictions
– E.g. the mediator model
• 1 restriction
– No direct path from ENJOY to READ
[Path diagram: ENJOY → BUY → READ, error variances fixed at 1]
789
Result
• χ² = 10.9, 1 df, p = 0.001
• Not a complete mediator

790
Multiple Groups
• Same model
– Different people
• Equality constraints between groups
– Means, correlations, variances, regression
estimates
– E.g. males and females

791
Multiple Groups Example
• Age
• Severity of psoriasis
– SEVE – in emotional areas
• Hands, face, forearm
– SEVNONE – in non-emotional areas
– Anxiety
– Depression

792
Correlations (SEX = f; N = 110 throughout; Sig. 2-tailed in brackets)

          AGE           SEVE          SEVNONE       GHQ_A         GHQ_D
AGE       1             -.270 (.004)  -.248 (.009)   .017 (.859)   .035 (.717)
SEVE      -.270 (.004)  1              .665 (.000)   .045 (.639)   .075 (.436)
SEVNONE   -.248 (.009)   .665 (.000)  1              .109 (.255)   .096 (.316)
GHQ_A      .017 (.859)   .045 (.639)   .109 (.255)  1              .782 (.000)
GHQ_D      .035 (.717)   .075 (.436)   .096 (.316)   .782 (.000)  1

793
Correlations (SEX = m; N = 79 throughout; Sig. 2-tailed in brackets)

          AGE           SEVE          SEVNONE       GHQ_A         GHQ_D
AGE       1             -.243 (.031)  -.116 (.310)  -.195 (.085)  -.190 (.094)
SEVE      -.243 (.031)  1              .671 (.000)   .456 (.000)   .453 (.000)
SEVNONE   -.116 (.310)   .671 (.000)  1              .210 (.063)   .232 (.040)
GHQ_A     -.195 (.085)   .456 (.000)   .210 (.063)  1              .800 (.000)
GHQ_D     -.190 (.094)   .453 (.000)   .232 (.040)   .800 (.000)  1

794
Model
[Path diagram: AGE → SEVE and SEVNONE (errors e_s, e_sn fixed at 1); SEVE and SEVNONE → Dep and Anx (errors E_d, e_a fixed at 1)]

795
Females
[Path diagram, females, standardised: AGE → SEVE -.27, AGE → SEVNONE -.25; paths from severity to Dep and Anx all small (between -.04 and .15); SEVE–SEVNONE residual correlation .64; Dep–Anx residual correlation .78]
796
Males
[Path diagram, males, standardised: AGE → SEVE -.24, AGE → SEVNONE -.12; paths from SEVE to Dep and Anx around .52-.55; SEVNONE paths small and negative (-.12, -.17); SEVE–SEVNONE residual correlation .67; Dep–Anx residual correlation .74]
797
Constraint
• sevnone -> dep
– Constrained to be equal for males and
females
• 1 restriction, 1 df
– χ² = 1.3 – not significant
• 4 restrictions
– both severity variables -> anx & dep

798
• 4 restrictions, 4 df
– χ² = 1.3, p = 0.014
• Parameters are not equal

799

• SEM programs tend to deal with missing
data
– Multiple imputation
– Full Information (Direct) Maximum
Likelihood
• Asymptotically equivalent
• Data can be MAR, not just MCAR

800
• Power for regression gets tricky with
large models
• With SEM power is (relatively) easy
– It‟s all based on chi-square
– Paper B14

801
Lesson 16: Dealing with clustered
data & longitudinal models

802
The Independence
Assumption
• In Lesson 8 we talked about independence
– The residual of any one case should not tell you
about the residual of any other case
• Particularly problematic when:
– Data are clustered on the predictor variable
• E.g. predictor is household size, cases are members of
family
• E.g. Predictor is doctor training, outcome is patients of
doctor
– Data are longitudinal
• Have people measured over time
– It‟s the same person!
803
Clusters of Cases
• Problem with cluster (group)
randomised studies
– Or group effects
• Use Huber-White sandwich estimator
– Tell it about the groups
– Use complex samples in SPSS

804
Complex Samples
• As with Huber-White for heteroscedasticity
– Put it into clusters
• Run GLM
– As before
• Warning:
– Need about 20 clusters for solutions to be stable

805
Example
• People randomised by week to one of two
forms of triage
– Compare the total cost of treating each
• Ignore clustering
– Difference is £2.40 per person, with 95%
confidence intervals £0.58 to £4.22, p =0.010
• Include clustering
– Difference is still £2.40, with 95% CIs -£0.85 to
£5.65, and p = 0.141.
• Ignoring clustering would have led to a type I error
806
Longitudinal Research
• For comparing repeated measures
– Clusters are people
– Can model the repeated measures over time
• Data are usually short and fat

ID   V1   V2   V3   V4
1     2    3    4    7
2     3    6    8    4
3     2    5    7    5

807
Converting Data
• Change data to tall and thin
• Use Data, Restructure in SPSS
• Clusters are ID

ID   V   X
1    1   2
1    2   3
1    3   4
1    4   7
2    1   3
2    2   6
2    3   8
2    4   4
3    1   2
3    2   5
3    3   7
3    4   5

808
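The restructuring itself is mechanical; a pure-Python sketch of the same wide-to-tall conversion (SPSS's Data, Restructure dialog does this for you):

```python
# Wide data: one row per person, columns V1-V4.
wide = {1: [2, 3, 4, 7], 2: [3, 6, 8, 4], 3: [2, 5, 7, 5]}

# Tall-and-thin: one row per (person, occasion, value).
tall = [(pid, occasion + 1, value)
        for pid, values in sorted(wide.items())
        for occasion, value in enumerate(values)]

print(tall[:4])   # [(1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 4, 7)]
```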
(Simple) Example
• Use employee data.sav
– Compare beginning salary and salary
– Would normally use paired samples t-test
• Difference = $17,403, 95% CIs
$16,427.407 to $18,379.555

809
Restructure the Data
• Do it again
– With data tall and thin
• Complex GLM with Time as factor
– ID as cluster
• Difference = $17,403, 95% CIs =
$16,427.407 to $18,379.555

ID   Time   Cash
1     1     $18,750
1     2     $21,450
2     1     $12,000
2     2     $21,900
3     1     $13,200
3     2     $45,000

810
Interesting …
• That wasn‟t very interesting
– What is more interesting is when we have
multiple measurements of the same people
• Can plot and assess trajectories over
time

811
Single Person Trajectory

[Plot: one person's repeated measurements (+ symbols) over time]
812
Multiple Trajectories: What's the
Mean and SD?
[Plot: several individual trajectories over time]
813
Complex Trajectories
• An event occurs
– Can have two effects:
– A jump in the value
– A change in the slope
• Event doesn‟t have to happen at the
same time for each person
– Doesn‟t have to happen at all

814
[Plot: outcome follows Slope 1 until the Event Occurs, jumps, then follows Slope 2]

815
Parameterising
Time   Event   Time2   Outcome
 1      0       0        12
 2      0       0        13
 3      0       0        14
 4      0       0        15
 5      0       0        16
 6      1       0        10
 7      1       1         9
 8      1       2         8
 9      1       3         7
816
Draw the Line

What are the parameter estimates?
817
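The line in the table is exact, so the four parameters (intercept, time slope, jump, and change in slope) can be read off without a regression routine. A pure-Python sketch using the table's rows:

```python
# Rows: (time, event, time2, outcome) from the table.
rows = [(1, 0, 0, 12), (2, 0, 0, 13), (3, 0, 0, 14), (4, 0, 0, 15),
        (5, 0, 0, 16), (6, 1, 0, 10), (7, 1, 1, 9), (8, 1, 2, 8), (9, 1, 3, 7)]

b_time = rows[1][3] - rows[0][3]                     # pre-event slope
b0 = rows[0][3] - b_time * rows[0][0]                # intercept at time 0
b_event = rows[5][3] - (b0 + b_time * 6)             # jump when the event occurs
b_time2 = rows[6][3] - (b0 + b_time * 7 + b_event)   # change in slope after event

# Check: the four estimates reproduce every row exactly.
for t, e, t2, y in rows:
    assert b0 + b_time * t + b_event * e + b_time2 * t2 == y

print(b0, b_time, b_event, b_time2)   # 11 1 -7 -2
```

So the slope is 1 before the event, the event drops the outcome by 7, and the slope afterwards is 1 + (−2) = −1.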
Main Effects and Interactions

• Main effects
– Intercept differences
• Moderator effects
– Slope differences

818
Multilevel Models
• Fixed versus random effects
– Fixed effects are fixed across individuals
(or clusters)
– Random effects have variance
• Levels
– Level 1 – individual measurement
occasions
– Level 2 – higher order clusters
819
More on Levels
• NHS direct study
– Level 1 units: …………….
– Level 2 units: ……………
• Widowhood food study
– Level 1 units ……………
– Level 2 units ……………

820
More Flexibility
• Three levels:
– Level 1: measurements
– Level 2: people
– Level 3: schools

821
More Effects
• Variances and covariances of effects
• Level 1 and level 2 residuals
– Makes R2 difficult to talk about
• Outcome variable
– Yij
• The score of the ith person in the jth group

822
Y    i   j
2.3   1   1
3.2   2   1
4.5   3   1
4.8   1   2
7.2   2   2
3.1   3   2
1.6   4   2

823
Notation
• Notation gets a bit horrid
– Varies a lot between books and programs
• We used to have b0 and b1
– If fixed, that‟s fine
– If random, each person has their own
intercept and slope

824
Standard Errors
• Intercept has standard errors
• Slopes have standard errors
• Random effects have variances
– Those variances have standard errors
• Is there statistically significant variation
between higher level units (people)?
• OR
• Is everyone the same?

825
Programs
• Since version 12
– Can do this in SPSS
– Can't do anything really clever
– Menus are completely unusable
– Have to use syntax

826
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.

827
SPSS Syntax
• MIXED
• relfd with time

relfd is the outcome; time is a
continuous predictor

828
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time

Must specify effect as
fixed first

829
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)

Specify random effects: the intercept
and time are random.
SPSS assumes that your level 2 units
are subjects, and needs to know the id
variable.
830
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
Covariance matrix of random
effects is unstructured.
(Alternative is id – identity or vc
– variance components).
831
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.

832
The Output
• Information criteria
– We'll come back to these

Information Criteria(a)

-2 Restricted Log Likelihood             64899.758
Akaike's Information Criterion (AIC)     64907.758
Hurvich and Tsai's Criterion (AICC)      64907.763
Bozdogan's Criterion (CAIC)              64940.134
Schwarz's Bayesian Criterion (BIC)       64936.134

The information criteria are displayed in smaller-is-better forms.
a. Dependent Variable: relfd.
833
Fixed Effects
• Not useful here, useful for interactions
Type III Tests of Fixed Effects(a)

Source      Numerator df   Denominator df          F    Sig.
Intercept              1              741   3251.877    .000
time                   1          741.000      2.550    .111
a. Dependent Variable: relfd.

834
Estimates of Fixed Effects
• Interpreted as regression equation
Estimates of Fixed Effects(a)

                                                        95% Confidence Interval
Parameter   Estimate   Std. Error    df        t   Sig.   Lower Bound   Upper Bound
Intercept      21.90          .38   741   57.025   .000         21.15         22.66
time            -.06          .04   741   -1.597   .111          -.14           .01
a. Dependent Variable: relfd.

835
Covariance Parameters
Estimates of Covariance Parameters(a)

Parameter                              Estimate   Std. Error
Residual                               64.11577    1.0526353
Intercept + time [subject = id]
  UN (1,1)                             85.16791    5.7003732
  UN (2,1)                             -4.53179     .5067146
  UN (2,2)                             .7678319     .0636116
a. Dependent Variable: relfd.

836
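A rough way to read this table: dividing each estimate by its standard error gives a Wald z statistic (crude for variances, whose sampling distributions are skewed, but indicative). Using the slide's own numbers:

```python
# Estimates and standard errors from the covariance-parameter table above
params = {
    "residual": (64.11577, 1.0526353),
    "UN(1,1)":  (85.16791, 5.7003732),   # intercept variance
    "UN(2,1)":  (-4.53179, 0.5067146),   # intercept-slope covariance
    "UN(2,2)":  (0.7678319, 0.0636116),  # slope variance
}

for name, (est, se) in params.items():
    print(f"{name}: z = {est / se:.2f}")

# The intercept-slope covariance is many SEs from zero, which is
# why dropping it (covtype VC) makes the model noticeably worse.
```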
Change Covtype to VC
• We know that this is wrong
– The covariance of the effects was statistically
significant
– Can also see if it was wrong by comparing
information criteria
• We have removed a parameter from the
model
– Model is worse
– Model is more parsimonious
• Is it much worse, given the increase in parsimony?
837
Information Criteria(a)

                               UN Model     VC Model
-2 Restricted Log Likelihood  64899.758    65041.891
AIC                           64907.758    65047.891
AICC                          64907.763    65047.894
CAIC                          64940.134    65072.173
BIC                           64936.134    65069.173

The information criteria are displayed in smaller-is-better forms.
a. Dependent Variable: relfd.
Lower is better.
838
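The AIC values follow directly from the -2 log-likelihoods: AIC = -2LL + 2k, where k is the number of covariance parameters (4 for UN: residual, two variances, one covariance; 3 for VC, which drops the covariance). A quick check of the slide's numbers:

```python
# AIC = -2LL + 2k; k = number of (co)variance parameters
neg2ll_un, k_un = 64899.758, 4  # residual + 2 variances + 1 covariance
neg2ll_vc, k_vc = 65041.891, 3  # covariance dropped

aic_un = neg2ll_un + 2 * k_un
aic_vc = neg2ll_vc + 2 * k_vc
print(aic_un, aic_vc)  # 64907.758 65047.891, matching the tables

# Smaller is better: despite having one fewer parameter, the VC
# model's AIC is about 140 points worse, so the extra covariance
# parameter earns its keep.
print(aic_vc - aic_un)
```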
• So far, all a bit dull
• We want some more predictors, to make it
more exciting
– E.g. female
MIXED relfd with time female
/fixed = time female time*female
• What does the interaction term represent?

839
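The interaction term gives each group its own slope for time. A sketch answering the question above (all coefficient values are invented for illustration; female is coded 0/1):

```python
# time*female interaction: the slope for time differs by group.
# All coefficient values here are made up for illustration.
b0, b_time, b_female, b_interaction = 20.0, -0.10, 1.5, -0.05

def predicted(time, female):
    return (b0 + b_female * female
            + (b_time + b_interaction * female) * time)

slope_male = b_time                    # female = 0
slope_female = b_time + b_interaction  # female = 1
print(slope_male, slope_female)
# The interaction coefficient IS the difference between the
# two groups' slopes; b_female is the intercept difference.
```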
Extending Models
• Models can be extended
– Any kind of regression can be used
• Logistic, multinomial, Poisson, etc
– More levels
• Children within classes within schools
• Measures within people within classes within prisons
– Multiple membership / cross classified models
• Children within households and classes, but households
not nested within class
• Need a different program
– E.g. MLwiN
840
MLwiN Example (very quickly)

841
Books
Singer, JD and Willett, JB (2003). Applied
Longitudinal Data Analysis: Modeling Change
and Event Occurrence. Oxford, Oxford
University Press.
Examples at:
http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm

842
The End

843

```