# Introduction to Linear Models

Dr Peter Baghurst
Public Health Research Unit
## Extracting the 'signal' from the 'noise'

    response measurement = signal + noise

The term 'signal' refers to the systematic component.

A consulting statistician will often want to describe this systematic component in terms of a linear model.
## What is a linear model?

The expected value (Y) of the 'response' variable is a linear combination of explanatory variables X1, X2, X3, ...

    Y = b1·X1 + b2·X2 + b3·X3 + ...

We are usually interested in the estimation and interpretation of the coefficients b1, b2, b3, etc.

Note: While many texts use the words 'dependent' and 'independent' variables, these words have special meaning in a statistical context and it is better to avoid their use here.
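As a quick numerical sketch (with entirely made-up numbers and coefficients), the 'signal' is literally a weighted sum of the explanatory columns:

```python
import numpy as np

# Hypothetical data: two explanatory variables and made-up
# coefficients b1 = 2.0 and b2 = -0.5 (illustration only).
X1 = np.array([1.0, 2.0, 3.0])
X2 = np.array([4.0, 5.0, 6.0])
b1, b2 = 2.0, -0.5

Y = b1 * X1 + b2 * X2  # the systematic component: a linear combination
```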
## Example: simple linear regression

    Y = β0 + β1·X

where β0 is the intercept and β1 is the slope.

*(Figure: a sketch of a straight line of Y against X.)*
*(Figure: effect of vitamin D on calcium absorption; Ca absorption (0.5 to 1.5) plotted against vitamin D (50 to 200). C Nordin 2004.)*
## Example: calcium absorption and vitamin D

    Calcium absorption = constant term × 1 + slope × vitamin D

Calcium absorption is a linear combination of '1' and the vitamin D level. The systematic component, or signal, can be calculated from estimates of the constant term and the slope term.
The data might look like this:

| Subject | Y (= Ca absorption) | X1 (= 1) | X2 (= vitamin D) |
|---------|---------------------|----------|------------------|
| 1 | 0.52 | 1 | 55 |
| 2 | 0.73 | 1 | 78 |
| 3 | 0.82 | 1 | 97 |
| 4 | 0.85 | 1 | 105 |
| 5 | 0.96 | 1 | 120 |
| 6 | 0.97 | 1 | 112 |
| 7 | 1.15 | 1 | 140 |
| 8 | 1.23 | 1 | 163 |
| 9 | 1.39 | 1 | 181 |

    Calcium absorption = constant term × 1 + slope × vitamin D
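The constant and slope can be estimated from these nine rows by least squares. A minimal sketch in Python with NumPy (the numbers come from the table above; the code itself is illustrative):

```python
import numpy as np

# The nine subjects: Y = Ca absorption, X1 = 1, X2 = vitamin D level.
Y = np.array([0.52, 0.73, 0.82, 0.85, 0.96, 0.97, 1.15, 1.23, 1.39])
vitD = np.array([55, 78, 97, 105, 120, 112, 140, 163, 181], dtype=float)

X = np.column_stack([np.ones_like(vitD), vitD])  # columns X1 and X2
b, *_ = np.linalg.lstsq(X, Y, rcond=None)        # least-squares estimates
constant, slope = b
print(constant, slope)
```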
## Example 2: SIDS organ-weight data

*(Figure: organ weights (0 to 50) plotted for the non-SIDS group, all subjects, and the SIDS group, with the mean of Y marked for each.)*
## A simple t-test for comparing two groups uses linear models

It compares the very simplest linear model:

    Y = β0

(all observations are from the same population)

with the model:

    Y = β1   if the study subject is from Group 1
    Y = β2   if the study subject is from Group 2

(separate effects for Group 1 and Group 2)

(The SIDS data provide an example of a category variable.)
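This equivalence is easy to verify numerically. A sketch with made-up data for two groups: regressing Y on a 0/1 group dummy gives an intercept equal to the Group 1 mean and a coefficient equal to the difference between the group means, which is exactly the quantity a t-test examines.

```python
import numpy as np

# Hypothetical two-group data (illustration only).
y1 = np.array([4.1, 5.0, 4.6, 5.3])   # Group 1
y2 = np.array([6.2, 7.1, 6.8, 7.0])   # Group 2
Y = np.concatenate([y1, y2])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # group dummy

X = np.column_stack([np.ones_like(g), g])
(b0, b1), *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.isclose(b0, y1.mean())              # intercept = Group 1 mean
assert np.isclose(b1, y2.mean() - y1.mean())  # coefficient = difference in means
```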
## We can, of course, have both continuous and categorical variables in the same linear model

| Subject | Y (= birthweight) | X1 (= weeks gestation) | X2 (= multiple) |
|---------|-------------------|------------------------|-----------------|
| 1 | 3010 | 39 | singleton |
| 2 | 975 | 27 | twin |
| ... | ... | ... | ... |
| 50 | 4100 | 40 | singleton |
## Birth-weight example

A simple linear model which may be used as a starting point for describing such data is:

    Birth weight = constant + b1 × gestational age + b2 × twin effect
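A minimal sketch of fitting such a model, using entirely made-up birth weights (grams), gestational ages (weeks) and a 0/1 twin indicator:

```python
import numpy as np

# Hypothetical data (illustration only).
gest = np.array([35.0, 39.0, 40.0, 27.0, 31.0, 38.0])
twin = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
bw = np.array([2600.0, 3100.0, 3300.0, 900.0, 1400.0, 2300.0])

# Birth weight = constant + b1 * gestational age + b2 * twin effect
X = np.column_stack([np.ones_like(gest), gest, twin])
(constant, b1, b2), *_ = np.linalg.lstsq(X, bw, rcond=None)
# b1: grams gained per extra week; b2: the (negative) twin effect
```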
## Operationalising categorical variables: the SIDS organ-weight data

| Subject | Y (= organ wt) | X2 (= SIDS) |
|---------|----------------|-------------|
| 1 | 0.52 | yes |
| 2 | 0.73 | yes |
| 3 | 0.82 | yes |
| 4 | 0.85 | yes |
| 5 | 0.96 | no |
| 6 | 0.97 | no |
| 7 | 1.15 | no |
| 8 | 1.23 | no |
| 9 | 1.39 | no |

Question: How do we operationalise the yes/no for linear modelling?
## Operationalising category variables

The variable SIDS has the value 'yes' if the baby died of SIDS, and 'no' if the baby died of other causes.

One way would be to define a new data variable which takes the value 1 if SIDS = 'yes', and the value 0 if SIDS = 'no':

    Y = constant × 1 + [a coefficient] × (1 for a SIDS death, 0 for a non-SIDS death)

What is this 'coefficient'? It is the SIDS effect:

    Y = constant × 1 + SIDS effect × (1 for a SIDS death, 0 for a non-SIDS death)
## Example: SIDS organ weights

Linear models:

(1) Organ weight (Y) = constant

    Y = constant × 1

(2) Organ weight (Y) = constant + 'a SIDS effect'

    Y = constant × 1 + SIDS effect × (1 for a SIDS death, 0 for a non-SIDS death)
| Subject | Y (= organ wt) | X1 (= 1) | X2 (= SIDS: 1 = yes, 0 = no) |
|---------|----------------|----------|------------------------------|
| 1 | 0.52 | 1 | 1 |
| 2 | 0.73 | 1 | 1 |
| 3 | 0.82 | 1 | 1 |
| 4 | 0.85 | 1 | 1 |
| 5 | 0.96 | 1 | 0 |
| 6 | 0.97 | 1 | 0 |
| 7 | 1.15 | 1 | 0 |
| 8 | 1.23 | 1 | 0 |
| 9 | 1.39 | 1 | 0 |

    Y = constant × 1 + SIDS effect × (1 for a SIDS death, 0 for a non-SIDS death)

X1 and X2 are often referred to as "dummy" variables.
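Constructing such dummy variables from the raw yes/no column is mechanical; a sketch:

```python
import numpy as np

# The yes/no SIDS column from the data above.
sids = np.array(["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"])

X1 = np.ones(len(sids))             # the constant column of 1s
X2 = (sids == "yes").astype(float)  # 1 for a SIDS death, 0 otherwise
```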
## However, there are other ways of 'operationalising' a category variable

So, with a mind to category variables or 'factors' with more than two categories, we look at an alternative.

For the organ-weight data we define two new data variables which take on the values 0 or 1, depending on the cause of death, and which give rise to the model:

    Organ wt = constant × 1 + SIDS effect × (0 or 1) + non-SIDS effect × (0 or 1)

This linear model appears to require us to estimate 3 coefficients for the same analysis.
| Subject | Y (= organ wt) | X1 (= 1) | X2 (= SIDS) | X3 (= non-SIDS) |
|---------|----------------|----------|-------------|-----------------|
| 1 | 0.52 | 1 | 1 | 0 |
| 2 | 0.73 | 1 | 1 | 0 |
| 3 | 0.82 | 1 | 1 | 0 |
| 4 | 0.85 | 1 | 1 | 0 |
| 5 | 0.96 | 1 | 0 | 1 |
| 6 | 0.97 | 1 | 0 | 1 |
| 7 | 1.15 | 1 | 0 | 1 |
| 8 | 1.23 | 1 | 0 | 1 |
| 9 | 1.39 | 1 | 0 | 1 |

    Organ wt = constant × 1 + SIDS effect × (0 or 1) + non-SIDS effect × (0 or 1)
## For a category variable, many linear models give the same predicted values

Suppose the average organ wt for SIDS victims is 30, and for other deaths is 15.

| constant | SIDS 'effect' | non-SIDS 'effect' | predicted wt (SIDS death) | predicted wt (non-SIDS death) |
|----------|---------------|-------------------|---------------------------|-------------------------------|
| 5 | 25 | 10 | 30 | 15 |
| 30 | 0 | -15 | 30 | 15 |
| 15 | 15 | 0 | 30 | 15 |
| 0 | 30 | 15 | 30 | 15 |
| 22.5 | 7.5 | -7.5 | 30 | 15 |

But note: the difference between the SIDS effect and the non-SIDS effect is always 15.
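The table can be checked directly: every row of coefficients reproduces the same predictions, and only the difference between the two effects is pinned down.

```python
# The coefficient sets (constant, SIDS effect, non-SIDS effect)
# from the table above.
coef_sets = [
    (5.0, 25.0, 10.0),
    (30.0, 0.0, -15.0),
    (15.0, 15.0, 0.0),
    (0.0, 30.0, 15.0),
    (22.5, 7.5, -7.5),
]
for constant, sids_eff, non_sids_eff in coef_sets:
    pred_sids = constant + sids_eff      # SIDS death: X2 = 1, X3 = 0
    pred_non = constant + non_sids_eff   # non-SIDS death: X2 = 0, X3 = 1
    assert pred_sids == 30.0 and pred_non == 15.0
    assert sids_eff - non_sids_eff == 15.0  # only this difference is identifiable
```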
## Statistical packages differ in their handling of categorical variables or factors

Mercifully, most packages generate the 'dummy variables' behind the scenes:

| Subject | Response variable Y | Category variable (4 levels) | X1 | X2 | X3 | X4 |
|---------|---------------------|------------------------------|----|----|----|----|
| 1 | 15.3 | 1 | 1 | 0 | 0 | 0 |
| 2 | 17.6 | 1 | 1 | 0 | 0 | 0 |
| 3 | 22.4 | 2 | 0 | 1 | 0 | 0 |
| 4 | 21.3 | 2 | 0 | 1 | 0 | 0 |
| 5 | 26.4 | 3 | 0 | 0 | 1 | 0 |
| 6 | 25.9 | 3 | 0 | 0 | 1 | 0 |
| 7 | 32.3 | 4 | 0 | 0 | 0 | 1 |
| 8 | 34.6 | 4 | 0 | 0 | 0 | 1 |
## Statistical packages differ in their handling of categorical variables or factors (2)

- By default, the coefficient of the first (or last) dummy variable, corresponding to the first (or last) level of the factor, is often set to zero by the package.
- In setting a coefficient to zero, the corresponding level of the factor becomes the 'reference' level to which the 'effect estimates' for the other levels are compared.
- Some packages offer a method for selecting an 'in-between' level of a factor with 3 or more levels as the reference (although sometimes this is available only in the scripting facility, as opposed to the 'point-and-click' user interface).
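A sketch of what this reference-level coding looks like done by hand, using the 4-level data from the table above: level 1 is taken as the reference, so only dummies for levels 2 to 4 enter the model.

```python
import numpy as np

# Data from the 4-level factor table above.
levels = np.array([1, 1, 2, 2, 3, 3, 4, 4])
Y = np.array([15.3, 17.6, 22.4, 21.3, 26.4, 25.9, 32.3, 34.6])

# Constant column plus dummies for levels 2-4; level 1 is the reference.
X = np.column_stack([np.ones(len(levels))] +
                    [(levels == k).astype(float) for k in (2, 3, 4)])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The intercept is the level-1 mean; the other coefficients are
# differences from that reference level.
assert np.isclose(b[0], Y[levels == 1].mean())
assert np.isclose(b[1], Y[levels == 2].mean() - Y[levels == 1].mean())
```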
## Example: comparing 3 drug treatments

One linear model for each subject in the trial is:

    Y = constant + drug 1 effect × (0 or 1) + drug 2 effect × (0 or 1) + drug 3 effect × (0 or 1)

Some analysis packages fix the estimate for drug 1 at zero. This forces the 'real' effect of drug 1 into the constant term (it is important to realise this!), so the model becomes:

    Y = constant + drug 2 effect × (0 or 1) + drug 3 effect × (0 or 1)

(Take-home message: you need to know what your analysis package does!)
## Should I treat my measurements as a continuous or categorical variable?

*(Figure: response variable Y (0 to 25) plotted against a category variable X with 6 levels.)*
## "Ordinal" categorical variables

- When the individual levels of a category variable or factor have a natural order, such as '<20 years', '20-30 years', '>30 years', we call it an 'ordinal factor'. The order of the levels is meaningful, because we may be interested in trends with age.
- Compare this with, say, the factor 'eye colour', which has values like 'brown', 'blue' and 'hazel'; for most analytic purposes these could appear in any order.
## Categorical treatment

*(Figure: bar chart of response variable Y (0 to 25) for each of the 6 levels of category variable X.)*

Linear models, with X treated categorically, yield 'effects' equal to the differences between the bar heights and the bar height for the 'reference' group.
## If we treat our variable as continuous, i.e. Y = a + bX

*(Figure: the same bar chart with a fitted straight line overlaid; the straight line does not fit well at all!)*

Many of these common problems would never arise in the first place if the relevant graphs were plotted!
## Note on 'trend' statistics

While the previous discussion has suggested that fitting a categorical variable as if it were continuous is a bad thing to do, this is exactly what is done when testing for (linear) trend across categories in very noisy data.

Here the emphasis is not so much on fitting an accurate model to the data, but on deciding whether there is a significant linear component of 'dose-response' across categories which contain ordered information.
## The term 'linear' in linear models should not imply we only use them to fit straight lines!

- Y = b0 + b1·X fits a straight line.
- Y = b0 + b1·X + b2·X² fits a simple curve (called a quadratic).
- Y = b0 + b1·X + b2·X² + b3·X³ (a cubic polynomial) fits more complex curvature.
- To fit very complex curvature we can string cubic polynomials together using 'splines'.
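The point is that 'linear' refers to the coefficients, not to X. A sketch with made-up numbers: data generated from an exact quadratic are recovered perfectly by a model that is linear in b0, b1 and b2.

```python
import numpy as np

# Hypothetical curved data generated exactly from Y = 1 + 2X - 0.5X^2.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 1 + 2 * X - 0.5 * X**2

# The design matrix includes X^2, but the model is still linear in the b's.
design = np.column_stack([np.ones_like(X), X, X**2])
b, *_ = np.linalg.lstsq(design, Y, rcond=None)
```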
## Sometimes models which look non-linear can be made linear by a simple transformation

- Y = a·exp{bX} does not look like a linear model, but
- log(Y) = log(a) + b·X is clearly a linear model in X, with constant term log(a) and log(Y) the new response variable.
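A numerical sketch of that transformation, with made-up values a = 2 and b = 0.3: fitting a straight line to log(Y) recovers both parameters.

```python
import numpy as np

# Hypothetical exponential data Y = a * exp(b * X) with a = 2, b = 0.3.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * np.exp(0.3 * X)

# Taking logs gives the linear model log(Y) = log(a) + b * X.
design = np.column_stack([np.ones_like(X), X])
coef, *_ = np.linalg.lstsq(design, np.log(Y), rcond=None)

assert np.isclose(np.exp(coef[0]), 2.0)  # recovers a
assert np.isclose(coef[1], 0.3)          # recovers b
```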
## Linear models are useful for non-continuous response variables as well

Binary or yes/no events:

- we model the log of the odds, p/(1-p), of the event as a linear model, i.e.
- log{p/(1-p)} = b0 + b1·X1 + ...
- (this is called logistic regression, and log{p/(1-p)} is called the logit of p)

Example: the dependence of the way babies are delivered on the age of the mother.
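The logit transformation itself can be sketched directly (the function names here are just illustrative):

```python
import numpy as np

def logit(p):
    """Map a probability p in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """The logistic function: map any real value back into (0, 1)."""
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
assert np.allclose(inv_logit(logit(p)), p)  # the two are inverses
assert logit(0.5) == 0.0                    # even odds give a logit of zero
```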
## Why model logit(p) and not just 'p', the probability?

- Partly because, for a rather theoretical reason, this is a 'natural' thing to do.
- Partly for computational reasons: the logit of a probability can vary between plus and minus infinity, whereas p, by definition, must lie between 0 and 1.
- Partly because the anti-log, or exponential, of the coefficients of a linear model for logit(p) may be simply interpreted as "odds ratios" in many situations.
*(Figure: proportions (%) of elective, emergency and vaginal deliveries plotted against maternal age (15 to 45), shown on both the percentage scale and the logit(p) scale.)*
## Linear models for non-continuous response variables (continued)

Count data (Poisson regression):

- we model the log of the number of events n in a follow-up time t (or over an area A, say) as
- log(n) = c0 + c1·X1 + c2·X2 + ...

Survival analysis (Cox proportional hazards models):

- we model the "instantaneous hazard" h(t) of an event occurring at time t (given that an individual has already survived to time t) as
- log{h(t)} = log{h0(t)} + d1·X1 + d2·X2 + ...
- where h0(t) is the hazard for the reference group.
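A tiny sketch of why the log link helps for counts: with made-up coefficients, the expected count exp(c0 + c1·X1) stays positive however extreme the linear predictor gets, whereas a model for the raw count could predict negative values.

```python
import numpy as np

# Hypothetical Poisson-regression coefficients (illustration only).
c0, c1 = 0.5, 0.2
X1 = np.array([-10.0, 0.0, 3.0, 10.0])

# log(n) = c0 + c1 * X1, so the expected count is exp of the linear predictor.
expected_n = np.exp(c0 + c1 * X1)
assert np.all(expected_n > 0)  # a count can never be predicted negative
```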
