# Regression: (2) Multiple Linear Regression and Path Analysis

BIOL4062/5062 lecture slides

## Multiple Linear Regression and Path Analysis
- Multiple linear regression
  - assumptions
  - parameter estimation
  - hypothesis tests
  - selecting independent variables
  - collinearity
  - polynomial regression
- Path analysis
## Regression

- One dependent variable: Y
- Independent variables: X(1), X(2), X(3), ...
## Purposes of Regression

1. Describe the relationship between Y and the X's
2. Quantitative prediction of Y
3. Relationship between Y and X, controlling for C
4. Which of the X's are most important?
5. Find the best mathematical model
6. Compare regression relationships: Y1 on X, Y2 on X
7. Assess interactive effects of the X's
- Simple regression: one X
- Multiple regression: two or more X's

Y = ß0 + ß1·X(1) + ß2·X(2) + ß3·X(3) + ... + ßk·X(k) + E
## Multiple linear regression: assumptions (1)

- For any specific combination of X's, Y is a (univariate) random variable with a probability distribution having finite mean and variance (Existence)
- Y values are statistically independent of one another (Independence)
- The mean value of Y, given the X's, is a linear function of the X's (Linearity)
## Multiple linear regression: assumptions (2)

- The variance of Y is the same for any fixed combination of X's (Homoscedasticity)
- For any fixed combination of X's, Y has a normal distribution (Normality)
- There are no measurement errors in the X's (X's measured without error)
## Multiple linear regression: parameter estimation

Y = ß0 + ß1·X(1) + ß2·X(2) + ß3·X(3) + ... + ßk·X(k) + E

- Estimate the ß's in multiple regression using least squares
- The sizes of the coefficients are not good indicators of the importance of the X variables
- Number of data points in multiple regression:
  - at least one more than the number of X's
  - preferably 5 times the number of X's
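A minimal sketch of least-squares estimation of the ß's with NumPy. The data are simulated and the coefficient values are illustrative, not from the lecture's dataset:

```python
# Least-squares estimation of the beta's in a multiple regression.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                      # n observations, k predictors (n > 5k, as advised)
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.3, size=n)

# Add a column of ones for the intercept beta_0, then solve min ||y - X b||^2.
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.0, 0.5]
```

With n well above 5 times the number of X's, the estimates land close to the generating values; with fewer points than parameters, `lstsq` would still return an answer, but the regression problem is underdetermined.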
## Why do Large Animals have Large Brains?

(Schoenemann, Brain Behav. Evol., 2004)

[Scatterplot matrix of LCNS against LMASS, LFAT, LMUSCLE, LHEART, and LBONE; N = 39]

Multiple regression of Y [Log(CNS)] on:

| X           | ß     | SE(ß) |
|-------------|-------|-------|
| Log(Mass)   | -0.49 | 0.70  |
| Log(Fat)    | -0.07 | 0.10  |
| Log(Muscle) |  1.03 | 0.54  |
| Log(Heart)  |  0.42 | 0.22  |
| Log(Bone)   | -0.07 | 0.30  |
## Multiple linear regression: hypothesis tests

Usually test:

H0: Y = ß0 + ß1·X(1) + ß2·X(2) + ... + ßj·X(j) + E

H1: Y = ß0 + ß1·X(1) + ß2·X(2) + ... + ßj·X(j) + ... + ßk·X(k) + E

F-test with k-j and n-k-1 degrees of freedom ("partial F-test")

H0: variables X(j+1), ..., X(k) do not help explain variability in Y
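A sketch of the partial F-test on simulated data, comparing a reduced model with j predictors against the full model with k. The dataset and the choice of j and k are illustrative:

```python
# Partial F-test: do X(j+1), ..., X(k) help explain variability in Y?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, j, k = 60, 2, 4
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # last two X's irrelevant

def rss(Xmat, y):
    """Residual sum of squares from an intercept-plus-Xmat least-squares fit."""
    Xd = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return resid @ resid

rss_full = rss(X, y)          # all k predictors
rss_red = rss(X[:, :j], y)    # first j predictors only
F = ((rss_red - rss_full) / (k - j)) / (rss_full / (n - k - 1))
p = stats.f.sf(F, k - j, n - k - 1)
print(F, p)  # p is the probability of an F this large if the extra X's are useless
```

Setting j = 0 (an intercept-only reduced model) gives the overall test of the regression described on the next slide.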
## Multiple linear regression: hypothesis tests

e.g. Test the significance of the overall multiple regression:

H0: Y = ß0 + E

H1: Y = ß0 + ß1·X(1) + ß2·X(2) + ... + ßk·X(k) + E

Or test the significance of deleting an independent variable.
## Why do Large Animals have Large Brains?

Multiple regression of Y [Log(CNS)] on:

| X           | ß     | SE(ß) | P    |
|-------------|-------|-------|------|
| Log(Mass)   | -0.49 | 0.70  | 0.49 |
| Log(Fat)    | -0.07 | 0.10  | 0.52 |
| Log(Muscle) |  1.03 | 0.54  | 0.07 |
| Log(Heart)  |  0.42 | 0.22  | 0.06 |
| Log(Bone)   | -0.07 | 0.30  | 0.83 |

Each P tests whether removal of that variable reduces the fit.
## Multiple linear regression: selecting independent variables

Reasons for selecting a subset of the independent variables (X's):

- cost (financial and other)
- simplicity
- improved prediction
- improved explanation
## Multiple linear regression: selecting independent variables

- Partial F-test
  - predetermined forward selection (e.g. enter Mass, Bone, Heart, Muscle, Fat in a fixed order)
  - forward selection based upon improvement in fit
  - backward selection based upon improvement in fit
  - stepwise (backward/forward)
- Mallows' C(p)
- AIC
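A minimal sketch of forward selection driven by the partial F-test p-value, on simulated data. The variable layout is illustrative; the α-to-enter = 0.15 threshold matches the slides:

```python
# Forward selection: at each step, add the candidate predictor with the
# smallest partial-F p-value; stop when no candidate meets alpha-to-enter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 80, 5
X = rng.normal(size=(n, k))
y = 1.0 + 3.0 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=n)  # only columns 1, 3 matter

def rss(cols):
    """Residual sum of squares for the model using the listed columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

alpha_enter = 0.15
selected, remaining = [], list(range(k))
while remaining:
    best = None
    for c in remaining:
        rss0, rss1 = rss(selected), rss(selected + [c])
        df2 = n - (len(selected) + 1) - 1
        F = (rss0 - rss1) / (rss1 / df2)      # partial F for adding one variable
        p = stats.f.sf(F, 1, df2)
        if best is None or p < best[1]:
            best = (c, p)
    if best[1] >= alpha_enter:                # no candidate is good enough
        break
    selected.append(best[0])
    remaining.remove(best[0])
print(selected)  # the informative columns (1, then 3) enter first
```

A backward or stepwise procedure follows the same pattern, testing removals (against α-to-remove) as well as additions.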
## Why do Large Animals have Large Brains?

- Complete model (r² = 0.97)
- Forward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  1. Constant (r² = 0.00)
  2. Constant + Muscle (r² = 0.97)
  3. Constant + Muscle + Heart (r² = 0.97)
  4. Constant + Muscle + Heart + Mass (r² = 0.97)

Final model: -0.18 - 0.82·Mass + 1.24·Muscle + 0.39·Heart
## Why do Large Animals have Large Brains?

- Complete model (r² = 0.97)
- Backward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  1. All (r² = 0.97)
  2. Remove Bone (r² = 0.97)
  3. Remove Fat (r² = 0.97)

Final model: -0.18 - 0.82·Mass + 1.24·Muscle + 0.39·Heart
## Comparing models

- Mallows' C(p)
  - C(p) = (k-p)·F(p) + (2p-k+1)
  - k parameters in the full model; p parameters in the restricted model
  - F(p) is the F value comparing the fit of the restricted model with that of the full model
  - The model with the lowest C(p) is best
- Akaike Information Criterion (AIC)
  - AIC = n·log(σ²) + 2p
  - The model with the lowest AIC is best
  - Can compare models that are not nested within one another
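A sketch of model comparison by AIC in the slide's form, n·log(σ²) + 2p, with σ² = RSS/n and p counting fitted parameters (including the intercept). The data are simulated, with only the first predictor truly related to Y:

```python
# Compare candidate models by AIC = n*log(RSS/n) + 2p; lowest AIC is best.
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 3))
y = 0.5 + 2.0 * X[:, 0] + rng.normal(size=n)   # only column 0 matters

def aic(cols):
    """AIC for the least-squares model using the listed columns of X."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    p = Xd.shape[1]                            # parameters, including intercept
    return n * np.log(r @ r / n) + 2 * p

for cols in ([], [0], [0, 1], [0, 1, 2]):
    print(cols, round(aic(cols), 2))
# Models containing column 0 come out far below the constant-only model;
# the 2p penalty discourages adding the irrelevant columns.
```

Unlike the partial F-test, none of these comparisons require one model to be nested inside the other.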
## Comparing models

| Model                          | AIC      |
|--------------------------------|----------|
| Constant                       | 48.98071 |
| Mass                           | -14.3334 |
| Fat                            | -6.60649 |
| Muscle                         | -17.4755 |
| Bone                           | -10.7256 |
| Heart                          | -15.569  |
| Mass, Muscle, Heart            | -12.0555 |
| Mass, Fat, Muscle, Bone, Heart | -7.569   |
## Collinearity

- If two (or more) X's are exactly linearly related, e.g.
  - X(3) = 5·X(2) + 16, or
  - X(2) = 4·X(1) + 16·X(4)
  - then they are collinear and the regression problem is indeterminate
- If they are nearly linearly related (near collinearity), coefficients and tests are very inaccurate
- Possible remedies:
  - centering (mean = 0)
  - scaling (SD = 1)
  - regression on the first few principal components
  - ridge regression
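Near collinearity can be diagnosed with variance inflation factors, VIF(i) = 1/(1 - R²(i)), where R²(i) comes from regressing X(i) on the other X's; VIF above roughly 10 is a common warning sign (the threshold is a rule of thumb, not from the slides). A sketch on simulated data with a built-in near-collinearity:

```python
# Variance inflation factors for detecting near collinearity among the X's.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 5 * x2 + rng.normal(scale=0.1, size=n)   # nearly X(3) = 5*X(2): near collinear
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: regress X[:, i] on the remaining columns."""
    others = np.delete(X, i, axis=1)
    Xd = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(Xd, X[:, i], rcond=None)
    resid = X[:, i] - Xd @ beta
    r2 = 1 - (resid @ resid) / np.sum((X[:, i] - X[:, i].mean()) ** 2)
    return 1 / (1 - r2)

for i in range(3):
    print(i, round(vif(X, i), 1))   # x2 and x3 show very large VIFs; x1 is near 1
```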
## Curvilinear (Polynomial) Regression

Y = ß0 + ß1·X + ß2·X² + ß3·X³ + ... + ßk·X^k + E

- Used to fit fairly complex curves to data
- ß's estimated using least squares
- Use sequential partial F-tests, or AIC, to find how many terms to use
  - k > 3 is rare in biology
- Better to transform the data and use simple linear regression, when possible
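A sketch of choosing the polynomial degree by AIC, in the spirit of the sequential approach above. The data are simulated from a true quadratic; curve shape and noise level are illustrative:

```python
# Fit polynomials of increasing degree by least squares; compare by AIC.
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = np.linspace(0, 10, n)
y = 0.2 - 0.05 * x + 0.05 * x**2 + rng.normal(scale=0.3, size=n)  # true degree 2

def poly_aic(deg):
    """AIC = n*log(RSS/n) + 2p for a least-squares polynomial of given degree."""
    coefs = np.polyfit(x, y, deg)          # highest-degree coefficient first
    resid = y - np.polyval(coefs, x)
    p = deg + 1                            # parameters, including the constant term
    return n * np.log(resid @ resid / n) + 2 * p

aics = {d: poly_aic(d) for d in range(1, 5)}
best = min(aics, key=aics.get)
print(best, {d: round(a, 1) for d, a in aics.items()})
```

The quadratic term improves the fit by far more than its 2-point AIC penalty, so degrees below 2 lose; degrees above 2 pay the penalty without a matching gain.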
## Curvilinear (Polynomial) Regression

[Figure: allele frequency vs. miles east of Southport, CT, with fitted polynomials of degree 1-3; from Sokal and Rohlf]

Fitted curves:

- Degree 1: Y = 0.066 + 0.00727·X
- Degree 2: Y = 0.117 + 0.00085·X + 0.00009·X²
- Degree 3: Y = 0.201 - 0.01371·X + 0.00061·X² - 0.000005·X³

| Degree | r²    | P(remove) | AIC   | Sigma  |
|--------|-------|-----------|-------|--------|
| 1      | 0.814 | 0.000     | -88.6 | 0.0696 |
| 2      | 0.846 | 0.060     | -89.9 | 0.0632 |
| 3      | 0.884 | 0.034     | -92.8 | 0.0547 |
## Path Analysis

- Models with causal structure
- Represented by a path diagram
- All variables quantitative
- All path relationships assumed linear
  - (transformations may help)

[Path diagram with variables A, B, C, D, E]
## Path Analysis

- All paths are one way
  - A ⇒ C and C ⇒ A cannot both be present
- No loops
- Some variables may not be directly observed:
  - residual variables (U)
- Some variables are not observed but are known to exist:
  - latent variables (D)

[Path diagram with variables A, B, C, D, E and residual variable U]
## Path Analysis

- Path coefficients and other statistics are calculated using multiple regressions
- Variables are:
  - centered (mean = 0), so there are no constants in the regressions
  - often standardized (SD = 1)
- So path coefficients usually lie between -1 and +1
- Paths with coefficients not significantly different from zero may be eliminated
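A minimal sketch of computing path coefficients as standardized partial regression coefficients: each downstream variable is regressed on its direct causes after all variables are centered and scaled to SD 1. The diagram here is a simplified hypothetical (A, B ⇒ C; C ⇒ E), not the slide's model, and the data are simulated:

```python
# Path coefficients via multiple regression on standardized variables.
import numpy as np

rng = np.random.default_rng(6)
n = 200
A = rng.normal(size=n)
B = rng.normal(size=n)
C = 0.6 * A + 0.3 * B + rng.normal(scale=0.5, size=n)   # A, B => C
E = 0.8 * C + rng.normal(scale=0.4, size=n)             # C => E

def standardize(v):
    """Center to mean 0 and scale to SD 1."""
    return (v - v.mean()) / v.std()

def path_coefs(y, *causes):
    """Regress standardized y on standardized causes (no constant needed)."""
    Z = np.column_stack([standardize(c) for c in causes])
    beta, *_ = np.linalg.lstsq(Z, standardize(y), rcond=None)
    return beta

print(path_coefs(C, A, B))   # paths A->C and B->C, each between -1 and +1
print(path_coefs(E, C))      # path C->E
```

Because everything is centered, no intercepts appear, and because everything is scaled, the coefficients are comparable across paths, which is what makes the diagram readable.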
## Path Analysis: an example

- Isaak and Hubert. 2001. "Production of