# Worksheet 5 Key 2012

Worksheet 5: Multiple Regression, Non-Linear Regression, Fixed and Random Factors

1) Multiple Regression
a. ii. These scatterplots below present our first chance to investigate whether or not we might
have any problems with collinearity between the independent variables. What we are looking
for, and hopefully will not have, are relationships between the independent variables such as
positive or negative correlations. In this case, none of the scatterplots indicate any relationships
between the variables (note the “shot-gun” patterns) except for the relationship between our
dependent variable and one of the independent variables (limpet_A vs. food). Of course it is
okay if there is a relationship between our dependent and one of the independent, and we may
expect such a relationship, since that is why we are analyzing the data with a multiple regression
in the first place! With this dataset we want to know whether or not the abundance of limpet_A
varies with any of the independent variables (food, other limpets, predators).
[Scatterplot matrix of FOOD, LIMPET_A, OTH_LIMPETS, TIDE_HT, and PREDS; all pairwise panels show "shot-gun" patterns except FOOD vs. LIMPET_A]

a. iii. Here are the results of the multiple regression analysis with the model
LIMPET_A=constant+FOOD+TIDE_HT+OTH_LIMPETS+PREDS
Condition Indices
1       2       3       4       5
1.00000 3.40232 4.40920 6.67319 25.54445

Dependent Variable          LIMPET_A
N                           19
Multiple R                  0.99989
Squared Multiple R          0.99979
Standard Error of Estimate 1.62951

Regression Coefficients B = (X'X)^-1 X'Y
Effect       Coefficient  Standard Error  Std. Coefficient  Tolerance  t          p-value
CONSTANT     -23.39511    3.31140          0.00000          .           -7.06503  0.00001
FOOD           1.00605    0.00483          0.94218          0.73872    208.35328  0.00000
TIDE_HT        0.97785    0.06544          0.06813          0.72654     14.94221  0.00000
OTH_LIMPETS   -1.05361    0.02831         -0.15449          0.87693    -37.22302  0.00000
PREDS         -0.07318    0.13684         -0.00221          0.88323     -0.53480  0.60118

Confidence Interval for Regression Coefficients
Effect       Coefficient  95.0% CI Lower  95.0% CI Upper  VIF
CONSTANT     -23.39511    -30.49735       -16.29288       .
FOOD           1.00605      0.99570         1.01641       1.35370
TIDE_HT        0.97785      0.83749         1.11821       1.37639
OTH_LIMPETS   -1.05361     -1.11432        -0.99291       1.14034
PREDS         -0.07318     -0.36669         0.22032       1.13220

Analysis of Variance
Source      Type III SS    df  Mean Squares  F-ratio       p-value
Regression  175,741.77291   4  43,935.44323  16,546.20680  0.00000
Residual         37.17445  14       2.65532

Durbin-Watson D Statistic 2.55602
First Order Autocorrelation -0.28273

Information Criteria
AIC             78.67214
AIC (Corrected) 85.67214
Schwarz's BIC 84.33877

[Plot of residuals against predicted values: RESIDUAL (-3 to 3) vs. ESTIMATE (0 to 400)]

a. iv. Assumptions:
1. Normality - should do p-plots for each of the variables in the model before you even run the
analysis
2. Homogeneity of variance - check the residual scatterplot for a "shot-gun" pattern in the
residuals of the dependent variable, not a "wedge" pattern
3. Independence of observations - is each observation of the dependent variable
independent? e.g. from randomly chosen plots
4. Linearity - if there are relationships between the dependent and any of the independent
variables, are these relationships linear?
5. Collinearity - three ways to check for collinearity:
a. scatterplot matrix - no correlations between the independent variables
b. condition indices - values <15 are fine; values between 15 and 30 are cause for concern
(check the tolerance values too), and values >30 are a definite problem (you may want to
exclude one of the collinear (redundant) factors from the model, or first run a principal
components analysis to reduce the number of independent variables in the model)
c. tolerance - values >0.20 indicate that collinearity is not a problem, but values <0.20
indicate that the model is not "tolerant" of the collinearity that factor introduces
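The tolerance and VIF columns in the regression output can be reproduced by hand: regress each independent variable on all of the others, then tolerance = 1 - R² of that regression and VIF = 1/tolerance. A minimal numpy sketch with synthetic predictors (the variables here are made up for illustration, not the limpet data):

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column j of X, regress column j on the remaining columns;
    tolerance_j = 1 - R^2 of that regression and VIF_j = 1 / tolerance_j."""
    n, k = X.shape
    results = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        tol = 1 - r2
        results.append((tol, 1 / tol))
    return results

# Synthetic predictors: x3 is nearly redundant with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = x1 + 0.05 * rng.normal(size=50)

for tol, vif in tolerance_and_vif(np.column_stack([x1, x2, x3])):
    print(f"tolerance = {tol:.3f}   VIF = {vif:.1f}")
```

Here x1 and x3 come out with tiny tolerances (and large VIFs), flagging exactly the kind of redundancy rule 5 warns about, while the independent x2 stays near tolerance 1.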

a. viii.
a. ix. Yes, they are different from the first scatterplots. They show the partial relationships: the part of the
variance in the dependent variable (y) that each independent variable (xi) explains after factoring out the
effects of the other independent variables.

a. x. The abundance of limpets = -23.395 + 1.006*(125) + 0.978*(55) - 1.054*(43) - 0.073*(5.3) = 110.44
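As a quick arithmetic check, the same prediction can be evaluated with the full-precision coefficients from the output above:

```python
# Coefficients taken from the regression output above
b0, b_food, b_tide, b_oth, b_preds = -23.39511, 1.00605, 0.97785, -1.05361, -0.07318

# Predicted abundance at FOOD=125, TIDE_HT=55, OTH_LIMPETS=43, PREDS=5.3
limpet_a = b0 + b_food * 125 + b_tide * 55 + b_oth * 43 + b_preds * 5.3
print(round(limpet_a, 2))  # 110.45
```

The small difference from the 110.44 above comes from rounding the coefficients to three decimals before multiplying.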

2) a) The scatterplot below shows that the relationship between species and area is non-linear.
[Scatterplot of SPECIES (30-80) vs. AREA (0-20,000), showing a non-linear relationship]

2) a. i)
Running the first model: Y = (a*X^b)/(c+X) - this model has 3 parameters that we are fitting: a, b, and c.

Dependent Variable: SPECIES

Sum of Squares and Mean Squares
Source         SS            df Mean Squares
Regression     237,992.98220 3 79,330.99407
Residual       1,381.01780 54 25.57440
Total          239,374.00000 57
Mean corrected 7,945.50877 56

R-squares

Raw R-square (1-Residual/Total)                : 0.99423
Mean Corrected R-square (1-Residual/Corrected) : 0.82619
R-square(Observed vs Predicted)                : 0.82639

Parameter Estimates
Parameter  Estimate   ASE       Parameter/ASE  Wald 95% CI Lower  Wald 95% CI Upper
A           49.75867   7.14859   6.96063       35.42661            64.09072
B            1.04288   0.01587  65.70585        1.01106             1.07470
C          106.09789  27.33661   3.88117       51.29129           160.90449
[Scatter plot of SPECIES (30-80) vs. AREA (0-20,000)]
Running the second model: Y = a*X^b - this model has 2 parameters that we are fitting: a and b.
Dependent Variable               :SPECIES

Sum of Squares and Mean Squares
Source         SS            df Mean Squares
Regression     237,335.15908 2 118,667.57954
Residual       2,038.84092 55 37.06983
Total          239,374.00000 57
Mean corrected 7,945.50877 56

R-squares

Raw R-square (1-Residual/Total)                : 0.99148
Mean Corrected R-square (1-Residual/Corrected) : 0.74340
R-square(Observed vs Predicted)                : 0.74387

Parameter Estimates
Parameter  Estimate  ASE      Parameter/ASE  Wald 95% CI Lower  Wald 95% CI Upper
A          27.05393  2.04930  13.20153       22.94703           31.16082
B           0.10882  0.00911  11.94865        0.09057            0.12708
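Because the two-parameter model Y = a*X^b linearizes as log Y = log a + b*log X, it can also be fit with an ordinary log-log regression. The solution differs slightly from the direct nonlinear fit, since linearizing assumes multiplicative rather than additive error. A numpy sketch on synthetic data generated to resemble the estimates above:

```python
import numpy as np

# Synthetic species-area data built around a = 27.05, b = 0.109
# (values borrowed from the parameter estimates above, noise level made up)
rng = np.random.default_rng(1)
area = rng.uniform(100, 20000, size=57)
species = 27.05 * area ** 0.109 * np.exp(rng.normal(0, 0.02, size=57))

# Linearize: log(species) = log(a) + b * log(area), then fit a line
b_hat, log_a_hat = np.polyfit(np.log(area), np.log(species), 1)
a_hat = np.exp(log_a_hat)
print(f"a = {a_hat:.2f}, b = {b_hat:.3f}")
```

With real data the two approaches give close but not identical answers, which is one reason the worksheet fits the power model directly by nonlinear least squares.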
ii. (two-term model)
[Scatterplot of SPECIES (30-80) vs. ESTIMATE (40-80)]
Dependent Variable          SPECIES
N                           57
Multiple R                  0.86248
Squared Multiple R          0.74387
Standard Error of Estimate 6.08288
Regression Coefficients B = (X'X)^-1 X'Y
Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance  t         p-value
CONSTANT  -1.66917     5.23606         0.00000           .          -0.31878  0.75110
ESTIMATE   1.02556     0.08114         0.86248           1.00000    12.63863  0.00000

Confidence Interval for Regression Coefficients
Effect    Coefficient  95.0% CI Lower  95.0% CI Upper  VIF
CONSTANT  -1.66917     -12.16246        8.82413        .
ESTIMATE   1.02556       0.86294        1.18818        1.00000

Analysis of Variance
Source      SS           df  Mean Squares  F-ratio    p-value
Regression  5,910.42803   1  5,910.42803   159.73496  0.00000
Residual    2,035.08074  55     37.00147

Durbin-Watson D Statistic 1.15039
First Order Autocorrelation 0.39325

Information Criteria
AIC             371.54764
AIC (Corrected) 372.00047
Schwarz's BIC 377.67680

(three-term model)
[Scatterplot of SPECIES (30-80) vs. ESTIMATE (30-80)]
Dependent Variable          SPECIES
N                           57
Multiple R                  0.90906
Squared Multiple R          0.82639
Standard Error of Estimate 5.00803

Regression Coefficients B = (X'X)^-1 X'Y
Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance  t         p-value
CONSTANT  0.99219      3.93310         0.00000           .          0.25227   0.80178
ESTIMATE  0.98486      0.06087         0.90906           1.00000    16.18028  0.00000

Confidence Interval for Regression Coefficients
Effect    Coefficient  95.0% CI Lower  95.0% CI Upper  VIF
CONSTANT  0.99219      -6.88993        8.87430         .
ESTIMATE  0.98486       0.86288        1.10685         1.00000

Analysis of Variance
Source      SS           df  Mean Squares  F-ratio    p-value
Regression  6,566.08703   1  6,566.08703   261.80158  0.00000
Residual    1,379.42174  55     25.08040

Durbin-Watson D Statistic 1.58886
First Order Autocorrelation 0.18436

Information Criteria
AIC             349.38199
AIC (Corrected) 349.83482
Schwarz's BIC 355.51114

iii. The three-term model looks best - it has the higher corrected R-square (0.826 vs. 0.743) and the lower AIC (349.4 vs. 371.5).

iv. Use an added-fit comparison: compare the improvement in fit gained from the extra parameter to the
improvement expected by chance given the change in the number of parameters.
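Concretely, the added-fit comparison in iv. is an extra-sum-of-squares F test: the drop in residual SS per extra parameter, divided by the residual mean square of the fuller model. Using the residual sums of squares reported above:

```python
# Residual SS and df from the two nonlinear fits reported above
ss_2, df_2 = 2038.84092, 55   # Y = a * X**b             (2 parameters)
ss_3, df_3 = 1381.01780, 54   # Y = (a * X**b) / (c + X) (3 parameters)

f_ratio = ((ss_2 - ss_3) / (df_2 - df_3)) / (ss_3 / df_3)
print(f"F(1, {df_3}) = {f_ratio:.2f}")  # F(1, 54) = 25.72
```

An F this large at (1, 54) df is far beyond any conventional critical value, agreeing with the AIC comparison (349.4 vs. 371.5) that the third parameter earns its keep.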

v. Comparing slopes is much easier with linear models. For example, let's assume that you had
another treatment - measurements taken after application of an antibiotic - and you want to see if the
species-area relationship varies as a function of whether the antibiotic has been applied. Here you
could plot the two linear functions (estimate vs. species, with and without the antibiotic) and
compare the slopes and intercepts.
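A sketch of that comparison on made-up data (the treatment, slopes, and noise level here are all hypothetical): fit a separate line to each group, then compare the fitted slopes.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)                              # e.g. log area
control = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=30)   # hypothetical: no antibiotic
treated = 1.4 * x + 1.0 + rng.normal(0, 0.5, size=30)   # hypothetical: with antibiotic

# One straight-line fit per group; slope is the first polyfit coefficient
slope_c, int_c = np.polyfit(x, control, 1)
slope_t, int_t = np.polyfit(x, treated, 1)
print(f"control slope {slope_c:.2f}, treated slope {slope_t:.2f}")
```

A formal test would instead fit a single model with a group term and a group-by-x interaction (an ANCOVA) and test whether the interaction coefficient differs from zero.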

3) a. Fixed – we want to know about these two fertilizer brands specifically, and do not want to
extrapolate our results beyond those two brands
b. Random – we are randomly choosing batches to compare, not interested in specific batches,
but rather want to assess variation among all batches
c. i. Any spatial variable will be fixed if you didn't choose the locations randomly, i.e. you are asking
a question about those specific locations and not trying to infer something about variation at a
larger scale. For example, if you are looking at three sites in the north and three in the south but
they are not chosen randomly due to logistics (i.e. the only sites that are accessible) or other
hypotheses you are testing, this would be a fixed effect. If, however, there were a number of
sites in the north and in the south and you then randomly chose 3 to sample both in the north and
in the south, you could extrapolate beyond those specific sites to make more general conclusions
about the north vs. the south – then the site variable would be considered random.
d. i. If you randomize your sampling effort in time, it can be considered a random effect. For
example, if you are doing pollinator observations but had a limited number of observations you
could make at the same time (i.e. only one population per day or week), you could randomize
when you observed each population over time during a specific time period (such as within a
season, when you do not expect your observations to vary because of any temporal effects such
as variation in conditions across seasons, storms, etc.) and then use time as a random effect.
