# Examples

```
Examples Week 9

1.   Examine the general patterns exhibited in the following charts and respond to
the questions that follow. (Do not count the number of observations shown in
each chart or scrutinize specific data points, just observe the general pattern.)
The independent variable is indicated by X, the dependent by Y, and the
residual by R.

[Charts 1, 2 and 3: scatter plots of Y against X.]
[Charts 4, 5 and 6: residual plots of R against X, with R = 0 marked.]

A.         Assume that a straight line has been fit to each of the raw data plots
shown in charts 1, 2 and 3. Make the best match of the most probable
residual plot associated with each regression. Each residual plot (charts
4, 5 and 6) is to be assigned only once.
Chart 1: Chart: __    Chart 2: Chart: __       Chart 3: Chart: __
B.         Which residual chart best illustrates heteroscedasticity?               Chart __
C.         The simple linear regression of which data set is most likely to have a
negative coefficient of correlation, r?                          Chart__
D.         Which data set is the best candidate for fitting a quadratic relationship?
Chart__
E.         Which data set is the best candidate for a non-linear model based on data
transformation?                                                  Chart__
2.   A study was conducted on how the yield of tomatoes (Y kilograms per
hectare) varied with the amount of fertilizer applied (X kilograms per hectare).
The amount of fertilizer applied was varied from none to 5000 kg/ha in this
experiment. The hypothesized model was: Y=β0+β1X+ε. A simple least-
squares fit of the data yielded the following regression program output:
              df         SS         MS      F   Signif F
Regression     1    7860430    7860430    109   1.07E-06
Residual      10     721080      72108
Total         11    8581510

             Coeff   Std Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   592.96      137.42     4.31   1.53E-03       286.8       899.2
Fertilizer   0.484       0.046    10.44   1.07E-06        0.38        0.59

A.   What is the expected yield of tomatoes if 5000 kilograms of fertilizer are
applied per hectare?                                              _________
B.   In order to test whether the regression is significant at the 0.05 level of
significance we can compare the F_test value with the F_table value of:
_________
C.   In the test of part B above it is decided to (circle one):
ACCEPT REJECT
the null hypothesis.
D.   Write out the null hypothesis of B above using the symbols of our text.
H0: ______________
E.   It is almost certain that the mean yield estimate calculated in part A
above is too high. Why?
__________________________________________________________
3.   Observations as listed below were collected on randomly selected individuals.
The relationship between hearing loss (Y, measured in decibels) and exposure
time (X, measured in weeks) to high noise levels was examined using simple
linear regression.

Y 11.9 12.6 12.7 13.2 13.8 14.0 14.1 14.6 14.7 14.8 15.1 15.3
X 12 19 31 43 47 56 74 75 116 160 164 178

            Coefficients   Standard Error   t Stat
Intercept          12.55            0.251    50.00
X                 0.0166          0.00254     6.55

A.   What is the expected level of hearing loss if an individual is exposed to
these high noise levels for a period of 52 weeks?
_____________

B.   For a test of H0: β1=0 against the alternative H1: β1≠0, at the α=0.05 level:

1.    The table value of the test statistic is:               _____________

2.    The decision is to: (Circle one.)           Reject H0       Accept H0

3.    Using this equation (the form of which was selected on an
empirical basis) to predict hearing loss for a 6 year exposure period
now seems quite justified (circle one).        Agree      Disagree
4.   Answer the following questions by either writing in the correct word or phrase
or circling the number of the correct answer.

A.   When contemplating the use of simple linear regression on a set of data
the typical first step in the analysis consists of ___________________
the data.
B.   When prior, well-established theory suggests a theoretical form for the
regression equation and this form is fit to the observed data we are using
a model selection approach referred to as being (circle one):
1. (rational or mechanistic)
2. (empirical or approximate)
3. (neither of these two)
C.   When the variance of the error term in linear regression is constant
across all values of the independent variable it is referred to as being
______________________
D.   When a very large sample is taken and a hypothesis about the value of
the population mean is to be tested using student's t it is important that
the observations (circle one):
1. (be taken from a normally distributed population)
2. (be independently distributed)
3. (meet both of these two listed conditions).

5.   Match each of the lettered terms below with the phrase most closely related to
it (each term is to be used only once).

A. Multiple linear regression
B. Confidence interval
C. Hypothesis testing
D. Simple linear regression
E. Correlation matrix
_____ involves mutually exclusive "states of nature"
_____ identifying collinearity between variables
_____ involves more than one independent variable
_____ linear in the coefficients
_____ requires information about the variance
6.   A regression has been developed to predict gasoline mileage for a car based
on its weight in pounds. Observed car weights run from 2000 to 4000 pounds
and mileage from 15 to 35 miles per gallon. The following output has been
obtained:
Regression Statistics
R Square             0.82
Standard Error       1.94
Observations           89

ANOVA
df          SS      MS           F     Signif F
Regression                1   1484.55 1484.55        396.4 3.77E-34
Residual                 87    325.86    3.75
Total                    88   1810.40

Coeff Std Error       t Stat   P-value
Intercept          47.89     1.25          38.2 3.48E-56
Weight           -0.0079 0.00039          -19.9 3.77E-34
Answer the following questions based on this output.
A.   For a car weighing 2500 pounds the predicted mileage is:     ________

B.   A point estimate for the variance of ε (epsilon) is:         ________
7.   Consider the following SLR plots that can be a part of the output when fitting
the equation Y=β0+β1X+ε to 3 different sets of data. (Each of these plots is to
be used once, and only once, in responding to the 3 questions that follow.)

A.   Which of these plots comes closest to suggesting that a quadratic
equation might best fit the data?                     _____________
B.   Which of these plots is most suggestive that a good model has already
been selected for the data?                           _____________
C.   Which of these plots comes closest to suggesting that a cubic equation
might best fit the data?                              _____________

8.   Answer the following questions about simple linear regression (circle or write in the answer).
A. From the formula [Σ e_i² / (n-2)]^½ a point estimate of what parameter may
be obtained (symbol)?                                    _____________
B. What three key assumptions are typically required about the nature of
the errors around the regression line?
___________________________________________________
___________________________________________________
___________________________________________________
C. When using SLR to approximate an unknown functional form we are
typically most interested in testing which of the following null
hypotheses:
(1) H0: β0=0      (2) H0: β1=0    (3) Equally interested in both
D. Simple linear regression can never be used when fitting any curvilinear
relationship between a dependent and an independent variable; i.e., it
must be a linear relationship between X and Y:                T     F
E. A negative coefficient of correlation suggests:
(1) β1>0          (2) heteroscedasticity    (3) Neither
9.   Answer the following questions based on the regression
output given below which has been obtained when fitting the
model: P = [β0+β1*A*M+ε]^-1 where P is the purchase price
(in thousands of dollars) paid when buying a Datsun Z car
that has an age A (in years) and has been driven a distance M
(in thousands of miles) at the time of sale. Assume that these
observations were randomly selected from all Datsun Z car
sales reported to the department of motor vehicles in the state
of California during a five year period.

Regression Statistics
Multiple R          0.96
R Square            0.91
Standard Error 0.0012
Observations          11
ANOVA
               df         SS          MS      F   Signif F
Regression      1   0.000133    0.000133   94.2   4.58E-06
Residual        9   1.27E-05   1.406E-06
Total          10  0.0001452

               Coeff   Std Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    0.00564    0.000780     7.22   4.95E-05      0.0039      0.0074
Age*Miles   2.19E-05    2.25E-06     9.71   4.58E-06    1.68E-05    2.69E-05
A.   How much of the total variability observed in 1/P is explained by the
hypothesized form of the relationship with A and M?        ________
B.   What is the best estimate, based on this regression, of the average
purchase price of Datsun Z cars when bought new?          ________
C.   What is (are) the critical value(s) for a t-test of H0:β1=0 against H1:β1≠0
when using a type-one error level of one percent?               ________
D.   When running the test in question C for this problem above why should
you be cautious about the result?
_________________________________________________________
10. This table and the 2 graphs below it contain SLR output based on data from a
study of Holstein-Friesian milk cows. Respond to the questions that follow by
referring to the table entries and the plotted information about the residuals.
Regression Statistics
Multiple R    0.98
R Square      0.96
Std Error   0.0420
Obs            14
ANOVA
df         SS         MS         F      Signif F
Regression      1       0.4602     0.4602     260.48   1.68E-09
Residual       12       0.0212     0.0018
Total          13       0.4814
             Coeff   Std Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    0.176      0.0464     3.78   2.60E-03      0.0745       0.277
X           0.0246      0.0015    16.14   1.68E-09      0.0213      0.0279
A.   Write out the linear equation relating the independent variable, milk
production (kg/day), to the dependent variable, milk protein (kg/day).
___________________________________________
B.   The R-square value of 0.9560 in the table indicates that 95.6% of the
total variability in __________________ is explained by this
______________ relationship with __________________.
C.   Assuming that a model of the form Y=β0+β1X+ε with ε~N(0,σ²) is
appropriate, and given that it has been fit as shown in the table above,
the best point estimate of the std.dev. of the pop. error around the line is:
_________
D.   Setting the type I error at 5% we (circle one)
can       cannot
reject H0: β1=0.0 in favor of H1: β1≠0.0
E.   The 95% CI for the population value for the slope of the line is:
_________
F.   What is the estimated average amount of milk protein (kg/day) to be
expected from cows producing 30 kg of milk per day?
_________
G.   The accuracy of the estimated average amount of milk protein calculated
for question F above is problematic due to extrapolation (circle one):
True      False
H.   The assumption of homoscedasticity in the residuals is justified because
of:
________________________________________________________
I.   The assumption of normality in the distribution of the residuals is
justified because of:
________________________________________________________
J.   In running the test on H0:=0.0 versus H1:≠0.0 with α=0.05 it is found
that:
t_test = ___________ and         t_table = ±__________
K.   In running the test on H0:β1 ≥ 0.03 versus H1:β1 < 0.03 with α=0.05 it is
found that:
t_test = ___________ and         t_table = ±__________
11. A simple linear regression model, Y = β0 + β1 X^½ + ε, with ε~N(0,σ²), is run
on a data set with the following results:

Regression Statistics
Multiple R    0.978
R Square      0.956
Std Error     0.679
Observations      10
ANOVA
             df      SS      MS        F   Signif F
Regression    1   80.33   80.33   174.47   1.03E-06
Residual      8    3.68    0.46
Total         9   84.01

            Coeff   Std Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   2.028      0.6908     2.94      0.019       0.435       3.621
SQRT(X)     4.877      0.3692    13.21   1.03E-06       4.026       5.729

Answer the following questions with regard to this model and the output
presented here.
A.   Give the point estimate for each of the following parameters:
1.    ρ ________              2.    σ    ________
3.    β1 ________             4.    β0   ________
B.   An estimate is needed for the value of the dependent variable when
the independent variable is 49.
1.   The best estimate given by this relationship is: ____________
2.   This estimate is suspect due to the fact that it is based on an
(one word)                   _________________________
C.   A test is run on: H0: β0 = 0 versus H1: β0 ≠ 0 with α=0.01. The
decision is to:                                 ____________
D.   A test is run on: H0: β1 ≤ 4.0 versus H1: β1 > 4.0 with α=0.05.
1.   The value of the test statistic is:        _______________
2.   The table value is:                        _______________
E.   The following assumptions are justified by what evidence given here?
1.   Homoscedasticity
________________________________________________
2.   Normality
________________________________________________
12. It is suspected that the protein content (percentage) of wheat varies with the
yield (bushels per acre). Data are analyzed with the following results when
SLR is run using the standard model: Protein = β0 + β1 * Yield + ε, with ε ~
Normal(0,σ²).
Multiple R 0.781
R Square 0.611
Std Error 1.420
Observations       19
ANOVA        df      SS      MS       F   Signif F
Regression    1   53.76   53.76   26.66   7.80E-05
Residual     17   34.29    2.02
Total        18   88.05

              Coeff   Std Error    t Stat    P-value   Lower 90.0%   Upper 90.0%
Intercept   16.0536      0.7542   21.2868   1.08E-13         14.74         17.37
Yield       -0.1585      0.0307   -5.1629   7.80E-05         -0.21         -0.11

A.   What is the expected percentage protein content when the yield per acre
is 30 bushels?                                                  _______
B.   What proportion of the total variation in protein content is explained by
the linear relationship with yield?                             _______
C.   What is the point estimate for σ?                               _______
D.   The p-value of 7.80*10^-5 clearly supports rejecting what standard SLR
null? (Give formal null statement.)                             _______
E.   It could be surmised that a nonlinear equation form; e.g., a quadratic,
might provide a good fit to these data. This conjecture is supported by
the: ____________________________________________________
13. Houseflies are known to generally emerge faster at higher temperatures.
Observations on number of days, Y, to emergence at different temperatures,
X, have been obtained and simple linear regression (SLR) has been run on two
different forms, Form I: Y=β0+β1X+ε and Form II: Y^-1=β0+β1X+ε. All
important factors other than temperature have been either controlled or
randomized in the collection of these data.

Form I
R Square          0.876
Standard Error    2.245
Observations         36
ANOVA
df       SS       MS       F     Signif F
Regression            1     1208.   1208.     239. 5.74E-17
Residual             34      171.      5.
Total                35     1379.
Coeff Std Error    t Stat P-value Lower 95% Upper 95%
Intercept         71.24     3.658      19.5 5.13E-20    63.81     78.67
Temp             -0.732   0.0473      -15.5 5.74E-17   -0.828    -0.636
Form II
R Square         0.934
Standard Error 0.0071
Observations        36
ANOVA
df       SS      MS     F     Signif F
Regression           1 0.02431 0.0243     483. 1.13E-21
Residual            34 0.001712 5. E-05
Total               35 0.02602
Coeff Std Error t Stat P-value Lower 95% Upper 95%
Intercept       -0.175 0.01156 -15.1 1.15E-16     -0.198    -0.151
Temp          0.00328 0.0001494 22.0 1.13E-21 0.00298 0.00359
Answer the following questions based on the results given here.
A.    A prediction of the emergence time (in days) is to be made for the
case where temperature is held at 75°F.
1.    The predicted time using Form I is (give answer to 1 decimal
place):                                   ________________
2.    The predicted time using Form II is (give answer to 1
decimal place):                           ________________
B.    There is clear evidence that what assumption(s) of SLR has (have)
been violated by Form II?
_____________________                _____________________
C.    One of the methods we have examined for choosing a form for the
equation to be fit to the data is that of polynomial approximation to
an unknown functional relationship between dependent and
independent variable(s). Considering the SLR results for Form I it
would appear worthwhile to try the form: ________________
D.    The proportion of the total variance in the inverse of the observed
emergence times that has been "explained" by the
__________________________ with temperature is _____.
E.    Assume now that all necessary assumptions have been met to test
the null hypothesis that β1 ≤ 0.003 against the alternative that β1 >
0.003 for Form II. Use the given numerical values and a 5% level
of significance.
1.    The value of the test statistic is:       ________________
2.    The table value is:                       ________________
14. A test was conducted in which five different groups of rats were maintained
on a diet deficient in Vitamin A and then given supplementary rations of
Vitamin A in the form of cod-liver oil. The dosage of cod-liver oil (D) given
to each group of rats and the corresponding mean weight gains (Wt) are
plotted in the graph. A simple linear regression (SLR) model, Wt=β0+β1D+ε,
is run first with the results shown in the table immediately below.

Regression Statistics
Multiple R 0.9135
R Square 0.8344
Standard Error 10.6602
Observations       5
ANOVA
df             SS        MS       F    Significance F
Regression       1          1718.15    1718.15 15.12      3.02E-02
Residual       3           340.92     113.64
Total       4          2059.07
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept    1.50           6.70       0.22 8.37E-01     -19.83     22.83
Dosage (mg)     7.18           1.85       3.89 3.02E-02      1.30      13.06

A second model, Wt=β0+β1Loge{D}+ε, is then run with the results shown in
this second table. Respond to the questions that follow with regard to these
two models. (Note: Loge{D} and LN{D} both represent the natural logarithm
of D.)
Regression Statistics
Multiple R 0.9948
R Square 0.9897
Standard Error 2.6632
Observations       5
ANOVA
df             SS        MS       F    Significance F
Regression       1          2037.79 2037.79 287.31        4.47E-04
Residual       3           21.28       7.09
Total      4          2059.07
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 12.76             1.26      10.12 2.06E-03       8.75     16.78
LN{Dosage} 18.09               1.07      16.95 4.47E-04      14.70     21.49
A.   Both of these models are justified at α=0.05 based on significance test
results shown in the two tables. Given only the evidence presented in
these two tables (not the graph) which model would you select?
___________________________
Give two reasons why you selected this particular model over the other
model based on the regression results listed in the two tables. (Ignore the
significance test results and the graph in responding to this question.)
1.   ____________________________________
2.   ____________________________________
B.   What would be the estimated weight gain in grams for dosage of 4.0
mg?
1.   Using the first model it is:                         _______
2.   Using the second it is:                              _______
C.   Address the following questions using the information provided in the
residual and QQ plots shown here below for the two models.

1.    Given only the evidence presented in these four graphs which
model would you select?
___________________________
2.    Give two reasons why you selected this particular model over
the other model given here based only on the evidence
provided in these four graphs.
a.    ______________________________
b.    ______________________________
15. Birch tree seedlings grown in peat-filled containers were outplanted in August
in a dry sandy soil. After four weeks the planted seedlings were examined for
the level of mortality. It was observed that the percentage mortality (PM) was
dependent upon the percentage water content (PWC) of the container peat at the
time of planting. This relationship was plotted and a least-squares fit of
PM=β0+β1PWC+ε was made. The observed values are shown on the chart along with
the regression equation and its R² value.

A.   Calculate the PM when the PWC is 45%:                         _________
B.   What is the correlation coefficient value?                    _________
C.   Very roughly plot the residuals on the chart provided here below.

D.   What evidence is there for, or against, using a higher power (greater than
the first) polynomial for the regression equation?
___________________________________________________
___________________________________________________
E.   The procedure being used here in deciding upon the "form" of the
relationship between PM and PWC is known as (circle one):
(a)   Mechanistic           (b)   Approximation

```
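Several of the questions above (for example 6A, 11B, 13A, 14B and 11D) reduce to two small calculations: evaluating the fitted line at a given X, possibly after transforming X or back-transforming the response, and forming a t statistic against a hypothesized coefficient value. The following is a minimal Python sketch of those two calculations, offered as a study aid rather than as part of the original sheet. The function names `predict`, `t_statistic` and `reciprocal` are illustrative assumptions; the numbers are the coefficients printed in the regression tables.

```python
import math


def predict(intercept, slope, x, transform=None, inverse=None):
    """Point prediction from a fitted simple linear regression.

    transform: optional function applied to x before evaluating the line
               (e.g. math.sqrt for problem 11, math.log for problem 14).
    inverse:   optional function mapping the fitted value back to the
               original response scale (e.g. a reciprocal when 1/Y was fit).
    """
    x_used = transform(x) if transform else x
    fitted = intercept + slope * x_used
    return inverse(fitted) if inverse else fitted


def t_statistic(estimate, hypothesized, std_error):
    """t statistic for a null hypothesis about a single coefficient."""
    return (estimate - hypothesized) / std_error


def reciprocal(value):
    return 1.0 / value


# Problem 6A: predicted mileage for a 2500 lb car.
mileage = predict(47.89, -0.0079, 2500)
# Problem 11B: fitted Y at X = 49 under the square-root model.
y_sqrt = predict(2.028, 4.877, 49, transform=math.sqrt)
# Problem 13A: emergence time at 75 degrees F, Form I and Form II.
days_form1 = predict(71.24, -0.732, 75)
days_form2 = predict(-0.175, 0.00328, 75, inverse=reciprocal)
# Problem 14B: weight gain at 4.0 mg under the natural-log model.
gain_log = predict(12.76, 18.09, 4.0, transform=math.log)
# Problem 11D: test statistic for H0: beta1 <= 4.0.
t_11d = t_statistic(4.877, 4.0, 0.3692)

print("6A mileage:", round(mileage, 2))
print("11B fitted Y:", round(y_sqrt, 3))
print("13A days, Form I / Form II:", round(days_form1, 1), round(days_form2, 1))
print("14B gain, log model:", round(gain_log, 2))
print("11D t statistic:", round(t_11d, 2))
```

The sketch covers only point estimates and test statistics. Table values (critical t or F) still have to be looked up with the residual degrees of freedom from the relevant ANOVA table, and a prediction outside the observed X range remains an extrapolation whatever the arithmetic says.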