Examples

Examples Week 9

1.   Examine the general patterns exhibited in the following charts and respond to
     the questions that follow. (Do not count the number of observations shown in
     each chart or scrutinize specific data points, just observe the general pattern.)
     The independent variable is indicated by X, the dependent by Y, and the
     residual by R.

     [Charts 1, 2 and 3: scatter plots of Y against X. Charts 4, 5 and 6:
     residual plots of R against X, each centered on R = 0.]

     A.         Assume that a straight line has been fit to each of the raw data plots
                shown in charts 1, 2 and 3. Make the best match of the most probable
                residual plot associated with each regression. Each residual plot (charts
                4, 5 and 6) is to be assigned only once.
                       Chart 1: Chart: __    Chart 2: Chart: __       Chart 3: Chart: __
     B.         Which residual chart best illustrates heteroscedasticity?               Chart __
     C.         The simple linear regression of which data set is most likely to have a
                negative coefficient of correlation, r?                          Chart__
     D.         Which data set is the best candidate for fitting a quadratic relationship?
                                                                                 Chart__
     E.         Which data set is the best candidate for a non-linear model based on data
                transformation?                                                  Chart__
2.   A study was conducted on how the yield of tomatoes (Y kilograms per
     hectare) varied with the amount of fertilizer applied (X kilograms per hectare).
     The amount of fertilizer applied was varied from none to 5000 kg/ha in this
     experiment. The hypothesized model was: Y=β0+β1X+ε. A simple least-
     squares fit of the data yielded the following regression program output:
                   df        SS         MS        F     Signif F
     Regression     1     7860430    7860430     109    1.07E-06
     Residual      10      721080      72108
     Total         11     8581510

                  Coeff   Std Error   t Stat   P-value    Lower 95%   Upper 95%
     Intercept   592.96    137.42      4.31    1.53E-03     286.8       899.2
     Fertilizer   0.484     0.046     10.44    1.07E-06      0.38        0.59




     A.   What is the expected yield of tomatoes if 5000 kilograms of fertilizer are
          applied per hectare?                                              _________
     B.   In order to test whether the regression is significant at the 0.05 level of
          significance we can compare the F_test value with the F_table value of:
                                                                            _________
     C.   In the test of part B above it is decided to (circle one):
                                  ACCEPT REJECT
          the null hypothesis.
     D.   Write out the null hypothesis of B above using the symbols of our text.
                                                                H0: ______________
     E.   It is almost certain that the mean yield estimate calculated in part A
          above is too high. Why?
          __________________________________________________________
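The arithmetic behind parts A and B can be checked with a short script. This is only a sketch: the variable and function names are illustrative, the numbers are copied from the regression output above, and the critical F value still has to be read from a table.

```python
# Values copied from the regression output above (names are illustrative).
b0, b1 = 592.96, 0.484       # intercept and fertilizer slope
ms_regression = 7860430      # mean square for regression (df = 1)
ms_residual = 72108          # mean square for residual (df = 10)


def predicted_yield(x_kg_per_ha):
    """Point estimate of mean yield (kg/ha) from the fitted straight line."""
    return b0 + b1 * x_kg_per_ha


# The F statistic is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual

print(round(predicted_yield(5000), 2))
print(round(f_stat, 1))
```

Note that 5000 kg/ha sits at the upper edge of the range of fertilizer amounts used in the experiment.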
3.   Observations as listed below were collected on randomly selected individuals.
     The relationship between hearing loss (Y, measured in decibels) and exposure
     time (X, measured in weeks) to high noise levels was examined using simple
     linear regression.

     Y   11.9  12.6  12.7  13.2  13.8  14.0  14.1  14.6  14.7  14.8  15.1  15.3
     X     12    19    31    43    47    56    74    75   116   160   164   178

                   Coefficients   Standard Error   t Stat
     Intercept        12.55           0.251         50.00
     X                0.0166          0.00254        6.55

     A.   What is the expected level of hearing loss if an individual is exposed to
          these high noise levels for a period of 52 weeks?
                                                                   _____________

     B.   For a test of H0: β1=0 against the alternative H1: β1≠0, at the α=0.05
          level answer the following:

          1.    The table value of the test statistic is:               _____________

          2.    The decision is to: (Circle one.)           Reject H0       Accept H0

          3.    Using this equation (the form of which was selected on an
                empirical basis) to predict hearing loss for a 6 year exposure period
                now seems quite justified (circle one).        Agree      Disagree
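Because the raw observations are listed, the reported coefficients can be reproduced directly. The sketch below refits the least-squares line from the sums of squares and adds a hypothetical helper, within_observed_range, that flags whether a requested exposure time lies inside the observed X values. All names are illustrative and not part of the original output.

```python
# Data copied from the table above.
y = [11.9, 12.6, 12.7, 13.2, 13.8, 14.0, 14.1, 14.6, 14.7, 14.8, 15.1, 15.3]
x = [12, 19, 31, 43, 47, 56, 74, 75, 116, 160, 164, 178]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept from the usual sums of squares.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(weeks):
    """Fitted hearing loss (dB) for a given exposure time in weeks."""
    return intercept + slope * weeks


def within_observed_range(weeks):
    """True when the request is an interpolation rather than an extrapolation."""
    return min(x) <= weeks <= max(x)


print(round(intercept, 2), round(slope, 4))
print(round(predict(52), 2), within_observed_range(52))
print(within_observed_range(6 * 52))  # six years expressed in weeks
```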
4.   Answer the following questions by either writing in the correct word or phrase
     or circling the number of the correct answer.

     A.   When contemplating the use of simple linear regression on a set of data
          the typical first step in the analysis consists of ___________________
          the data.
     B.   When prior, well-established theory suggests a theoretical form for the
          regression equation and this form is fit to the observed data we are using
          a model selection approach referred to as being (circle one):
                1. (rational or mechanistic)
                2. (empirical or approximate)
                3. (neither of these two)
     C.   When the variance of the error term in linear regression is constant
          across all values of the independent variable it is referred to as being
                                                        ______________________
     D.   When a very large sample is taken and a hypothesis about the value of
          the population mean is to be tested using Student's t it is important that
          the observations (circle one):
          1. (be taken from a normally distributed population)
          2. (be independently distributed)
          3. (meet both of these two listed conditions).

5.   Match each of the lettered terms below with the phrase most closely related to
     it (each term is to be used only once).

     A. Multiple linear regression
     B. Confidence interval
     C. Hypothesis testing
     D. Simple linear regression
     E. Correlation matrix
     _____ involves mutually exclusive "states of nature"
     _____ identifying collinearity between variables
     _____ involves more than one independent variable
     _____ linear in the coefficients
     _____ requires information about the variance
6.   A regression has been developed to predict gasoline mileage for a car based
     on its weight in pounds. Observed car weights run from 2000 to 4000 pounds
     and mileage from 15 to 35 miles per gallon. The following output has been
     obtained:
         Regression Statistics
       R Square             0.82
       Standard Error       1.94
       Observations           89

       ANOVA
                           df          SS      MS           F     Signif F
       Regression                1   1484.55 1484.55        396.4 3.77E-34
       Residual                 87    325.86    3.75
       Total                    88   1810.40

                     Coeff    Std Error    t Stat    P-value
       Intercept     47.89      1.25        38.2     3.48E-56
       Weight      -0.0079      0.00039    -19.9     3.77E-34
     Answer the following questions based on this output.
     A.   For a car weighing 2500 pounds the predicted mileage is:     ________

     B.   A point estimate for the variance of ε (epsilon) is:         ________
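Both quantities asked for here come straight from the printed output. As a rough sketch (names are illustrative, numbers copied from the table above), the residual mean square serves as the usual point estimate of the error variance, and it should be close to the square of the reported standard error.

```python
# Values copied from the output above (names are illustrative).
b0, b1 = 47.89, -0.0079
ss_residual, df_residual = 325.86, 87
standard_error = 1.94


def predicted_mpg(weight_lb):
    """Fitted mileage (miles per gallon) for a car of the given weight."""
    return b0 + b1 * weight_lb


# Residual mean square: the usual point estimate of the error variance.
mse = ss_residual / df_residual

print(round(predicted_mpg(2500), 2))
print(round(mse, 2), round(standard_error ** 2, 2))
```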
7.   Consider the following SLR plots that can be a part of the output when fitting
     the equation Y=β0+β1X+ε to 3 different sets of data. (Each of these plots is to
     be used once, and only once, in responding to the 3 questions that follow.)




     A.   Which of these plots comes closest to suggesting that a quadratic
          equation might best fit the data?                     _____________
     B.   Which of these plots is most suggestive that a good model has already
          been selected for the data?                           _____________
     C.   Which of these plots comes closest to suggesting that a cubic equation
          might best fit the data?                              _____________

8.   Answer the following questions about simple linear regression (circle or write
     in the answer as required):
     A. From the formula [Σe_i²/(n-2)]^½ a point estimate of what parameter may
           be obtained (symbol)?                                    _____________
     B. What three key assumptions are typically required about the nature of
           the errors around the regression line?
                 ___________________________________________________
                 ___________________________________________________
                 ___________________________________________________
     C. When using SLR to approximate an unknown functional form we are
           typically most interested in testing which of the following null
           hypotheses:
                 (1) H0: β0=0      (2) H0: β1=0    (3) Equally interested in both
     D. Simple linear regression can never be used when fitting any curvilinear
           relationship between a dependent and an independent variable; i.e., it
           must be a linear relationship between X and Y:                T     F
     E. A negative coefficient of correlation suggests:
                 (1) β1>0          (2) heteroscedasticity    (3) Neither
9.   Answer the following questions about the linear regression
     output given below which has been obtained when fitting the
     model: P = [β0+β1*A*M+ε]^-1 where P is the purchase price
     (in thousands of dollars) paid when buying a Datzun Z car
     that has an age A (in years) and has been driven a distance M
     (in thousands of miles) at the time of sale. Assume that these
     observations were randomly selected from all Datzun Z car
     sales reported to the department of motor vehicles in the state
     of California during a five year period.

     Regression Statistics
     Multiple R        0.96
     R Square          0.91
     Standard Error    0.0012
     Observations      11

     ANOVA
                    df       SS          MS          F      Signif F
     Regression      1    0.000133    0.000133     94.2     4.58E-06
     Residual        9    1.27E-05    1.406E-06
     Total          10    0.0001452

                    Coeff      Std Error   t Stat   P-value    Lower 95%   Upper 95%
     Intercept     0.00564     0.000780     7.22    4.95E-05    0.0039      0.0074
     Age*Miles     2.19E-05    2.25E-06     9.71    4.58E-06    1.68E-05    2.69E-05
     A.   How much of the total variability observed in 1/P is explained by the
          hypothesized form of the relationship with A and M?        ________
     B.   What is the best estimate, based on this regression, of the average
          purchase price of Datzun Z cars when bought new?          ________
     C.   What is (are) the critical value(s) for a t-test of H0:β1=0 against H1:β1≠0
          when using a type-one error level of one percent?               ________
     D.   When running the test in question C for this problem above why should
          you be cautious about the result?
          _________________________________________________________
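Since the regression was run on 1/P, any price estimate has to be back-transformed. The sketch below ignores the error term and treats a new car as having A = 0 and M = 0, which is an assumption about how part B is meant to be read. Function and variable names are illustrative.

```python
# Coefficients copied from the output above; the response variable is 1/P.
b0, b1 = 0.00564, 2.19e-05
ss_regression, ss_total = 0.000133, 0.0001452


def predicted_price(age_years, miles_thousands):
    """Back-transform the fitted 1/P into a price in thousands of dollars."""
    inverse_price = b0 + b1 * age_years * miles_thousands
    return 1.0 / inverse_price


# Share of the variability in 1/P accounted for by the fitted line,
# computed from the rounded sums of squares shown in the ANOVA table.
r_square = ss_regression / ss_total

print(round(predicted_price(0, 0), 1))
print(round(r_square, 3))
```

The ratio is computed from rounded table entries, so it need not match the printed R Square to the last digit.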
10. This table and the 2 graphs below it contain SLR output based on data from a
    study of Holstein-Friesian milk cows. Respond to the questions that follow by
    referring to the table entries and the plotted information about the residuals.
     Regression Statistics
     Multiple R    0.98
     R Square      0.96
     Std Error   0.0420
     Obs            14
     ANOVA
                    df         SS         MS         F      Signif F
     Regression      1       0.4602     0.4602     260.48   1.68E-09
     Residual       12       0.0212     0.0018
     Total          13       0.4814
                    Coeff    Std Error   t Stat    P-value    Lower 95%   Upper 95%
     Intercept      0.176     0.0464      3.78     2.60E-03    0.0745      0.277
     X              0.0246    0.0015     16.14     1.68E-09    0.0213      0.0279
A.   Write out the linear equation relating the independent variable, milk
     production (kg/day), to the dependent variable, milk protein (kg/day).
            ___________________________________________
B.   The R-square value of 0.9560 in the table indicates that 95.6% of the
     total variability in __________________ is explained by this
     ______________ relationship with __________________.
C.   Assuming that a model of the form Y=β0+β1X+ε with ε~N(0,σ²) is
     appropriate, and given that it has been fit as shown in the table above,
     the best point estimate of the std.dev. of the pop. error around the line is:
                                                                       _________
D.   Setting the type I error at 5% we (circle one)
                                                       can       cannot
     reject H0: β1 = 0.0 in favor of H1: β1 ≠ 0.0
E.   The 95% CI for the population value for the slope of the line is:
                                                                       _________
F.   What is the estimated average amount of milk protein (kg/day) to be
     expected from cows producing 30 kg of milk per day?
                                                                       _________
G.   The accuracy of the estimated average amount of milk protein calculated
     for question F above is problematic due to extrapolation (circle one):
                                                       True      False
H.   The assumption of homoscedasticity in the residuals is justified because
     of:
     ________________________________________________________
I.   The assumption of normality in the distribution of the residuals is
     justified because of:
     ________________________________________________________
J.   In running the test on H0:=0.0 versus H1:≠0.0 with α=0.05 it is found
     that:
                 t_test = ___________ and         t_table = ±__________
K.   In running the test on H0:β1 ≥ 0.03 versus H1:β1 < 0.03 with α=0.05 it is
     found that:
                 t_test = ___________ and         t_table = ±__________
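When the hypothesized slope is not zero, as in part K, the printed t Stat column no longer applies and the statistic has to be recomputed. The following is a minimal sketch with illustrative names, using numbers copied from the table above. It also shows the residual standard deviation obtained from the ANOVA entries.

```python
import math

# Values copied from the output above (names are illustrative).
b0, b1 = 0.176, 0.0246
se_b1 = 0.0015
ss_residual, df_residual = 0.0212, 12


def predicted_protein(milk_kg_per_day):
    """Fitted milk protein (kg/day) for a given milk production."""
    return b0 + b1 * milk_kg_per_day


def t_statistic(estimate, hypothesized, std_error):
    """t statistic for testing a coefficient against any hypothesized value."""
    return (estimate - hypothesized) / std_error


# Residual standard deviation: square root of the residual mean square.
s = math.sqrt(ss_residual / df_residual)

print(round(predicted_protein(30), 3))
print(round(t_statistic(b1, 0.03, se_b1), 2))
print(round(s, 4))
```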
11. A simple linear regression model, Y = β0 + β1 X^½ + ε, with ε~N(0,σ²), is run
    on a data set with the following results:




         Regression Statistics
         Multiple R    0.978
         R Square      0.956
         Std Error     0.679
         Observations      10
         ANOVA
                       df      SS       MS        F       Signif F
         Regression     1     80.33    80.33    174.47    1.03E-06
         Residual       8      3.68     0.46
         Total          9     84.01

                       Coeff    Std Error   t Stat    P-value    Lower 95%   Upper 95%
         Intercept     2.028     0.6908      2.94     0.019       0.435       3.621
         SQRT(X)       4.877     0.3692     13.21     1.03E-06    4.026       5.729



     Answer the following questions with regard to this model and the output
     presented here.
        A.   Give the point estimate for each of the following parameters:
             1.    ρ ________              2.    σ    ________
             3.    β1 ________             4.    β0   ________
B.   An estimate is needed for the value of the dependent variable when
     the independent variable is 49.
     1.   The best estimate given by this relationship is: ____________
     2.   This estimate is suspect due to the fact that it is based on an
          (one word)                   _________________________
C.   A test is run on: H0: β0 = 0 versus H1: β0 ≠ 0 with α=0.01. The
     decision is to:                                 ____________
D.   A test is run on: H0: β1 ≤ 4.0 versus H1: β1 > 4.0 with α=0.05.
     1.   The value of the test statistic is:        _______________
     2.   The table value is:                        _______________
E.   The following assumptions are justified by what evidence given here
     (be specific in your interpretation)?
     1.   Homoscedasticity
           ________________________________________________
     2.   Normality
           ________________________________________________
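Because the predictor in this model is the square root of X, the independent variable must be transformed before the fitted line is applied. The sketch below uses illustrative names and numbers copied from the output above. The sample correlation is taken here as the square root of R Square carrying the sign of the slope, and it describes the relationship between Y and SQRT(X).

```python
import math

# Values copied from the output above (names are illustrative).
b0, b1 = 2.028, 4.877
se_b1 = 0.3692
r_square = 0.956


def predict(x):
    """Fitted Y: the line is applied to sqrt(X), not to X itself."""
    return b0 + b1 * math.sqrt(x)


# Correlation between Y and sqrt(X): magnitude from R Square, sign from slope.
r = math.copysign(math.sqrt(r_square), b1)

# Test statistic for H0: beta1 <= 4.0 against H1: beta1 > 4.0.
t_stat = (b1 - 4.0) / se_b1

print(round(predict(49), 3))
print(round(r, 3))
print(round(t_stat, 3))
```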
12. It is suspected that the protein content (percentage) of wheat varies with the
    yield (bushels per acre). Data are analyzed with the following results when
    SLR is run using the standard model: Protein = β0 + β1 * Yield + ε, with ε ~
    Normal(0,σ²).
    Multiple R      0.781
    R Square        0.611
    Std Error       1.420
    Observations    19
    ANOVA
                  df      SS       MS       F      Signif F
    Regression     1     53.76    53.76    26.66   7.80E-05
    Residual      17     34.29     2.02
    Total         18     88.05

                  Coeff     Std Error    t Stat    P-value    Lower 90.0%   Upper 90.0%
    Intercept    16.0536     0.7542     21.2868    1.08E-13      14.74         17.37
    Yield        -0.1585     0.0307     -5.1629    7.80E-05      -0.21         -0.11




     A.   What is the expected percentage protein content when the yield per acre
          is 30 bushels?                                                  _______
     B.   What proportion of the total variation in protein content is explained by
          the linear relationship with yield?                             _______
     C.   What is the point estimate for σ?                               _______
     D.   The p-value of 7.80*10^-5 clearly supports rejecting what standard SLR
          null? (Give formal null statement.)                             _______
     E.   It could be surmised that a nonlinear equation form; e.g., a quadratic,
          might provide a good fit to these data. This conjecture is supported by
          the: ____________________________________________________
13. Houseflies are known to generally emerge faster at higher temperatures.
    Observations on number of days, Y, to emergence at different temperatures,
    X, have been obtained and simple linear regression (SLR) has been run on two
    different forms, Form I: Y=β0+β1X+ε and Form II: Y^-1=β0+β1X+ε. All
    important factors other than temperature have been either controlled or
    randomized in the collection of these data.

    Form I
    R Square          0.876
    Standard Error    2.245
    Observations         36
    ANOVA
                      df       SS       MS       F     Signif F
    Regression            1     1208.   1208.     239. 5.74E-17
    Residual             34      171.      5.
    Total                35     1379.
                     Coeff Std Error    t Stat P-value Lower 95% Upper 95%
    Intercept         71.24     3.658      19.5 5.13E-20    63.81     78.67
    Temp             -0.732   0.0473      -15.5 5.74E-17   -0.828    -0.636
    Form II
    R Square         0.934
    Standard Error 0.0071
    Observations        36
    ANOVA
                      df       SS          MS        F      Signif F
    Regression         1    0.02431     0.0243     483.    1.13E-21
    Residual          34    0.001712    5.E-05
    Total             35    0.02602

                    Coeff     Std Error    t Stat    P-value    Lower 95%   Upper 95%
    Intercept      -0.175     0.01156      -15.1     1.15E-16    -0.198      -0.151
    Temp            0.00328   0.0001494     22.0     1.13E-21     0.00298     0.00359
Answer the following questions based on the results given here.
     A.    A prediction of the emergence time (in days) is to be made for the
           case where temperature is held at 75ºF.
           1.    The predicted time using Form I is (give answer to 1 decimal
                 place):                                   ________________
           2.    The predicted time using Form II is (give answer to 1
                 decimal place):                           ________________
     B.    There is clear evidence that what assumption(s) of SLR has (have)
           been violated by Form II?
                 _____________________                _____________________
     C.    One of the methods we have examined for choosing a form for the
           equation to be fit to the data is that of polynomial approximation to
           an unknown functional relationship between dependent and
           independent variable(s). Considering the SLR results for Form I it
           would appear worthwhile to try the form: ________________
     D.    The proportion of the total variance in the inverse of the observed
           emergence times that has been "explained" by the
           __________________________ with temperature is _____.
     E.    Assume now that all necessary assumptions have been met to test
           the null hypothesis that β1 ≤ 0.003 against the alternative that β1 >
           0.003 for Form II. Use the given numerical values and a 5% level
           of significance.
           1.    The value of the test statistic is:       ________________
           2.    The table value is:                       ________________
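The two forms predict on different scales: Form I gives days directly, while Form II gives the inverse of days and must be inverted. A rough sketch of both predictions and of the part E test statistic follows. Names are illustrative and the numbers are copied from the two outputs above.

```python
# Coefficients copied from the two outputs above (names are illustrative).
form1 = (71.24, -0.732)      # Y regressed on temperature
form2 = (-0.175, 0.00328)    # 1/Y regressed on temperature
se_b1_form2 = 0.0001494


def days_form1(temp_f):
    """Predicted days to emergence from the straight-line fit."""
    b0, b1 = form1
    return b0 + b1 * temp_f


def days_form2(temp_f):
    """Predicted days to emergence: invert the fitted value of 1/Y."""
    b0, b1 = form2
    return 1.0 / (b0 + b1 * temp_f)


# Test statistic for H0: beta1 <= 0.003 against H1: beta1 > 0.003 (Form II).
t_stat = (form2[1] - 0.003) / se_b1_form2

print(round(days_form1(75), 1))
print(round(days_form2(75), 1))
print(round(t_stat, 3))
```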
14. A test was conducted in which five different groups of rats were maintained
    on a diet deficient in Vitamin A and then given
    supplementary rations of Vitamin A in the form of
    cod-liver oil. The dosage of cod-liver oil (D) given
    to each group of rats and the corresponding mean
    weight gains (Wt) are plotted in the graph. A simple
    linear regression (SLR) model, Wt=β0+β1D+ε, is
    run first with the results shown in the table
    immediately below.

           Regression Statistics
            Multiple R 0.9135
             R Square 0.8344
        Standard Error 10.6602
         Observations       5
        ANOVA
                           df             SS        MS       F    Significance F
           Regression       1          1718.15    1718.15 15.12      3.02E-02
             Residual       3           340.92     113.64
                Total       4          2059.07
                       Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
             Intercept    1.50           6.70       0.22 8.37E-01     -19.83     22.83
          Dosage (mg)     7.18           1.85       3.89 3.02E-02      1.30      13.06

     A second model, Wt=β0+β1Log_e{D}+ε, is then run with the results shown in
     this second table. Respond to the questions that follow with regard to these
     two models. (Note: Log_e{D} and LN{D} both represent the natural logarithm
     of D.)
             Regression Statistics
              Multiple R 0.9948
               R Square 0.9897
          Standard Error 2.6632
           Observations       5
          ANOVA
                            df        SS         MS        F     Significance F
            Regression       1     2037.79    2037.79   287.31      4.47E-04
              Residual       3       21.28       7.09
                 Total       4     2059.07
                        Coefficients  Standard Error  t Stat   P-value   Lower 95%  Upper 95%
             Intercept     12.76           1.26        10.12   2.06E-03     8.75      16.78
            LN{Dosage}     18.09           1.07        16.95   4.47E-04    14.70      21.49
A.   Both of these models are justified at α=0.05 based on significance test
     results shown in the two tables. Given only the evidence presented in
     these two tables (not the graph) which model would you select?
           ___________________________
     Give two reasons why you selected this particular model over the other
     model based on the regression results listed in the two tables. (Ignore the
     significance test results and the graph in responding to this question.)
           1.   ____________________________________
           2.   ____________________________________
B.   What would be the estimated weight gain in grams for dosage of 4.0
     mg?
           1.   Using the first model it is:                         _______
           2.   Using the second it is:                              _______
C.   Address the following questions using the information provided in the
     residual and QQ plots shown here below for the two models.




          1.    Given only the evidence presented in these four graphs which
                model would you select?
                     ___________________________
          2.    Give two reasons why you selected this particular model over
                the other model given here based only on the evidence
                provided in these four graphs.
                     a.    ______________________________
                     b.    ______________________________
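The two fitted models for weight gain can be placed side by side in a few lines. This sketch uses illustrative names and values copied from the two tables. It evaluates each model at a dosage of 4.0 mg and lists the summary statistics that a table-based comparison would typically look at. It is not a substitute for inspecting the residual and QQ plots.

```python
import math

# Summary statistics copied from the two tables (names are illustrative).
models = {
    "linear": {"r_square": 0.8344, "std_error": 10.6602},
    "log":    {"r_square": 0.9897, "std_error": 2.6632},
}


def gain_linear(dose_mg):
    """Fitted weight gain from Wt = b0 + b1*D."""
    return 1.50 + 7.18 * dose_mg


def gain_log(dose_mg):
    """Fitted weight gain from Wt = b0 + b1*ln(D)."""
    return 12.76 + 18.09 * math.log(dose_mg)


# Model with the smaller residual standard error.
tighter_fit = min(models, key=lambda name: models[name]["std_error"])

print(round(gain_linear(4.0), 2))
print(round(gain_log(4.0), 2))
print(tighter_fit)
```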
15. Birch tree seedlings grown in peat-filled containers were outplanted in August
    in a dry sandy soil. After four weeks
    the planted seedlings were examined
    for the level of mortality. It was
    observed that the percentage mortality
    (PM) was dependent upon the
    percentage water content (PWC) of the
    container peat at the time of planting.
    This relationship was plotted and a
    least-squares fit of PM=β0+β1PWC+ε
    was made. The observed values are
    shown on the chart along with the regression equation and its R² value.

     A.   Calculate the PM when the PWC is 45%:                         _________
     B.   What is the correlation coefficient value?                    _________
     C.   Very roughly plot the residuals on the chart provided here below.




     D.   What evidence is there for, or against, using a higher power (greater than
          the first) polynomial for the regression equation?
          ___________________________________________________
          ___________________________________________________
     E.   The procedure being used here in deciding upon the "form" of the
          relationship between PM and PWC is known as (circle one):
                (a)   Mechanistic           (b)   Approximation

				