
Examples Week 9

1. Examine the general patterns exhibited in the following charts and respond to the questions that follow. (Do not count the number of observations shown in each chart or scrutinize specific data points, just observe the general pattern.) The independent variable is indicated by X, the dependent by Y, and the residual by R.

[Charts 1, 2 and 3: raw data plots of Y against X. Charts 4, 5 and 6: residual plots of R against X, each with a reference line at R = 0.]

A. Assume that a straight line has been fit to each of the raw data plots shown in charts 1, 2 and 3. Make the best match of the most probable residual plot associated with each regression. Each residual plot (charts 4, 5 and 6) is to be assigned only once.
   Chart 1: Chart __
   Chart 2: Chart __
   Chart 3: Chart __
B. Which residual chart best illustrates heteroscedasticity? Chart __
C. The simple linear regression of which data set is most likely to have a negative coefficient of correlation, r? Chart __
D. Which data set is the best candidate for fitting a quadratic relationship? Chart __
E. Which data set is the best candidate for a non-linear model based on data transformation? Chart __

2. A study was conducted on how the yield of tomatoes (Y kilograms per hectare) varied with the amount of fertilizer applied (X kilograms per hectare). The amount of fertilizer applied was varied from none to 5000 kg/ha in this experiment. The hypothesized model was: Y = β0 + β1X + ε. A simple least-squares fit of the data yielded the following regression program output:

             df    SS         MS         F      Signif F
Regression   1     7860430    7860430    109    1.07E-06
Residual     10    721080     72108
Total        11    8581510

             Coeff     Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    592.96    137.42      4.31     1.53E-03   286.8       899.2
Fertilizer   0.484     0.046       10.44    1.07E-06   0.38        0.59

A. What is the expected yield of tomatoes if 5000 kilograms of fertilizer are applied per hectare? _________
B.
In order to test whether the regression is significant at the 0.05 level of significance we can compare the Ftest value with the Ftable value of: _________
C. In the test of part B above it is decided to (circle one): ACCEPT / REJECT the null hypothesis.
D. Write out the null hypothesis of B above using the symbols of our text. H0: ______________
E. It is almost certain that the mean yield estimate calculated in part A above is too high. Why? __________________________________________________________

3. Observations as listed below were collected on randomly selected individuals. The relationship between hearing loss (Y, measured in decibels) and exposure time (X, measured in weeks) to high noise levels was examined using simple linear regression.

Y:   11.9   12.6   12.7   13.2   13.8   14.0   14.1   14.6   14.7   14.8   15.1   15.3
X:   12     19     31     43     47     56     74     75     116    160    164    178

             Coefficients   Standard Error   t Stat
Intercept    12.55          0.251            50.00
X            0.0166         0.00254          6.55

A. What is the expected level of hearing loss if an individual is exposed to these high noise levels for a period of 52 weeks? _____________
B. For a test of H0: β1 = 0 against the alternative H1: β1 ≠ 0, at the α = 0.05 level, answer the following:
   1. The table value of the test statistic is: _____________
   2. The decision is to (circle one): Reject H0 / Accept H0
   3. Using this equation (the form of which was selected on an empirical basis) to predict hearing loss for a 6 year exposure period now seems quite justified (circle one): Agree / Disagree

4. Answer the following questions by either writing in the correct word or phrase or circling the number of the correct answer.
A. When contemplating the use of simple linear regression on a set of data the typical first step in the analysis consists of ___________________ the data.
B. When prior, well-established theory suggests a theoretical form for the regression equation and this form is fit to the observed data we are using a model selection approach referred to as being (circle one):
   1.
      (rational or mechanistic)
   2. (empirical or approximate)
   3. (neither of these two)
C. When the variance of the error term in linear regression is constant across all values of the independent variable it is referred to as being ______________________
D. When a very large sample is taken and a hypothesis about the value of the population mean is to be tested using Student's t it is important that the observations (circle one):
   1. (be taken from a normally distributed population)
   2. (be independently distributed)
   3. (meet both of these two listed conditions)

5. Match each of the lettered terms below with the phrase most closely related to it (each term is to be used only once).
   A. Multiple linear regression
   B. Confidence interval
   C. Hypothesis testing
   D. Simple linear regression
   E. Correlation matrix

   _____ involves mutually exclusive "states of nature"
   _____ identifying collinearity between variables
   _____ involves more than one independent variable
   _____ linear in the coefficients
   _____ requires information about the variance

6. A regression has been developed to predict gasoline mileage for a car based on its weight in pounds. Observed car weights run from 2000 to 4000 pounds and mileage from 15 to 35 miles per gallon. The following output has been obtained:

Regression Statistics
R Square         0.82
Standard Error   1.94
Observations     89

ANOVA
             df    SS         MS         F       Signif F
Regression   1     1484.55    1484.55    396.4   3.77E-34
Residual     87    325.86     3.75
Total        88    1810.40

             Coeff     Std Error   t Stat   P-value
Intercept    47.89     1.25        38.2     3.48E-56
Weight       -0.0079   0.00039     -19.9    3.77E-34

Answer the following questions based on this output.
A. For a car weighing 2500 pounds the predicted mileage is: ________
B. A point estimate for the variance of ε is: ________

7. Consider the following SLR plots that can be a part of the output when fitting the equation Y = β0 + β1X + ε to 3 different sets of data. (Each of these plots is to be used once, and only once, in responding to the 3 questions that follow.)
A.
Which of these plots comes closest to suggesting that a quadratic equation might best fit the data? _____________
B. Which of these plots is most suggestive that a good model has already been selected for the data? _____________
C. Which of these plots comes closest to suggesting that a cubic equation might best fit the data? _____________

8. Answer the following questions about simple linear regression (circle or write in the answer as required):
A. From the formula [Σ e_i^2 / (n - 2)]^(1/2) a point estimate of what parameter may be obtained (symbol)? _____________
B. What three key assumptions are typically required about the nature of the errors around the regression line?
   ___________________________________________________
   ___________________________________________________
   ___________________________________________________
C. When using SLR to approximate an unknown functional form we are typically most interested in testing which of the following null hypotheses:
   (1) H0: β0 = 0   (2) H0: β1 = 0   (3) Equally interested in both
D. Simple linear regression can never be used when fitting any curvilinear relationship between a dependent and an independent variable; i.e., it must be a linear relationship between X and Y: T / F
E. A negative coefficient of correlation suggests:
   (1) β1 > 0   (2) heteroscedasticity   (3) Neither

9. Answer the following questions about the linear regression output given below which has been obtained when fitting the model: P = [β0 + β1*A*M + ε]^(-1), where P is the purchase price (in thousands of dollars) paid when buying a Datzun Z car that has an age A (in years) and has been driven a distance M (in thousands of miles) at the time of sale. Assume that these observations were randomly selected from all Datzun Z car sales reported to the department of motor vehicles in the state of California during a five-year period.

Regression Statistics
Multiple R       0.96
R Square         0.91
Standard Error   0.0012
Observations     11

ANOVA
             df    SS          MS          F      Signif F
Regression   1     0.000133    0.000133    94.2   4.58E-06
Residual     9     1.27E-05    1.406E-06
Total        10    0.0001452

             Coeff      Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    0.00564    0.000780    7.22     4.95E-05   0.0039      0.0074
Age*Miles    2.19E-05   2.25E-06    9.71     4.58E-06   1.68E-05    2.69E-05

A. How much of the total variability observed in 1/P is explained by the hypothesized form of the relationship with A and M? ________
B. What is the best estimate, based on this regression, of the average purchase price of Datzun Z cars when bought new? ________
C. What is (are) the critical value(s) for a t-test of H0: β1 = 0 against H1: β1 ≠ 0 when using a type-one error level of one percent? ________
D. When running the test in question C for this problem above why should you be cautious about the result? _________________________________________________________

10. This table and the 2 graphs below it contain SLR output based on data from a study of Holstein-Friesian milk cows. Respond to the questions that follow by referring to the table entries and the plotted information about the residuals.

Regression Statistics
Multiple R   0.98
R Square     0.96
Std Error    0.0420
Obs          14

ANOVA
             df    SS       MS       F        Signif F
Regression   1     0.4602   0.4602   260.48   1.68E-09
Residual     12    0.0212   0.0018
Total        13    0.4814

             Coeff    Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    0.176    0.0464      3.78     2.60E-03   0.0745      0.277
X            0.0246   0.0015      16.14    1.68E-09   0.0213      0.0279

A. Write out the linear equation relating the independent variable, milk production (kg/day), to the dependent variable, milk protein (kg/day). ___________________________________________
B. The R-square value of 0.9560 in the table indicates that 95.6% of the total variability in __________________ is explained by this ______________ relationship with __________________.
C.
Assuming that a model of the form Y = β0 + β1X + ε with ε ~ N(0, σ²) is appropriate, and given that it has been fit as shown in the table above, the best point estimate of the std. dev. of the pop. error around the line is: _________
D. Setting the type I error at 5% we (circle one) can / cannot reject H0: β1 = 0.0 in favor of H1: β1 ≠ 0.0
E. The 95% CI for the population value for the slope of the line is: _________
F. What is the estimated average amount of milk protein (kg/day) to be expected from cows producing 30 kg of milk per day? _________
G. The accuracy of the estimated average amount of milk protein calculated for question F above is problematic due to extrapolation (circle one): True / False
H. The assumption of homoscedasticity in the residuals is justified because of: ________________________________________________________
I. The assumption of normality in the distribution of the residuals is justified because of: ________________________________________________________
J. In running the test on H0:=0.0 versus H1:≠0.0 with α = 0.05 it is found that: ttest = ___________ and ttable = ±__________
K. In running the test on H0: β1 ≥ 0.03 versus H1: β1 < 0.03 with α = 0.05 it is found that: ttest = ___________ and ttable = ±__________

11. A simple linear regression model, Y = β0 + β1 X^(1/2) + ε, with ε ~ N(0, σ²), is run on a data set with the following results:

Regression Statistics
Multiple R     0.978
R Square       0.956
Std Error      0.679
Observations   10

ANOVA
             df    SS      MS      F        Signif F
Regression   1     80.33   80.33   174.47   1.03E-06
Residual     8     3.68    0.46
Total        9     84.01

             Coeff   Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    2.028   0.6908      2.94     0.019      0.435       3.621
SQRT(X)      4.877   0.3692      13.21    1.03E-06   4.026       5.729

Answer the following questions with regard to this model and the output presented here.
A. Give the point estimate for each of the following parameters:
   1. ρ ________
   2. σ ________
   3. β1 ________
   4. β0 ________
B.
An estimate is needed for the value of the dependent variable when the independent variable is 49.
   1. The best estimate given by this relationship is: ____________
   2. This estimate is suspect due to the fact that it is based on an (one word) _________________________
C. A test is run on: H0: β0 = 0 versus H1: β0 ≠ 0 with α = 0.01. The decision is to: ____________
D. A test is run on: H0: β1 ≤ 4.0 versus H1: β1 > 4.0 with α = 0.05.
   1. The value of the test statistic is: _______________
   2. The table value is: _______________
E. The following assumptions are justified by what evidence given here (be specific in your interpretation)?
   1. Homoscedasticity ________________________________________________
   2. Normality ________________________________________________

12. It is suspected that the protein content (percentage) of wheat varies with the yield (bushels per acre). Data are analyzed with the following results when SLR is run using the standard model: Protein = β0 + β1 * Yield + ε, with ε ~ Normal(0, σ²).

Multiple R     0.781
R Square       0.611
Std Error      1.420
Observations   19

ANOVA
             df    SS      MS      F       Signif F
Regression   1     53.76   53.76   26.66   7.80E-05
Residual     17    34.29   2.02
Total        18    88.05

             Coeff     Std Error   t Stat    P-value    Lower 90.0%   Upper 90.0%
Intercept    16.0536   0.7542      21.2868   1.08E-13   14.74         17.37
Yield        -0.1585   0.0307      -5.1629   7.80E-05   -0.21         -0.11

A. What is the expected percentage protein content when the yield per acre is 30 bushels? _______
B. What proportion of the total variation in protein content is explained by the linear relationship with yield? _______
C. What is the point estimate for σ? _______
D. The p-value of 7.80*10^-5 clearly supports rejecting what standard SLR null? (Give formal null statement.) _______
E. It could be surmised that a nonlinear equation form, e.g., a quadratic, might provide a good fit to these data. This conjecture is supported by the: ____________________________________________________

13.
Houseflies are known to generally emerge faster at higher temperatures. Observations on number of days, Y, to emergence at different temperatures, X, have been obtained and simple linear regression (SLR) has been run on two different forms, Form I: Y = β0 + β1X + ε and Form II: Y^(-1) = β0 + β1X + ε. All important factors other than temperature have been either controlled or randomized in the collection of these data.

Form I
R Square         0.876
Standard Error   2.245
Observations     36

ANOVA
             df    SS      MS      F      Signif F
Regression   1     1208.   1208.   239.   5.74E-17
Residual     34    171.    5.
Total        35    1379.

             Coeff    Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    71.24    3.658       19.5     5.13E-20   63.81       78.67
Temp         -0.732   0.0473      -15.5    5.74E-17   -0.828      -0.636

Form II
R Square         0.934
Standard Error   0.0071
Observations     36

ANOVA
             df    SS         MS        F      Signif F
Regression   1     0.02431    0.0243    483.   1.13E-21
Residual     34    0.001712   5.E-05
Total        35    0.02602

             Coeff     Std Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept    -0.175    0.01156     -15.1    1.15E-16   -0.198      -0.151
Temp         0.00328   0.0001494   22.0     1.13E-21   0.00298     0.00359

Answer the following questions based on the results given here.
A. A prediction of the emergence time (in days) is to be made for the case where temperature is held at 75ºF.
   1. The predicted time using Form I is (give answer to 1 decimal place): ________________
   2. The predicted time using Form II is (give answer to 1 decimal place): ________________
B. There is clear evidence that what assumption(s) of SLR has (have) been violated by Form II? _____________________ _____________________
C. One of the methods we have examined for choosing a form for the equation to be fit to the data is that of polynomial approximation to an unknown functional relationship between dependent and independent variable(s). Considering the SLR results for Form I it would appear worthwhile to try the form: ________________
D.
The proportion of the total variance in the inverse of the observed emergence times that has been "explained" by the __________________________ with temperature is _____.
E. Assume now that all necessary assumptions have been met to test the null hypothesis that β1 ≤ 0.003 against the alternative that β1 > 0.003 for Form II. Use the given numerical values and a 5% level of significance.
   1. The value of the test statistic is: ________________
   2. The table value is: ________________

14. A test was conducted in which five different groups of rats were maintained on a diet deficient in Vitamin A and then given supplementary rations of Vitamin A in the form of cod-liver oil. The dosage of cod-liver oil (D) given to each group of rats and the corresponding mean weight gains (Wt) are plotted in the graph. A simple linear regression (SLR) model, Wt = β0 + β1D + ε, is run first with the results shown in the table immediately below.

Regression Statistics
Multiple R       0.9135
R Square         0.8344
Standard Error   10.6602
Observations     5

ANOVA
             df    SS        MS        F       Significance F
Regression   1     1718.15   1718.15   15.12   3.02E-02
Residual     3     340.92    113.64
Total        4     2059.07

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept     1.50           6.70             0.22     8.37E-01   -19.83      22.83
Dosage (mg)   7.18           1.85             3.89     3.02E-02   1.30        13.06

A second model, Wt = β0 + β1 Loge{D} + ε, is then run with the results shown in this second table. Respond to the questions that follow with regard to these two models. (Note: Loge{D} and LN{D} both represent the natural logarithm of D.)

Regression Statistics
Multiple R       0.9948
R Square         0.9897
Standard Error   2.6632
Observations     5

ANOVA
             df    SS        MS        F        Significance F
Regression   1     2037.79   2037.79   287.31   4.47E-04
Residual     3     21.28     7.09
Total        4     2059.07

              Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept     12.76          1.26             10.12    2.06E-03   8.75        16.78
LN{Dosage}    18.09          1.07             16.95    4.47E-04   14.70       21.49

A.
Both of these models are justified at α = 0.05 based on significance test results shown in the two tables. Given only the evidence presented in these two tables (not the graph) which model would you select? ___________________________
   Give two reasons why you selected this particular model over the other model based on the regression results listed in the two tables. (Ignore the significance test results and the graph in responding to this question.)
   1. ____________________________________
   2. ____________________________________
B. What would be the estimated weight gain in grams for dosage of 4.0 mg?
   1. Using the first model it is: _______
   2. Using the second it is: _______
C. Address the following questions using the information provided in the residual and QQ plots shown here below for the two models.
   1. Given only the evidence presented in these four graphs which model would you select? ___________________________
   2. Give two reasons why you selected this particular model over the other model given here based only on the evidence provided in these four graphs.
      a. ______________________________
      b. ______________________________

15. Birch tree seedlings grown in peat-filled containers were outplanted in August in a dry sandy soil. After four weeks the planted seedlings were examined for the level of mortality. It was observed that the percentage mortality (PM) was dependent upon the percentage water content (PWC) of the container peat at the time of planting. This relationship was plotted and a least-squares fit of PM = β0 + β1 PWC + ε was made. The observed values are shown on the chart along with the regression equation and its R² value.
A. Calculate the PM when the PWC is 45%: _________
B. What is the correlation coefficient value? _________
C. Very roughly plot the residuals on the chart provided here below.
D. What evidence is there for, or against, using a higher power (greater than the first) polynomial for the regression equation?
   ___________________________________________________
   ___________________________________________________
E. The procedure being used here in deciding upon the "form" of the relationship between PM and PWC is known as (circle one): (a) Mechanistic (b) Approximation
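
Several of the problems above ask for a predicted value from fitted coefficients, or for a t statistic against a null value other than zero. As a rough way to check that kind of arithmetic, here is a minimal Python sketch using the Form I and Form II coefficients printed in problem 13. The function names predict_linear and t_statistic are illustrative assumptions and do not come from the course material. The table (critical) value is not computed here; it would still have to be looked up for the residual degrees of freedom.

```python
def predict_linear(b0: float, b1: float, x: float) -> float:
    """Fitted value b0 + b1*x, on the scale the model was fit on."""
    return b0 + b1 * x


def t_statistic(estimate: float, hypothesized: float, std_error: float) -> float:
    """t = (estimate - hypothesized value) / standard error of the estimate."""
    return (estimate - hypothesized) / std_error


# Form I: Y = b0 + b1*X, coefficients copied from the Form I table.
days_form1 = predict_linear(71.24, -0.732, 75.0)

# Form II: 1/Y = b0 + b1*X, so the fitted value is 1/days and
# has to be inverted to get back to days.
inv_days_form2 = predict_linear(-0.175, 0.00328, 75.0)
days_form2 = 1.0 / inv_days_form2

# Test statistic for a slope tested against 0.003 rather than 0
# (Form II slope and its standard error, from the table).
t_form2 = t_statistic(0.00328, 0.003, 0.0001494)

print(f"Form I prediction at 75 F:  {days_form1:.1f} days")
print(f"Form II prediction at 75 F: {days_form2:.1f} days")
print(f"t statistic for slope vs 0.003: {t_form2:.2f}")
```

The point of the Form II branch is that a prediction from a transformed response has to be back-transformed before it is reported in the original units. The same two helpers apply to the other problems by substituting the coefficients, standard errors and null values from the relevant table.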
