VIEWS: 0 PAGES: 11 POSTED ON: 7/9/2011 Public Domain
Old project for STAT 870 What factors help determine a diamond’s price? The purpose of this project is to determine which factors and to predict price. Below is a partial listing of a data set that contains the price, carats, color, clarity, and certification body of 308 round cut diamonds sampled from a publication in 2000. Carat Color Clarity Certify Price 0.3 D VS2 GIA $745.92 0.3 E VS1 GIA $865.08 0.3 G VVS1 GIA $865.08 0.3 G VS1 GIA $721.86 0.31 D VS1 GIA $940.13 1.09 I VVS2 HRD $5,217.42 For a description of carats, color, and clarity, see the diamond guide included at the end of this document or see www.adiamondisforever.com/buy/4cs_print.html. Note that certify denotes the certification body which determines the carat, clarity, and color. In this data set, color, clarity, and certification body takes on the following values: Variable Values Color D, E, F, G, H, or I Clarity IF, VVS1, VVS2, VS1, or VS2 Gemmological Institute of America (GIA), Certify International Gemmological Institute (IGI), or Hoge Raad Voor Diamant (HRD) The data is stored in an Excel file called diamond_prices.xls which can be downloaded from the course website. Note that the data is located in “Set1” of the Excel file. Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. You may need to edit your output and code in order to make it look nice after you copy and paste it into your Word document. Use =0.05 for all hypothesis tests, confidence intervals, and prediction intervals. 1) Please use X = carat and Y = Price only. a) Perform an initial check of X and Y as described in Section 3.1 of the notes. > #This is another way to read in the data, which is different from the read.xls() function > library(RODBC) > z<-odbcConnectExcel("C:\\chris\\UNL\\STAT870\\projects\\fall2006\\ diamond_prices.xls") > diamonds<-sqlFetch(z, "Set1") > close(z) > head(diamonds) carat color clarity certify price 1 0.30 D VS2 GIA 745.9184 2 0.30 E VS1 GIA 865.0820 3 0.30 G VVS1 GIA 865.0820 4 0.30 G VS1 GIA 721.8565 5 0.31 D VS1 GIA 940.1322 1 6 0.31 E VS1 GIA 890.8626 > mod.fit<-lm(formula = price ~ carat, data = diamonds) > save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE, boxcox.find = TRUE) > save.it$sum.data Y X Min. : 365.5 Min. :0.1800 1st Qu.: 931.0 1st Qu.:0.3500 Median :2414.8 Median :0.6200 Mean :2875.7 Mean :0.6309 3rd Qu.:4265.8 3rd Qu.:0.8500 Max. :9171.0 Max. :1.1000 Box plot Dot plot 1.0 1.0 Predictor variable Predictor variable 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 Box plot Dot plot 2000 4000 6000 8000 2000 4000 6000 8000 Response variable Response variable Only the first plot created by examine.mod.simple() is shown here. There are no outlying values for price and carat. The range of the x-values is 0.18 to 1.1. b) Perform the first three steps of a residual analysis for the simple linear regression model that includes Price and Carat as the response and predictor variables, respectively. At each step, give written answers in the table format shown below. Adding an additional variable to the model and examining the independence of the i will not be examined for this project. Residual Step Analysis Comments Action to take Linearity of the Examined a plot of the residuals vs. carat and it shows a pattern among regression the residuals. The pattern looks to be possibly quadratic. function Examined a plot of the residuals vs. predicted values and it shows 1/3 Constant error somewhat of a funnel pattern (variability in residuals is increasing as a Try a Y variance ˆ function of Y ). Levene test rejects constant variance. The Box-Cox transformation as 1 1/3 method suggests a Y transformation. suggested by the Box- Examined QQ-plot and histogram of semi-studentized residuals and Cox method Normality of they show that the normality assumption may be violated. Both show a the i large departure from normality on the right side. Examined a plot of the semi-studentized residuals vs. predicted values Outliers and it shows 6 or more outliers. The largest is 5.90. Examined a plot of the residuals vs. carat and it shows possibly a Linearity of the pattern among the residuals (less than what was found in step 1). 2 Add a Carat term to 2 regression There seems to be more variability in the positive residual area as carat the model function increases. Given the plots seen in step #1 (and simply Y vs. X), it may be of interest to try a quadratic transformation for carat. 2 Residual Step Analysis Comments Action to take Examined a plot of the residuals vs. predicted values and there is not as much of a funnel pattern. In this case, there seems to be more Constant error ˆ variance variability in the positive residual area as Y increases. Levene test rejects constant variance. Possibly a different transformation of Y should be investigated. Examined QQ-plot and histogram of the semi-studentized residuals. Normality of The plots look better than before, but the same type of problem exists the i (not as severe). Examined a plot of the semi-studentized residuals vs. predicted values Outliers and it shows 3 observations a little larger than 3 (but less than 4). Linearity of the Examined a plot of the residuals vs. carat and it shows possibly a regression pattern among the residuals very similar to what was found in step #2 function Examined a plot of the residuals vs. predicted values and a similar Constant error pattern as in step #2 is found again. Levene test rejects constant Try a different 3 variance variance. Possibly a different transformation of Y should be transformation of Y investigated. Normality of Examined QQ-plot and histogram of the semi-studentized residuals. the i The plots are similar to what was found in #2. Examined a plot of the semi-studentized residuals vs. predicted values. Outliers Similar numbers of outliers (and locations) as in step #2 are found. > ############################## > #STEP #1 > mod.fit<-lm(formula = price ~ carat, data = diamonds) > summary(mod.fit) Call: lm(formula = price ~ carat, data = diamonds) Residuals: Min 1Q Median 3Q Max -1297.47 -346.19 -66.53 249.30 3776.27 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1316.73 90.82 -14.50 <2e-16 *** carat 6645.02 131.83 50.41 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 640.3 on 306 degrees of freedom Multiple R-Squared: 0.8925, Adjusted R-squared: 0.8922 F-statistic: 2541 on 1 and 306 DF, p-value: < 2.2e-16 > save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE, boxcox.find = TRUE) > save.it$lambda.hat [1] 0.33 > save.it$levene Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 1 26.852 3.995e-07 *** 306 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > max(abs(save.it$semi.stud.resid)) [1] 5.9 3 Response vs. predictor Residuals vs. predictor Residuals vs. observation number Histogram of semistud. residuals 2000 4000 6000 8000 3000 3000 0.4 Response variable 0.3 Residuals Residuals Density 1000 1000 0.2 0 0 0.1 -1000 -1000 0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 200 250 300 -2 0 2 4 6 Predictor variable Predictor variable Observation number Semistud. residuals * Residuals vs. estimated mean response ei vs. estimated mean response Normal Q-Q Plot Box-Cox transformation plot 95% 6 6 116 131 -3400 -3200 -3000 -2800 3000 279 Semistud. residuals Semistud. residuals 4 278 294 4 110 log-Likelihood Residuals 2 1000 2 0 0 0 -1000 -2 -2 0 1000 3000 5000 0 1000 3000 5000 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 Estimated mean response Estimated mean response Theoretical Quantiles > ############################## > #STEP #2 > > mod.fit2<-lm(formula = I(price^(1/3)) ~ carat, data = diamonds) > summary(mod.fit2) Call: lm(formula = I(price^(1/3)) ~ carat, data = diamonds) Residuals: Min 1Q Median 3Q Max -1.59498 -0.44873 0.01354 0.38392 2.95759 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.7922 0.1078 53.75 <2e-16 *** carat 12.0614 0.1564 77.11 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.7597 on 306 degrees of freedom Multiple R-Squared: 0.951, Adjusted R-squared: 0.9509 F-statistic: 5945 on 1 and 306 DF, p-value: < 2.2e-16 > save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE, boxcox.find = TRUE) > save.it2$lambda.hat [1] 0.98 > save.it2$levene Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 1 17.31 4.13e-05 *** 306 4 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > max(abs(save.it2$semi.stud.resid)) [1] 3.89 Response vs. predictor Residuals vs. predictor Residuals vs. observation number Histogram of semistud. residuals 0.5 3 3 10 12 14 16 18 20 0.4 Response variable 2 2 Residuals Residuals 0.3 Density 1 1 0.2 0 0 0.1 -1 -1 8 0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 200 250 300 -2 -1 0 1 2 3 4 Predictor variable Predictor variable Observation number Semistud. residuals * Residuals vs. estimated mean response ei vs. estimated mean response Normal Q-Q Plot Box-Cox transformation plot 95% 4 4 3 131 110 116 3 -850 3 Semistud. residuals Semistud. residuals 2 2 log-Likelihood 2 Residuals 1 1 1 -950 0 0 0 -3 -2 -1 -1 -1 -1050 -2 8 10 12 14 16 18 8 10 12 14 16 18 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 Estimated mean response Estimated mean response Theoretical Quantiles > ############################# > #STEP #3 > mod.fit3<-lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds) > summary(mod.fit3) Call: lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds) Residuals: Min 1Q Median 3Q Max -1.696707 -0.427723 0.002276 0.397153 3.161320 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.7856 0.2276 21.03 < 2e-16 *** carat 15.9938 0.8054 19.86 < 2e-16 *** I(carat^2) -3.1064 0.6250 -4.97 1.12e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.7319 on 305 degrees of freedom Multiple R-Squared: 0.9547, Adjusted R-squared: 0.9544 F-statistic: 3215 on 2 and 305 DF, p-value: < 2.2e-16 > #note that the response vs. predictor variable plot will not be meaningful here. > save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE, boxcox.find = TRUE) > save.it3$lambda.hat [1] -0.08 5 > save.it3$levene Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 1 17.075 4.644e-05 *** 306 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > max(abs(save.it3$semi.stud.resid)) [1] 4.32 Note that the response vs. predictor variable plot is not meaningful now since it is not plotting the model with b2carat2 Response vs. predictor Residuals vs. predictor Residuals vs. observation number Histogram of semistud. residuals 3 3 10 12 14 16 18 20 0.4 Response variable 2 2 0.3 Residuals Residuals Density 1 1 0.2 0 0 0.1 -1 -1 8 0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 200 250 300 -2 -1 0 1 2 3 4 Predictor variable Predictor variable Observation number Semistud. residuals * Residuals vs. estimated mean response ei vs. estimated mean response Normal Q-Q Plot Box-Cox transformation plot 95% 116 131 3 4 4 110 279 -780 Semistud. residuals Semistud. residuals 3 2 278 log-Likelihood 2 2 Residuals 1 1 -820 0 0 0 -1 -1 -2 -860 -2 8 10 12 14 16 18 8 10 12 14 16 18 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 Estimated mean response Estimated mean response Theoretical Quantiles 6 Suppose a carat2 is added first to the model instead. Residual Step Analysis Comments Action to take Linearity of the Examined a plot of the residuals vs. carat and it shows a pattern of more regression variability in the residuals as carat increases. function Examined a plot of the residuals vs. predicted values and there is a funnel Constant error pattern. Also, Levene test rejects constant variance. The Box-Cox method variance ˆ suggests a log() transformation ( = -0.03). 2 Use log(price) Examined QQ-plot and histogram of the semi-studentized residuals. The Normality of the QQ-plot shows strong deviations from normality on the tails of the i distribution (right side is definitely showing large deviations). The histogram is giving similar evidence, but not necessarily as obvious. Examined a plot of the semi-studentized residuals vs. predicted values and Outliers it shows 7 or more outliers. The largest is 6.03 Linearity of the Examined a plot of the residuals vs. carat and it is a random scattering of regression points. function Examined a plot of the residuals vs. predicted values and there is no funnel Constant error pattern. Levene test does not reject constant variance. The Box-Cox Examine the 3 variance 3 method suggests no transformation is needed with a C.I. for containing 1. possible outliers Examined QQ-plot and histogram of the semi-studentized residuals. Both more closely Normality of the look much better than before. The QQ-plot is showing a little deviation from i normality on the right tail. Examined a plot of the semi-studentized residuals vs. predicted values and Outliers it shows 3 points a little above 3. Step #2 code and output (note that the response vs. predictor variable plot is not meaningful now since it is not plotting the model with b2carat2) > mod.fit2<-lm(formula = price ~ carat + I(carat^2), data = diamonds) > summary(mod.fit2) Call: lm(formula = price ~ carat + I(carat^2), data = diamonds) Residuals: Min 1Q Median 3Q Max -1236.634 -243.633 -6.749 162.173 3514.703 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -24.35 181.25 -0.134 0.8932 carat 1596.16 641.43 2.488 0.0134 * I(carat^2) 3988.38 497.75 8.013 2.4e-14 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 582.9 on 305 degrees of freedom Multiple R-Squared: 0.9112, Adjusted R-squared: 0.9106 F-statistic: 1565 on 2 and 305 DF, p-value: < 2.2e-16 > save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE, boxcox.find = TRUE) > save.it2$lambda.hat [1] -0.03 7 > save.it2$levene Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 1 63.656 2.996e-14 *** 306 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > save.it2$bp Breusch-Pagan test data: Y ~ X BP = 100.9756, df = 1, p-value < 2.2e-16 > max(abs(save.it2$semi.stud.resid)) [1] 6.03 Response vs. predictor Residuals vs. predictor Residuals vs. observation number Histogram of semistud. residuals 3000 0.4 3000 2000 4000 6000 8000 Response variable 0.3 Residuals Residuals Density 1000 1000 0.2 0 0 0.1 -1000 -1000 0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 200 250 300 -2 0 2 4 6 Predictor variable Predictor variable Observation number Semistud. residuals * Residuals vs. estimated mean response ei vs. estimated mean response Normal Q-Q Plot Box-Cox transformation plot 95% -2700 131 6 6 3000 116 110 120 279 Semistud. residuals Semistud. residuals 4 294 4 -2900 278 log-Likelihood Residuals 2 1000 2 -3100 0 0 0 -2 -1000 -3300 -2 1000 3000 5000 1000 3000 5000 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 Estimated mean response Estimated mean response Theoretical Quantiles Step #3 code and output (note that the response vs. predictor variable plot is not meaningful now since it is not plotting the model with b2carat2) > mod.fit3<-lm(formula = log(price) ~ carat + I(carat^2), data = diamonds) > summary(mod.fit3) Call: lm(formula = log(price) ~ carat + I(carat^2), data = diamonds) Residuals: Min 1Q Median 3Q Max -0.451872 -0.088577 -0.004409 0.096854 0.500447 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.2235 0.0483 108.15 <2e-16 *** carat 5.4368 0.1709 31.81 <2e-16 *** I(carat^2) -2.0501 0.1326 -15.46 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 8 Residual standard error: 0.1553 on 305 degrees of freedom Multiple R-Squared: 0.9639, Adjusted R-squared: 0.9636 F-statistic: 4066 on 2 and 305 DF, p-value: < 2.2e-16 > save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE, boxcox.find = TRUE) > save.it3$lambda.hat [1] 0.79 > save.it3$levene Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 1 0.001 0.975 306 > save.it3$bp Breusch-Pagan test data: Y ~ X BP = 0.1054, df = 1, p-value = 0.7455 > max(abs(save.it3$semi.stud.resid)) [1] 3.22 Response vs. predictor Residuals vs. predictor Residuals vs. observation number Histogram of semistud. residuals 0.5 9.0 0.4 0.4 0.4 Response variable 0.2 0.2 8.0 Residuals Residuals 0.3 Density 0.0 0.0 0.2 7.0 -0.4 -0.2 -0.4 -0.2 0.1 6.0 0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 200 250 300 -3 -2 -1 0 1 2 3 Predictor variable Predictor variable Observation number Semistud. residuals * Residuals vs. estimated mean response ei vs. estimated mean response Normal Q-Q Plot Box-Cox transformation plot 131 95% -310 3 3 0.4 110 116 Semistud. residuals Semistud. residuals 2 2 0.2 -320 log-Likelihood 1 Residuals 1 0.0 0 0 -330 -1 -1 -0.4 -0.2 -340 -2 -2 -3 -3 6.5 7.0 7.5 8.0 8.5 6.5 7.0 7.5 8.0 8.5 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 Estimated mean response Estimated mean response Theoretical Quantiles Suppose a log() transformation of price is done first instead. Without going through all of the R code and output here, the next step suggested by the residual analysis would be to add carat 2 to the model resulting in the same model as above. c) (3 points) State the sample model obtained in step 3. For the rest of this problem, I will be working with the model obtained from the second and third method in my part b). 9 log(price) 5.2235 5.4368carat 2.0501carat2 d) Suppose you have chosen the correct regression model and all of the model’s assumptions are satisfied! Using a standard normal distribution approximation for the semi-studentized residuals, approximately how many observations would you expect to detect as outliers? Remember that an outlier is roughly defined as a semi-studentized residual outside of the –3 and +3 bounds. Note that you do not need anything from the data set other than n=308 to answer this question. Let Z be a standard normal random variable. Then P(Z<-3 or Z>3) = 0.0027. Using the binomial distribution, let a “success” be defined as a semi-studentized residual outside of the -3 and +3 bounds. The mean for a binomial random variable is =n. In this case, n=308, =0.0027, and =3080.0027 = 0.8316. Thus, one would expect to find about 1 semi- studentized residual outside of the -3 and +3 bounds. One could go further to answer the question. Using the rule of thumb for the number of standard deviations all data lies from its mean, a binomial random variable should be observed between 3 where = n(1 ) . Thus, one would expect the number of semi-studentized residuals outside of -3 and +3 to be 0.83163 308 0.0027(1 0.0027) = 0.8316 2.7321 = (-1.90, 3.56). Notice that there were 3 semi-studentized residuals outside of the –3 and +3 bounds. Thus, this is a likely occurrence and we should not be alarmed by seeing these outliers. e) How well is the model doing in predicting price? Explain your answer using R 2. R2 = 0.9639 from output in part b). The model is doing a very good job at predicting price since the R2 is close to 1. f) Suppose two people are about to get engaged and they plan to purchase a diamond engagement ring. You offer your statistical expertise to aid them in pricing diamonds. Suppose the couple is interested in a 0.5 carat diamond. Using the resulting model stated in c), what is the estimated price of a 0.5 carat diamond? Use a point estimate of price to answer the question. How to calculate a prediction interval will be discussed later in the course. The sample regression model is log(price) = 5.2235 + 5.4368Carat – 2.0502Carat2. Thus, the estimate for price is exp(5.2235 + 5.4368Carat – 2.0502Carat2) = exp[5.2235 + 5.4368(0.5) – 2.0502(0.5)2] = $1,684.79 > exp(predict(object = mod.fit3, newdata = data.frame(carat = 0.5))) [1] 1684.787 10 11