# Project _2 by shuifanglj

VIEWS: 0 PAGES: 11

• pg 1
```									Old project for STAT 870

What factors help determine a diamond’s price? The purpose of this project is to determine which
factors and to predict price. Below is a partial listing of a data set that contains the price, carats,
color, clarity, and certification body of 308 round cut diamonds sampled from a publication in 2000.

Carat Color Clarity Certify         Price
0.3   D     VS2     GIA           \$745.92
0.3   E     VS1     GIA           \$865.08
0.3   G    VVS1     GIA           \$865.08
0.3   G     VS1     GIA           \$721.86
0.31   D     VS1     GIA           \$940.13
                                
1.09    I   VVS2 HRD              \$5,217.42

For a description of carats, color, and clarity, see the diamond guide included at the end of this
certification body which determines the carat, clarity, and color. In this data set, color, clarity, and
certification body takes on the following values:

Variable      Values
Color        D, E, F, G, H, or I
Clarity      IF, VVS1, VVS2, VS1, or VS2
Gemmological Institute of America (GIA),
Certify     International Gemmological Institute (IGI), or

The data is stored in an Excel file called diamond_prices.xls which can be downloaded from the
course website. Note that the data is located in “Set1” of the Excel file.

Complete the following problems below. Within each part, include your R program output with code
inside of it and any additional information needed to explain your answer. You may need to edit your
output and code in order to make it look nice after you copy and paste it into your Word document.
Use =0.05 for all hypothesis tests, confidence intervals, and prediction intervals.
1) Please use X = carat and Y = Price only.
a) Perform an initial check of X and Y as described in Section 3.1 of the notes.

> #This is another way to read in the data, which is different from the read.xls()
function
> library(RODBC)
> z<-odbcConnectExcel("C:\\chris\\UNL\\STAT870\\projects\\fall2006\\
diamond_prices.xls")
> diamonds<-sqlFetch(z, "Set1")
> close(z)
carat color clarity certify    price
1 0.30       D    VS2     GIA 745.9184
2 0.30       E    VS1     GIA 865.0820
3 0.30       G   VVS1     GIA 865.0820
4 0.30       G    VS1     GIA 721.8565
5 0.31       D    VS1     GIA 940.1322
1
6    0.31         E       VS1                                         GIA 890.8626

>    mod.fit<-lm(formula = price ~ carat, data = diamonds)
>    save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE,
boxcox.find = TRUE)
>    save.it\$sum.data
Y                 X
Min.    : 365.5   Min.    :0.1800
1st Qu.: 931.0    1st Qu.:0.3500
Median :2414.8    Median :0.6200
Mean    :2875.7   Mean    :0.6309
3rd Qu.:4265.8    3rd Qu.:0.8500
Max.    :9171.0   Max.    :1.1000
Box plot                                               Dot plot

1.0

1.0
Predictor variable

Predictor variable
0.8

0.8
0.6

0.6
0.4

0.4
0.2

0.2
Box plot                                               Dot plot
2000 4000 6000 8000

2000 4000 6000 8000
Response variable

Response variable

Only the first plot created by examine.mod.simple() is shown here. There are no outlying
values for price and carat. The range of the x-values is 0.18 to 1.1.

b) Perform the first three steps of a residual analysis for the simple linear regression model that
includes Price and Carat as the response and predictor variables, respectively. At each step,
model and examining the independence of the i will not be examined for this project.

Residual
Step      Analysis                                                                       Comments                                                        Action to take
Linearity of the
Examined a plot of the residuals vs. carat and it shows a pattern among
regression
the residuals. The pattern looks to be possibly quadratic.
function
Examined a plot of the residuals vs. predicted values and it shows
1/3
Constant error      somewhat of a funnel pattern (variability in residuals is increasing as a                                                        Try a Y
variance                        ˆ
function of Y ). Levene test rejects constant variance. The Box-Cox                                                         transformation as
1                                                                 1/3
method suggests a Y transformation.                                                                       suggested by the Box-
Examined QQ-plot and histogram of semi-studentized residuals and                                                              Cox method
Normality of
they show that the normality assumption may be violated. Both show a
the i
large departure from normality on the right side.
Examined a plot of the semi-studentized residuals vs. predicted values
Outliers
and it shows 6 or more outliers. The largest is 5.90.
Examined a plot of the residuals vs. carat and it shows possibly a
Linearity of the       pattern among the residuals (less than what was found in step 1).                                                                   2
2       regression       There seems to be more variability in the positive residual area as carat
the model
function        increases. Given the plots seen in step #1 (and simply Y vs. X), it may
be of interest to try a quadratic transformation for carat.

2
Residual
Step       Analysis                                        Comments                                   Action to take
Examined a plot of the residuals vs. predicted values and there is not
as much of a funnel pattern. In this case, there seems to be more
Constant error                                                     ˆ
variance          variability in the positive residual area as Y increases. Levene test
rejects constant variance. Possibly a different transformation of Y
should be investigated.
Examined QQ-plot and histogram of the semi-studentized residuals.
Normality of
The plots look better than before, but the same type of problem exists
the i
(not as severe).
Examined a plot of the semi-studentized residuals vs. predicted values
Outliers
and it shows 3 observations a little larger than 3 (but less than 4).
Linearity of the
Examined a plot of the residuals vs. carat and it shows possibly a
regression
pattern among the residuals very similar to what was found in step #2
function
Examined a plot of the residuals vs. predicted values and a similar
Constant error       pattern as in step #2 is found again. Levene test rejects constant
Try a different
3       variance              variance. Possibly a different transformation of Y should be
transformation of Y
investigated.
Normality of       Examined QQ-plot and histogram of the semi-studentized residuals.
the i                     The plots are similar to what was found in #2.
Examined a plot of the semi-studentized residuals vs. predicted values.
Outliers
Similar numbers of outliers (and locations) as in step #2 are found.

>   ##############################
>   #STEP #1
>   mod.fit<-lm(formula = price ~ carat, data = diamonds)
>   summary(mod.fit)
Call:
lm(formula = price ~ carat, data = diamonds)

Residuals:
Min      1Q             Median          3Q         Max
-1297.47 -346.19             -66.53      249.30     3776.27

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1316.73      90.82 -14.50    <2e-16 ***
carat         6645.02    131.83   50.41   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 640.3 on 306 degrees of freedom
Multiple R-Squared: 0.8925,     Adjusted R-squared: 0.8922
F-statistic: 2541 on 1 and 306 DF, p-value: < 2.2e-16

>   save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE,
boxcox.find = TRUE)
>   save.it\$lambda.hat
[1] 0.33
>   save.it\$levene
Levene's Test for Homogeneity of Variance
Df F value    Pr(>F)
group    1 26.852 3.995e-07 ***
306
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>   max(abs(save.it\$semi.stud.resid))
[1] 5.9

3
Response vs. predictor                                                  Residuals vs. predictor                                              Residuals vs. observation number                                                      Histogram of semistud. residuals
2000 4000 6000 8000

3000

3000

0.4
Response variable

0.3
Residuals

Residuals

Density
1000

1000

0.2
0

0

0.1
-1000

-1000

0.0
0.2        0.4    0.6      0.8      1.0                                 0.2     0.4       0.6      0.8      1.0                                     0    50    100 150      200   250 300                                                     -2        0       2        4   6

Predictor variable                                                      Predictor variable                                                       Observation number                                                                   Semistud. residuals

*
Residuals vs. estimated mean response                                                         ei vs. estimated mean response                                                        Normal Q-Q Plot                                                              Box-Cox transformation plot
95%
6

6
116 131

-3400 -3200 -3000 -2800
3000

279
Semistud. residuals

Semistud. residuals
4

278 294

4
110

log-Likelihood
Residuals

2
1000

2
0
0

0
-1000

-2

-2
0     1000         3000           5000                                  0     1000         3000           5000                                      -3   -2     -1     0      1       2   3                                              -2        -1       0        1       2

Estimated mean response                                              Estimated mean response                                                        Theoretical Quantiles

>          ##############################
>          #STEP #2
>
>          mod.fit2<-lm(formula = I(price^(1/3)) ~ carat, data = diamonds)
>          summary(mod.fit2)

Call:
lm(formula = I(price^(1/3)) ~ carat, data = diamonds)

Residuals:
Min       1Q                                             Median                           3Q                 Max
-1.59498 -0.44873                                            0.01354                      0.38392             2.95759

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.7922    0.1078   53.75   <2e-16 ***
carat         12.0614    0.1564   77.11   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7597 on 306 degrees of freedom
Multiple R-Squared: 0.951,      Adjusted R-squared: 0.9509
F-statistic: 5945 on 1 and 306 DF, p-value: < 2.2e-16

>   save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE,
boxcox.find = TRUE)
>   save.it2\$lambda.hat
[1] 0.98

>   save.it2\$levene
Levene's Test for Homogeneity of Variance
Df F value   Pr(>F)
group   1   17.31 4.13e-05 ***
306
4
---
Signif. codes:                         0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>   max(abs(save.it2\$semi.stud.resid))
[1] 3.89
Response vs. predictor                                                          Residuals vs. predictor                                            Residuals vs. observation number                                    Histogram of semistud. residuals

0.5
3

3
10 12 14 16 18 20

0.4
Response variable

2

2
Residuals

Residuals

0.3
Density
1

1

0.2
0

0

0.1
-1

-1
8

0.0
0.2        0.4        0.6        0.8   1.0                                    0.2       0.4        0.6        0.8    1.0                                  0    50    100 150      200   250 300                                   -2   -1   0       1   2       3   4

Predictor variable                                                           Predictor variable                                                      Observation number                                                  Semistud. residuals

*
Residuals vs. estimated mean response                                                             ei vs. estimated mean response                                                        Normal Q-Q Plot                                            Box-Cox transformation plot
95%
4

4
3

131
110        116
3

-850
3
Semistud. residuals

Semistud. residuals
2

2

log-Likelihood
2
Residuals

1
1

1

-950
0
0

0
-3 -2 -1

-1
-1

-1050
-2

8         10     12         14    16   18                                     8        10     12         14     16   18                                   -3   -2     -1     0      1       2   3                            -2        -1       0           1       2

Estimated mean response                                                      Estimated mean response                                                      Theoretical Quantiles

>           #############################
>           #STEP #3

>           mod.fit3<-lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds)
>           summary(mod.fit3)

Call:
lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds)

Residuals:
Min        1Q                                                        Median                               3Q                                  Max
-1.696707 -0.427723                                                      0.002276                         0.397153                             3.161320

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.7856    0.2276   21.03 < 2e-16 ***
carat         15.9938    0.8054   19.86 < 2e-16 ***
I(carat^2)    -3.1064    0.6250   -4.97 1.12e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7319 on 305 degrees of freedom
Multiple R-Squared: 0.9547,     Adjusted R-squared: 0.9544
F-statistic: 3215 on 2 and 305 DF, p-value: < 2.2e-16

>  #note that the response vs. predictor variable plot will not be meaningful here.
>   save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE,
boxcox.find = TRUE)
>   save.it3\$lambda.hat
[1] -0.08

5
>   save.it3\$levene
Levene's Test for Homogeneity of Variance
Df F value    Pr(>F)
group   1 17.075 4.644e-05 ***
306
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>   max(abs(save.it3\$semi.stud.resid))
[1] 4.32

Note that the response vs. predictor variable plot is not meaningful now since it is not plotting
the model with b2carat2
Response vs. predictor                                                    Residuals vs. predictor                                                  Residuals vs. observation number                                   Histogram of semistud. residuals
3

3
10 12 14 16 18 20

0.4
Response variable

2

2

0.3
Residuals

Residuals

Density
1

1

0.2
0

0

0.1
-1

-1
8

0.0
0.2     0.4       0.6        0.8        1.0                               0.2     0.4       0.6        0.8        1.0                                     0    50    100 150      200   250 300                                  -2   -1    0   1    2       3   4

Predictor variable                                                        Predictor variable                                                           Observation number                                                 Semistud. residuals

*
Residuals vs. estimated mean response                                                         ei vs. estimated mean response                                                            Normal Q-Q Plot                                           Box-Cox transformation plot
95%
116       131
3

4

4

110
279

-780
Semistud. residuals

Semistud. residuals

3
2

278

log-Likelihood
2

2
Residuals

1

1

-820
0
0

0
-1
-1

-2

-860
-2

8      10      12      14         16     18                               8      10      12      14         16     18                                    -3   -2     -1     0      1       2   3                           -2         -1       0        1           2

Estimated mean response                                                   Estimated mean response                                                            Theoretical Quantiles

6

Residual
Step       Analysis                                         Comments                                         Action to take
Linearity of the
Examined a plot of the residuals vs. carat and it shows a pattern of more
regression
variability in the residuals as carat increases.
function
Examined a plot of the residuals vs. predicted values and there is a funnel
Constant error     pattern. Also, Levene test rejects constant variance. The Box-Cox method
variance                                                               ˆ
suggests a log() transformation (  = -0.03).
2                                                                                                            Use log(price)
Examined QQ-plot and histogram of the semi-studentized residuals. The
Normality of the          QQ-plot shows strong deviations from normality on the tails of the
i            distribution (right side is definitely showing large deviations). The histogram
is giving similar evidence, but not necessarily as obvious.
Examined a plot of the semi-studentized residuals vs. predicted values and
Outliers
it shows 7 or more outliers. The largest is 6.03
Linearity of the
Examined a plot of the residuals vs. carat and it is a random scattering of
regression
points.
function
Examined a plot of the residuals vs. predicted values and there is no funnel
Constant error
pattern. Levene test does not reject constant variance. The Box-Cox             Examine the 3
variance
3                          method suggests no transformation is needed with a C.I. for  containing 1.      possible outliers
Examined QQ-plot and histogram of the semi-studentized residuals. Both             more closely
Normality of the
look much better than before. The QQ-plot is showing a little deviation from
i
normality on the right tail.
Examined a plot of the semi-studentized residuals vs. predicted values and
Outliers
it shows 3 points a little above 3.

Step #2 code and output (note that the response vs. predictor variable plot is not meaningful now
since it is not plotting the model with b2carat2)

>   mod.fit2<-lm(formula = price ~ carat + I(carat^2), data = diamonds)
>   summary(mod.fit2)

Call:
lm(formula = price ~ carat + I(carat^2), data = diamonds)

Residuals:
Min       1Q              Median            3Q           Max
-1236.634 -243.633              -6.749       162.173      3514.703

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    -24.35    181.25 -0.134    0.8932
carat         1596.16    641.43   2.488   0.0134 *
I(carat^2)    3988.38    497.75   8.013 2.4e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 582.9 on 305 degrees of freedom
Multiple R-Squared: 0.9112,     Adjusted R-squared: 0.9106
F-statistic: 1565 on 2 and 305 DF, p-value: < 2.2e-16

>   save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE,
boxcox.find = TRUE)
>   save.it2\$lambda.hat
[1] -0.03

7
>   save.it2\$levene
Levene's Test for Homogeneity of Variance
Df F value    Pr(>F)
group   1 63.656 2.996e-14 ***
306
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>         save.it2\$bp

Breusch-Pagan test

data: Y ~ X
BP = 100.9756, df = 1, p-value < 2.2e-16

>   max(abs(save.it2\$semi.stud.resid))
[1] 6.03
Response vs. predictor                                               Residuals vs. predictor                                         Residuals vs. observation number                                    Histogram of semistud. residuals
3000

0.4
3000
2000 4000 6000 8000
Response variable

0.3
Residuals

Residuals

Density
1000

1000

0.2
0

0

0.1
-1000

-1000

0.0
0.2     0.4    0.6      0.8      1.0                                 0.2     0.4    0.6      0.8      1.0                                   0    50    100 150      200   250 300                                   -2        0   2      4       6

Predictor variable                                                   Predictor variable                                                     Observation number                                                 Semistud. residuals

*
Residuals vs. estimated mean response                                                       ei vs. estimated mean response                                                   Normal Q-Q Plot                                            Box-Cox transformation plot
95%

-2700
131
6

6
3000

116
110      120 279
Semistud. residuals

Semistud. residuals
4

294
4

-2900
278

log-Likelihood
Residuals

2
1000

2

-3100
0
0

0
-2
-1000

-3300
-2

1000         3000        5000                                        1000         3000       5000                                          -3   -2     -1     0      1       2   3                            -2        -1       0          1       2

Estimated mean response                                              Estimated mean response                                                   Theoretical Quantiles

Step #3 code and output (note that the response vs. predictor variable plot is not meaningful now
since it is not plotting the model with b2carat2)

>       mod.fit3<-lm(formula = log(price) ~ carat + I(carat^2), data = diamonds)
>       summary(mod.fit3)

Call:
lm(formula = log(price) ~ carat + I(carat^2), data = diamonds)

Residuals:
Min        1Q    Median                                                                     3Q                                    Max
-0.451872 -0.088577 -0.004409                                                               0.096854                               0.500447

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.2235    0.0483 108.15    <2e-16 ***
carat          5.4368    0.1709   31.81   <2e-16 ***
I(carat^2)    -2.0501    0.1326 -15.46    <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8
Residual standard error: 0.1553 on 305 degrees of freedom
Multiple R-Squared: 0.9639,     Adjusted R-squared: 0.9636
F-statistic: 4066 on 2 and 305 DF, p-value: < 2.2e-16

>   save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE,
boxcox.find = TRUE)
>   save.it3\$lambda.hat
[1] 0.79

>   save.it3\$levene
Levene's Test for Homogeneity of Variance
Df F value Pr(>F)
group   1   0.001 0.975
306

>             save.it3\$bp

Breusch-Pagan test

data: Y ~ X
BP = 0.1054, df = 1, p-value = 0.7455

>   max(abs(save.it3\$semi.stud.resid))
[1] 3.22

Response vs. predictor                                                         Residuals vs. predictor                                                        Residuals vs. observation number                                   Histogram of semistud. residuals

0.5
9.0

0.4

0.4

0.4
Response variable

0.2

0.2
8.0

Residuals

Residuals

0.3
Density
0.0

0.0

0.2
7.0

-0.4 -0.2

-0.4 -0.2

0.1
6.0

0.0
0.2         0.4         0.6     0.8    1.0                                     0.2         0.4         0.6     0.8         1.0                                       0    50    100 150      200   250 300                             -3     -2        -1   0       1   2   3

Predictor variable                                                             Predictor variable                                                                Observation number                                                  Semistud. residuals

*
Residuals vs. estimated mean response                                                      ei vs. estimated mean response                                                                  Normal Q-Q Plot                                           Box-Cox transformation plot
131
95%
-310
3

3
0.4

110    116
Semistud. residuals

Semistud. residuals
2

2
0.2

-320
log-Likelihood
1
Residuals

1
0.0

0

0

-330
-1

-1
-0.4 -0.2

-340
-2

-2
-3

-3

6.5         7.0     7.5    8.0   8.5                                           6.5         7.0     7.5    8.0        8.5                                       -3   -2     -1     0      1       2   3                           -2          -1            0       1       2

Estimated mean response                                                      Estimated mean response                                                                 Theoretical Quantiles

Suppose a log() transformation of price is done first instead. Without going through all of the R
code and output here, the next step suggested by the residual analysis would be to add carat 2
to the model resulting in the same model as above.

c) (3 points) State the sample model obtained in step 3.

For the rest of this problem, I will be working with the model obtained from the second and
third method in my part b).

9
log(price)  5.2235  5.4368carat  2.0501carat2

d) Suppose you have chosen the correct regression model and all of the model’s assumptions
are satisfied! Using a standard normal distribution approximation for the semi-studentized
residuals, approximately how many observations would you expect to detect as outliers?
Remember that an outlier is roughly defined as a semi-studentized residual outside of the –3
and +3 bounds. Note that you do not need anything from the data set other than n=308 to

Let Z be a standard normal random variable. Then P(Z<-3 or Z>3) = 0.0027. Using the
binomial distribution, let a “success” be defined as a semi-studentized residual outside of the
-3 and +3 bounds. The mean for a binomial random variable is =n. In this case, n=308,
=0.0027, and =3080.0027 = 0.8316. Thus, one would expect to find about 1 semi-
studentized residual outside of the -3 and +3 bounds.

One could go further to answer the question. Using the rule of thumb for the number of
standard deviations all data lies from its mean, a binomial random variable should be observed
between 3 where = n(1 ) . Thus, one would expect the number of semi-studentized
residuals outside of -3 and +3 to be 0.83163 308  0.0027(1 0.0027) = 0.8316  2.7321 =
(-1.90, 3.56). Notice that there were 3 semi-studentized residuals outside of the –3 and +3
bounds. Thus, this is a likely occurrence and we should not be alarmed by seeing these
outliers.

e) How well is the model doing in predicting price? Explain your answer using R 2.

R2 = 0.9639 from output in part b).

The model is doing a very good job at predicting price since the R2 is close to 1.

f) Suppose two people are about to get engaged and they plan to purchase a diamond
engagement ring. You offer your statistical expertise to aid them in pricing diamonds.
Suppose the couple is interested in a 0.5 carat diamond. Using the resulting model stated in
c), what is the estimated price of a 0.5 carat diamond? Use a point estimate of price to answer
the question. How to calculate a prediction interval will be discussed later in the course.

The sample regression model is log(price) = 5.2235 + 5.4368Carat – 2.0502Carat2. Thus, the
estimate for price is

exp(5.2235 + 5.4368Carat – 2.0502Carat2) = exp[5.2235 + 5.4368(0.5) – 2.0502(0.5)2]
= \$1,684.79

>   exp(predict(object = mod.fit3, newdata = data.frame(carat = 0.5)))
[1] 1684.787

10
11

```
To top