Docstoc

Project _2

Document Sample
Project _2 Powered By Docstoc
					Old project for STAT 870

What factors help determine a diamond’s price? The purpose of this project is to determine which
factors and to predict price. Below is a partial listing of a data set that contains the price, carats,
color, clarity, and certification body of 308 round cut diamonds sampled from a publication in 2000.

                               Carat Color Clarity Certify         Price
                                0.3   D     VS2     GIA           $745.92
                                0.3   E     VS1     GIA           $865.08
                                0.3   G    VVS1     GIA           $865.08
                                0.3   G     VS1     GIA           $721.86
                               0.31   D     VS1     GIA           $940.13
                                                                 
                               1.09    I   VVS2 HRD              $5,217.42

For a description of carats, color, and clarity, see the diamond guide included at the end of this
document or see www.adiamondisforever.com/buy/4cs_print.html. Note that certify denotes the
certification body which determines the carat, clarity, and color. In this data set, color, clarity, and
certification body takes on the following values:

                   Variable      Values
                    Color        D, E, F, G, H, or I
                    Clarity      IF, VVS1, VVS2, VS1, or VS2
                                 Gemmological Institute of America (GIA),
                     Certify     International Gemmological Institute (IGI), or
                                 Hoge Raad Voor Diamant (HRD)

The data is stored in an Excel file called diamond_prices.xls which can be downloaded from the
course website. Note that the data is located in “Set1” of the Excel file.

Complete the following problems below. Within each part, include your R program output with code
inside of it and any additional information needed to explain your answer. You may need to edit your
output and code in order to make it look nice after you copy and paste it into your Word document.
Use =0.05 for all hypothesis tests, confidence intervals, and prediction intervals.
1) Please use X = carat and Y = Price only.
    a) Perform an initial check of X and Y as described in Section 3.1 of the notes.

       > #This is another way to read in the data, which is different from the read.xls()
           function
       > library(RODBC)
       > z<-odbcConnectExcel("C:\\chris\\UNL\\STAT870\\projects\\fall2006\\
                              diamond_prices.xls")
       > diamonds<-sqlFetch(z, "Set1")
       > close(z)
       > head(diamonds)
         carat color clarity certify    price
       1 0.30       D    VS2     GIA 745.9184
       2 0.30       E    VS1     GIA 865.0820
       3 0.30       G   VVS1     GIA 865.0820
       4 0.30       G    VS1     GIA 721.8565
       5 0.31       D    VS1     GIA 940.1322
                                                                                                           1
        6    0.31         E       VS1                                         GIA 890.8626

        >    mod.fit<-lm(formula = price ~ carat, data = diamonds)
        >    save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE,
                       boxcox.find = TRUE)
        >    save.it$sum.data
                Y                 X
          Min.    : 365.5   Min.    :0.1800
          1st Qu.: 931.0    1st Qu.:0.3500
          Median :2414.8    Median :0.6200
          Mean    :2875.7   Mean    :0.6309
          3rd Qu.:4265.8    3rd Qu.:0.8500
          Max.    :9171.0   Max.    :1.1000
                                                                                    Box plot                                               Dot plot




                                                              1.0




                                                                                                                     1.0
                                         Predictor variable




                                                                                                Predictor variable
                                                              0.8




                                                                                                                     0.8
                                                              0.6




                                                                                                                     0.6
                                                              0.4




                                                                                                                     0.4
                                                              0.2




                                                                                                                     0.2
                                                                                    Box plot                                               Dot plot
                                                              2000 4000 6000 8000




                                                                                                                     2000 4000 6000 8000
                                         Response variable




                                                                                                Response variable




        Only the first plot created by examine.mod.simple() is shown here. There are no outlying
        values for price and carat. The range of the x-values is 0.18 to 1.1.

     b) Perform the first three steps of a residual analysis for the simple linear regression model that
        includes Price and Carat as the response and predictor variables, respectively. At each step,
        give written answers in the table format shown below. Adding an additional variable to the
        model and examining the independence of the i will not be examined for this project.

          Residual
Step      Analysis                                                                       Comments                                                        Action to take
       Linearity of the
                          Examined a plot of the residuals vs. carat and it shows a pattern among
         regression
                                the residuals. The pattern looks to be possibly quadratic.
          function
                              Examined a plot of the residuals vs. predicted values and it shows
                                                                                                                                                                      1/3
       Constant error      somewhat of a funnel pattern (variability in residuals is increasing as a                                                        Try a Y
         variance                        ˆ
                            function of Y ). Levene test rejects constant variance. The Box-Cox                                                         transformation as
 1                                                                 1/3
                                            method suggests a Y transformation.                                                                       suggested by the Box-
                             Examined QQ-plot and histogram of semi-studentized residuals and                                                              Cox method
         Normality of
                          they show that the normality assumption may be violated. Both show a
            the i
                                       large departure from normality on the right side.
                           Examined a plot of the semi-studentized residuals vs. predicted values
            Outliers
                                     and it shows 6 or more outliers. The largest is 5.90.
                              Examined a plot of the residuals vs. carat and it shows possibly a
       Linearity of the       pattern among the residuals (less than what was found in step 1).                                                                   2
                                                                                                                                                      Add a Carat term to
 2       regression       There seems to be more variability in the positive residual area as carat
                                                                                                                                                           the model
          function        increases. Given the plots seen in step #1 (and simply Y vs. X), it may
                                   be of interest to try a quadratic transformation for carat.


                                                                                                                                                                            2
           Residual
Step       Analysis                                        Comments                                   Action to take
                          Examined a plot of the residuals vs. predicted values and there is not
                            as much of a funnel pattern. In this case, there seems to be more
       Constant error                                                     ˆ
         variance          variability in the positive residual area as Y increases. Levene test
                            rejects constant variance. Possibly a different transformation of Y
                                                     should be investigated.
                           Examined QQ-plot and histogram of the semi-studentized residuals.
        Normality of
                          The plots look better than before, but the same type of problem exists
           the i
                                                         (not as severe).
                          Examined a plot of the semi-studentized residuals vs. predicted values
           Outliers
                            and it shows 3 observations a little larger than 3 (but less than 4).
       Linearity of the
                            Examined a plot of the residuals vs. carat and it shows possibly a
         regression
                          pattern among the residuals very similar to what was found in step #2
          function
                            Examined a plot of the residuals vs. predicted values and a similar
       Constant error       pattern as in step #2 is found again. Levene test rejects constant
                                                                                                       Try a different
 3       variance              variance. Possibly a different transformation of Y should be
                                                                                                    transformation of Y
                                                        investigated.
        Normality of       Examined QQ-plot and histogram of the semi-studentized residuals.
           the i                     The plots are similar to what was found in #2.
                          Examined a plot of the semi-studentized residuals vs. predicted values.
           Outliers
                            Similar numbers of outliers (and locations) as in step #2 are found.

       >   ##############################
       >   #STEP #1
       >   mod.fit<-lm(formula = price ~ carat, data = diamonds)
       >   summary(mod.fit)
       Call:
       lm(formula = price ~ carat, data = diamonds)

       Residuals:
            Min      1Q             Median          3Q         Max
       -1297.47 -346.19             -66.53      249.30     3776.27

       Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
       (Intercept) -1316.73      90.82 -14.50    <2e-16 ***
       carat         6645.02    131.83   50.41   <2e-16 ***
       ---
       Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       Residual standard error: 640.3 on 306 degrees of freedom
       Multiple R-Squared: 0.8925,     Adjusted R-squared: 0.8922
       F-statistic: 2541 on 1 and 306 DF, p-value: < 2.2e-16

       >   save.it<-examine.mod.simple(mod.fit.obj = mod.fit, const.var.test = TRUE,
                    boxcox.find = TRUE)
       >   save.it$lambda.hat
       [1] 0.33
       >   save.it$levene
       Levene's Test for Homogeneity of Variance
              Df F value    Pr(>F)
       group    1 26.852 3.995e-07 ***
             306
       Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       >   max(abs(save.it$semi.stud.resid))
       [1] 5.9


                                                                                                                          3
                                                Response vs. predictor                                                  Residuals vs. predictor                                              Residuals vs. observation number                                                      Histogram of semistud. residuals
                    2000 4000 6000 8000




                                                                                                          3000




                                                                                                                                                                                     3000




                                                                                                                                                                                                                                                         0.4
Response variable




                                                                                                                                                                                                                                                         0.3
                                                                                    Residuals




                                                                                                                                                               Residuals




                                                                                                                                                                                                                                        Density
                                                                                                          1000




                                                                                                                                                                                     1000




                                                                                                                                                                                                                                                         0.2
                                                                                                          0




                                                                                                                                                                                     0




                                                                                                                                                                                                                                                         0.1
                                                                                                          -1000




                                                                                                                                                                                     -1000




                                                                                                                                                                                                                                                         0.0
                                          0.2        0.4    0.6      0.8      1.0                                 0.2     0.4       0.6      0.8      1.0                                     0    50    100 150      200   250 300                                                     -2        0       2        4   6

                                                       Predictor variable                                                      Predictor variable                                                       Observation number                                                                   Semistud. residuals


                                                                                                                   *
                    Residuals vs. estimated mean response                                                         ei vs. estimated mean response                                                        Normal Q-Q Plot                                                              Box-Cox transformation plot
                                                                                                                                                                                                                                                                                    95%
                                                                                                          6




                                                                                                                                                                                     6
                                                                                                                                                    116 131




                                                                                                                                                                                                                                                         -3400 -3200 -3000 -2800
                    3000




                                                                                                                                                      279
                                                                                    Semistud. residuals




                                                                                                                                                               Semistud. residuals
                                                                                                          4



                                                                                                                                                    278 294




                                                                                                                                                                                     4
                                                                                                                                            110




                                                                                                                                                                                                                                        log-Likelihood
Residuals




                                                                                                          2
                    1000




                                                                                                                                                                                     2
                                                                                                          0
                    0




                                                                                                                                                                                     0
                    -1000




                                                                                                          -2




                                                                                                                                                                                     -2
                                          0     1000         3000           5000                                  0     1000         3000           5000                                      -3   -2     -1     0      1       2   3                                              -2        -1       0        1       2

                                                    Estimated mean response                                              Estimated mean response                                                        Theoretical Quantiles




                                                >          ##############################
                                                >          #STEP #2
                                                >
                                                >          mod.fit2<-lm(formula = I(price^(1/3)) ~ carat, data = diamonds)
                                                >          summary(mod.fit2)

                                                Call:
                                                lm(formula = I(price^(1/3)) ~ carat, data = diamonds)

                                                Residuals:
                                                     Min       1Q                                             Median                           3Q                 Max
                                                -1.59498 -0.44873                                            0.01354                      0.38392             2.95759

                                                Coefficients:
                                                            Estimate Std. Error t value Pr(>|t|)
                                                (Intercept)    5.7922    0.1078   53.75   <2e-16 ***
                                                carat         12.0614    0.1564   77.11   <2e-16 ***
                                                ---
                                                Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                                Residual standard error: 0.7597 on 306 degrees of freedom
                                                Multiple R-Squared: 0.951,      Adjusted R-squared: 0.9509
                                                F-statistic: 5945 on 1 and 306 DF, p-value: < 2.2e-16

                                                >   save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE,
                                                                                 boxcox.find = TRUE)
                                                >   save.it2$lambda.hat
                                                [1] 0.98

                                                >   save.it2$levene
                                                Levene's Test for Homogeneity of Variance
                                                       Df F value   Pr(>F)
                                                group   1   17.31 4.13e-05 ***
                                                      306
                                                                                                                                                                                                                                                                                                                           4
                                              ---
                                              Signif. codes:                         0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                              >   max(abs(save.it2$semi.stud.resid))
                                              [1] 3.89
                                              Response vs. predictor                                                          Residuals vs. predictor                                            Residuals vs. observation number                                    Histogram of semistud. residuals




                                                                                                                                                                                                                                                             0.5
                                                                                                           3




                                                                                                                                                                                            3
                    10 12 14 16 18 20




                                                                                                                                                                                                                                                             0.4
Response variable




                                                                                                           2




                                                                                                                                                                                            2
                                                                                     Residuals




                                                                                                                                                                      Residuals




                                                                                                                                                                                                                                                             0.3
                                                                                                                                                                                                                                            Density
                                                                                                           1




                                                                                                                                                                                            1




                                                                                                                                                                                                                                                             0.2
                                                                                                           0




                                                                                                                                                                                            0




                                                                                                                                                                                                                                                             0.1
                                                                                                           -1




                                                                                                                                                                                            -1
                    8




                                                                                                                                                                                                                                                             0.0
                                        0.2        0.4        0.6        0.8   1.0                                    0.2       0.4        0.6        0.8    1.0                                  0    50    100 150      200   250 300                                   -2   -1   0       1   2       3   4

                                                       Predictor variable                                                           Predictor variable                                                      Observation number                                                  Semistud. residuals


                                                                                                                          *
                    Residuals vs. estimated mean response                                                             ei vs. estimated mean response                                                        Normal Q-Q Plot                                            Box-Cox transformation plot
                                                                                                                                                                                                                                                                      95%
                                                                                                           4




                                                                                                                                                                                            4
                    3




                                                                                                                                                                131
                                                                                                                                                  110        116
                                                                                                           3




                                                                                                                                                                                                                                                             -850
                                                                                                                                                                                            3
                                                                                     Semistud. residuals




                                                                                                                                                                      Semistud. residuals
                    2




                                                                                                           2




                                                                                                                                                                                                                                            log-Likelihood
                                                                                                                                                                                            2
Residuals




                                                                                                           1
                    1




                                                                                                                                                                                            1




                                                                                                                                                                                                                                                             -950
                                                                                                           0
                    0




                                                                                                                                                                                            0
                                                                                                           -3 -2 -1




                                                                                                                                                                                            -1
                    -1




                                                                                                                                                                                                                                                             -1050
                                                                                                                                                                                            -2


                                        8         10     12         14    16   18                                     8        10     12         14     16   18                                   -3   -2     -1     0      1       2   3                            -2        -1       0           1       2

                                                  Estimated mean response                                                      Estimated mean response                                                      Theoretical Quantiles




                                              >           #############################
                                              >           #STEP #3

                                              >           mod.fit3<-lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds)
                                              >           summary(mod.fit3)

                                              Call:
                                              lm(formula = I(price^(1/3)) ~ carat + I(carat^2), data = diamonds)

                                              Residuals:
                                                    Min        1Q                                                        Median                               3Q                                  Max
                                              -1.696707 -0.427723                                                      0.002276                         0.397153                             3.161320

                                              Coefficients:
                                                          Estimate Std. Error t value Pr(>|t|)
                                              (Intercept)    4.7856    0.2276   21.03 < 2e-16 ***
                                              carat         15.9938    0.8054   19.86 < 2e-16 ***
                                              I(carat^2)    -3.1064    0.6250   -4.97 1.12e-06 ***
                                              ---
                                              Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                              Residual standard error: 0.7319 on 305 degrees of freedom
                                              Multiple R-Squared: 0.9547,     Adjusted R-squared: 0.9544
                                              F-statistic: 3215 on 2 and 305 DF, p-value: < 2.2e-16

                                              >  #note that the response vs. predictor variable plot will not be meaningful here.
                                              >   save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE,
                                                                               boxcox.find = TRUE)
                                              >   save.it3$lambda.hat
                                              [1] -0.08


                                                                                                                                                                                                                                                                                                                5
                                              >   save.it3$levene
                                              Levene's Test for Homogeneity of Variance
                                                     Df F value    Pr(>F)
                                              group   1 17.075 4.644e-05 ***
                                                    306
                                              ---
                                              Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                              >   max(abs(save.it3$semi.stud.resid))
                                              [1] 4.32

                                              Note that the response vs. predictor variable plot is not meaningful now since it is not plotting
                                              the model with b2carat2
                                              Response vs. predictor                                                    Residuals vs. predictor                                                  Residuals vs. observation number                                   Histogram of semistud. residuals
                                                                                                             3




                                                                                                                                                                                            3
                    10 12 14 16 18 20




                                                                                                                                                                                                                                                             0.4
Response variable




                                                                                                             2




                                                                                                                                                                                            2




                                                                                                                                                                                                                                                             0.3
                                                                                       Residuals




                                                                                                                                                                      Residuals




                                                                                                                                                                                                                                            Density
                                                                                                             1




                                                                                                                                                                                            1




                                                                                                                                                                                                                                                             0.2
                                                                                                             0




                                                                                                                                                                                            0




                                                                                                                                                                                                                                                             0.1
                                                                                                             -1




                                                                                                                                                                                            -1
                    8




                                                                                                                                                                                                                                                             0.0
                                        0.2     0.4       0.6        0.8        1.0                               0.2     0.4       0.6        0.8        1.0                                     0    50    100 150      200   250 300                                  -2   -1    0   1    2       3   4

                                                     Predictor variable                                                        Predictor variable                                                           Observation number                                                 Semistud. residuals


                                                                                                                   *
                    Residuals vs. estimated mean response                                                         ei vs. estimated mean response                                                            Normal Q-Q Plot                                           Box-Cox transformation plot
                                                                                                                                                                                                                                                                     95%
                                                                                                                                                      116       131
                    3




                                                                                                             4




                                                                                                                                                                                            4




                                                                                                                                          110
                                                                                                                                                            279




                                                                                                                                                                                                                                                             -780
                                                                                       Semistud. residuals




                                                                                                                                                                      Semistud. residuals

                                                                                                                                                                                            3
                    2




                                                                                                                                                          278




                                                                                                                                                                                                                                            log-Likelihood
                                                                                                             2




                                                                                                                                                                                            2
Residuals

                    1




                                                                                                                                                                                            1




                                                                                                                                                                                                                                                             -820
                                                                                                             0
                    0




                                                                                                                                                                                            0
                                                                                                                                                                                            -1
                    -1




                                                                                                             -2




                                                                                                                                                                                                                                                             -860
                                                                                                                                                                                            -2




                                         8      10      12      14         16     18                               8      10      12      14         16     18                                    -3   -2     -1     0      1       2   3                           -2         -1       0        1           2

                                               Estimated mean response                                                   Estimated mean response                                                            Theoretical Quantiles




                                                                                                                                                                                                                                                                                                                 6
Suppose a carat2 is added first to the model instead.

           Residual
Step       Analysis                                         Comments                                         Action to take
        Linearity of the
                            Examined a plot of the residuals vs. carat and it shows a pattern of more
          regression
                                        variability in the residuals as carat increases.
           function
                            Examined a plot of the residuals vs. predicted values and there is a funnel
        Constant error     pattern. Also, Levene test rejects constant variance. The Box-Cox method
          variance                                                               ˆ
                                               suggests a log() transformation (  = -0.03).
 2                                                                                                            Use log(price)
                            Examined QQ-plot and histogram of the semi-studentized residuals. The
       Normality of the          QQ-plot shows strong deviations from normality on the tails of the
             i            distribution (right side is definitely showing large deviations). The histogram
                                      is giving similar evidence, but not necessarily as obvious.
                           Examined a plot of the semi-studentized residuals vs. predicted values and
            Outliers
                                            it shows 7 or more outliers. The largest is 6.03
        Linearity of the
                            Examined a plot of the residuals vs. carat and it is a random scattering of
          regression
                                                             points.
           function
                            Examined a plot of the residuals vs. predicted values and there is no funnel
        Constant error
                              pattern. Levene test does not reject constant variance. The Box-Cox             Examine the 3
          variance
 3                          method suggests no transformation is needed with a C.I. for  containing 1.      possible outliers
                            Examined QQ-plot and histogram of the semi-studentized residuals. Both             more closely
       Normality of the
                           look much better than before. The QQ-plot is showing a little deviation from
             i
                                                      normality on the right tail.
                            Examined a plot of the semi-studentized residuals vs. predicted values and
            Outliers
                                                 it shows 3 points a little above 3.



     Step #2 code and output (note that the response vs. predictor variable plot is not meaningful now
     since it is not plotting the model with b2carat2)

        >   mod.fit2<-lm(formula = price ~ carat + I(carat^2), data = diamonds)
        >   summary(mod.fit2)

        Call:
        lm(formula = price ~ carat + I(carat^2), data = diamonds)

        Residuals:
              Min       1Q              Median            3Q           Max
        -1236.634 -243.633              -6.749       162.173      3514.703

        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept)    -24.35    181.25 -0.134    0.8932
        carat         1596.16    641.43   2.488   0.0134 *
        I(carat^2)    3988.38    497.75   8.013 2.4e-14 ***
        ---
        Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Residual standard error: 582.9 on 305 degrees of freedom
        Multiple R-Squared: 0.9112,     Adjusted R-squared: 0.9106
        F-statistic: 1565 on 2 and 305 DF, p-value: < 2.2e-16

        >   save.it2<-examine.mod.simple(mod.fit.obj = mod.fit2, const.var.test = TRUE,
                                         boxcox.find = TRUE)
        >   save.it2$lambda.hat
        [1] -0.03

                                                                                                                               7
                                           >   save.it2$levene
                                           Levene's Test for Homogeneity of Variance
                                                  Df F value    Pr(>F)
                                           group   1 63.656 2.996e-14 ***
                                                 306
                                           ---
                                           Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                           >         save.it2$bp

                                                                Breusch-Pagan test

                                           data: Y ~ X
                                           BP = 100.9756, df = 1, p-value < 2.2e-16

                                           >   max(abs(save.it2$semi.stud.resid))
                                           [1] 6.03
                                                 Response vs. predictor                                               Residuals vs. predictor                                         Residuals vs. observation number                                    Histogram of semistud. residuals
                                                                                                        3000




                                                                                                                                                                                                                                                  0.4
                                                                                                                                                                              3000
                     2000 4000 6000 8000
Response variable




                                                                                                                                                                                                                                                  0.3
                                                                                  Residuals




                                                                                                                                                        Residuals




                                                                                                                                                                                                                                 Density
                                                                                                        1000




                                                                                                                                                                              1000




                                                                                                                                                                                                                                                  0.2
                                                                                                        0




                                                                                                                                                                              0




                                                                                                                                                                                                                                                  0.1
                                                                                                        -1000




                                                                                                                                                                              -1000




                                                                                                                                                                                                                                                  0.0
                                           0.2     0.4    0.6      0.8      1.0                                 0.2     0.4    0.6      0.8      1.0                                   0    50    100 150      200   250 300                                   -2        0   2      4       6

                                                     Predictor variable                                                   Predictor variable                                                     Observation number                                                 Semistud. residuals


                                                                                                                 *
                    Residuals vs. estimated mean response                                                       ei vs. estimated mean response                                                   Normal Q-Q Plot                                            Box-Cox transformation plot
                                                                                                                                                                                                                                                           95%




                                                                                                                                                                                                                                                  -2700
                                                                                                                                                  131
                                                                                                        6




                                                                                                                                                                              6
                     3000




                                                                                                                                               116
                                                                                                                                 110      120 279
                                                                                  Semistud. residuals




                                                                                                                                                        Semistud. residuals
                                                                                                        4




                                                                                                                                               294
                                                                                                                                                                              4




                                                                                                                                                                                                                                                  -2900
                                                                                                                                            278




                                                                                                                                                                                                                                 log-Likelihood
Residuals




                                                                                                        2
                     1000




                                                                                                                                                                              2




                                                                                                                                                                                                                                                  -3100
                                                                                                        0
                     0




                                                                                                                                                                              0
                                                                                                        -2
                     -1000




                                                                                                                                                                                                                                                  -3300
                                                                                                                                                                              -2




                                            1000         3000        5000                                        1000         3000       5000                                          -3   -2     -1     0      1       2   3                            -2        -1       0          1       2

                                                  Estimated mean response                                              Estimated mean response                                                   Theoretical Quantiles




                    Step #3 code and output (note that the response vs. predictor variable plot is not meaningful now
                    since it is not plotting the model with b2carat2)

                                           >       mod.fit3<-lm(formula = log(price) ~ carat + I(carat^2), data = diamonds)
                                           >       summary(mod.fit3)

                                           Call:
                                           lm(formula = log(price) ~ carat + I(carat^2), data = diamonds)

                                           Residuals:
                                                 Min        1Q    Median                                                                     3Q                                    Max
                                           -0.451872 -0.088577 -0.004409                                                               0.096854                               0.500447

                                           Coefficients:
                                                       Estimate Std. Error t value Pr(>|t|)
                                           (Intercept)    5.2235    0.0483 108.15    <2e-16 ***
                                           carat          5.4368    0.1709   31.81   <2e-16 ***
                                           I(carat^2)    -2.0501    0.1326 -15.46    <2e-16 ***
                                           ---
                                           Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
                                                                                                                                                                                                                                                                                                    8
                                      Residual standard error: 0.1553 on 305 degrees of freedom
                                      Multiple R-Squared: 0.9639,     Adjusted R-squared: 0.9636
                                      F-statistic: 4066 on 2 and 305 DF, p-value: < 2.2e-16

                                      >   save.it3<-examine.mod.simple(mod.fit.obj = mod.fit3, const.var.test = TRUE,
                                                                       boxcox.find = TRUE)
                                      >   save.it3$lambda.hat
                                      [1] 0.79

                                      >   save.it3$levene
                                      Levene's Test for Homogeneity of Variance
                                             Df F value Pr(>F)
                                      group   1   0.001 0.975
                                            306

                                      >             save.it3$bp

                                                                Breusch-Pagan test

                                      data: Y ~ X
                                      BP = 0.1054, df = 1, p-value = 0.7455

                                      >   max(abs(save.it3$semi.stud.resid))
                                      [1] 3.22

                                      Response vs. predictor                                                         Residuals vs. predictor                                                        Residuals vs. observation number                                   Histogram of semistud. residuals




                                                                                                                                                                                                                                                                0.5
                    9.0




                                                                                                   0.4




                                                                                                                                                                                        0.4




                                                                                                                                                                                                                                                                0.4
Response variable




                                                                                                   0.2




                                                                                                                                                                                        0.2
                    8.0




                                                                             Residuals




                                                                                                                                                                  Residuals




                                                                                                                                                                                                                                                                0.3
                                                                                                                                                                                                                                               Density
                                                                                                   0.0




                                                                                                                                                                                        0.0




                                                                                                                                                                                                                                                                0.2
                    7.0




                                                                                                   -0.4 -0.2




                                                                                                                                                                                        -0.4 -0.2




                                                                                                                                                                                                                                                                0.1
                    6.0




                                                                                                                                                                                                                                                                0.0
                                0.2         0.4         0.6     0.8    1.0                                     0.2         0.4         0.6     0.8         1.0                                       0    50    100 150      200   250 300                             -3     -2        -1   0       1   2   3

                                              Predictor variable                                                             Predictor variable                                                                Observation number                                                  Semistud. residuals


                                                                                                                *
                    Residuals vs. estimated mean response                                                      ei vs. estimated mean response                                                                  Normal Q-Q Plot                                           Box-Cox transformation plot
                                                                                                                                                            131
                                                                                                                                                                                                                                                                        95%
                                                                                                                                                                                                                                                                -310
                                                                                                   3




                                                                                                                                                                                        3
                    0.4




                                                                                                                                                     110    116
                                                                             Semistud. residuals




                                                                                                                                                                  Semistud. residuals
                                                                                                   2




                                                                                                                                                                                        2
                    0.2




                                                                                                                                                                                                                                                                -320
                                                                                                                                                                                                                                               log-Likelihood
                                                                                                   1
Residuals




                                                                                                                                                                                        1
                    0.0




                                                                                                   0




                                                                                                                                                                                        0




                                                                                                                                                                                                                                                                -330
                                                                                                   -1




                                                                                                                                                                                        -1
                    -0.4 -0.2




                                                                                                                                                                                                                                                                -340
                                                                                                   -2




                                                                                                                                                                                        -2
                                                                                                   -3




                                                                                                                                                                                        -3




                                      6.5         7.0     7.5    8.0   8.5                                           6.5         7.0     7.5    8.0        8.5                                       -3   -2     -1     0      1       2   3                           -2          -1            0       1       2

                                          Estimated mean response                                                      Estimated mean response                                                                 Theoretical Quantiles




                                      Suppose a log() transformation of price is done first instead. Without going through all of the R
                                      code and output here, the next step suggested by the residual analysis would be to add carat 2
                                      to the model resulting in the same model as above.

                         c) (3 points) State the sample model obtained in step 3.

                                      For the rest of this problem, I will be working with the model obtained from the second and
                                      third method in my part b).

                                                                                                                                                                                                                                                                                                                     9
   log(price)  5.2235  5.4368carat  2.0501carat2

d) Suppose you have chosen the correct regression model and all of the model’s assumptions
   are satisfied! Using a standard normal distribution approximation for the semi-studentized
   residuals, approximately how many observations would you expect to detect as outliers?
   Remember that an outlier is roughly defined as a semi-studentized residual outside of the –3
   and +3 bounds. Note that you do not need anything from the data set other than n=308 to
   answer this question.

   Let Z be a standard normal random variable. Then P(Z<-3 or Z>3) = 0.0027. Using the
   binomial distribution, let a “success” be defined as a semi-studentized residual outside of the
   -3 and +3 bounds. The mean for a binomial random variable is =n. In this case, n=308,
   =0.0027, and =3080.0027 = 0.8316. Thus, one would expect to find about 1 semi-
   studentized residual outside of the -3 and +3 bounds.

   One could go further to answer the question. Using the rule of thumb for the number of
   standard deviations all data lies from its mean, a binomial random variable should be observed
   between 3 where = n(1 ) . Thus, one would expect the number of semi-studentized
   residuals outside of -3 and +3 to be 0.83163 308  0.0027(1 0.0027) = 0.8316  2.7321 =
   (-1.90, 3.56). Notice that there were 3 semi-studentized residuals outside of the –3 and +3
   bounds. Thus, this is a likely occurrence and we should not be alarmed by seeing these
   outliers.

e) How well is the model doing in predicting price? Explain your answer using R 2.

   R2 = 0.9639 from output in part b).

   The model is doing a very good job at predicting price since the R2 is close to 1.

f) Suppose two people are about to get engaged and they plan to purchase a diamond
   engagement ring. You offer your statistical expertise to aid them in pricing diamonds.
   Suppose the couple is interested in a 0.5 carat diamond. Using the resulting model stated in
   c), what is the estimated price of a 0.5 carat diamond? Use a point estimate of price to answer
   the question. How to calculate a prediction interval will be discussed later in the course.

   The sample regression model is log(price) = 5.2235 + 5.4368Carat – 2.0502Carat2. Thus, the
   estimate for price is

   exp(5.2235 + 5.4368Carat – 2.0502Carat2) = exp[5.2235 + 5.4368(0.5) – 2.0502(0.5)2]
         = $1,684.79

   >   exp(predict(object = mod.fit3, newdata = data.frame(carat = 0.5)))
   [1] 1684.787




                                                                                                     10
11

				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:0
posted:7/9/2011
language:English
pages:11