                                         Regression II
                                  Analysis of diagnostics

•   Standard diagnostics
•   Bootstrap
•   Cross-validation
                              Standard diagnostics
Before starting to model
1)   Visualisation of data:
    1)   plotting predictors vs observations; these plots may give a clue about the relationship,
    2)   adding smoothers to such plots.
After modelling and fitting
2)    Fitted values vs residuals. It may help to identify outliers and check the correctness
      of the model
3)    Normal QQ plot of residuals. It may help to check the distribution assumptions
4)    Cook’s distance. It may reveal outliers and check the correctness of the model
5)    Model assumptions: the t-tests given by summary of an lm fit
Checking model and designing tests
6)    Cross-validation. If you have a choice of models then cross-validation may help
      to choose the “best” model
7)    Bootstrap. Validity of the model can be checked if the distribution of the statistic of
      interest is available; otherwise these distributions can be generated using the bootstrap
                     Visualisation prior to modelling
Different types of datasets may require different visualisation tools. For simple
      visualisation either plot(data) or pairs(data,panel=panel.smooth) can be used.
      Visualisation prior to modelling may help to propose a model (the form of the
      functional relationship between input and output, the probability distribution of
      the observations, etc.)
For example, take the dataset women, where weights and heights for 15 cases have
      been measured. The plot and pairs commands produce the corresponding plots.
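A minimal sketch of this step (women ships with base R):

data(women)                           # 15 height/weight pairs
plot(women)                           # scatter plot of weight against height
pairs(women, panel = panel.smooth)    # scatter-plot matrix with a lowess smoother overlaid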
                        After modelling: linear models
After modelling, the results should be analysed. For example:

lm1 = lm(weight~height,data=women)

This means that we want a linear model (we believe that the dependence of weight on
      height is linear).
Results can be viewed using print(lm1) and summary(lm1).
The last command will also produce the significance of the various coefficients. Significance
      levels produced by summary should be considered carefully: if there are many
      coefficients then the chance that at least one spuriously “significant” effect is observed is very high.
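For example, the coefficient table with its t-tests can be inspected and extracted as follows:

summary(lm1)          # coefficient table with standard errors, t values and p-values
coef(summary(lm1))    # the same table extracted as a numeric matrix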
                        After modelling: linear models
It is a good idea to plot the data, the fitted model and the differences between fitted
      and observed values on the same graph. For linear models with one predictor it can
      be done as follows.
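A minimal sketch, assuming lm1 = lm(weight~height,data=women) from above:

plot(women$height, women$weight)                   # observed data
abline(lm1)                                        # fitted regression line
segments(women$height, fitted(lm1),
         women$height, women$weight, col = 'red')  # fitted vs observed differences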

This plot already shows some systematic differences. It is an indication that the model
may need to be revised.
           Checking validity of the model: standard tools
Plotting fitted values vs residuals, the normal QQ plot and Cook’s distance can give some
      insight into the model and how to improve it. Most of these plots can be produced
      directly by applying plot to the fitted model.
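A minimal sketch, assuming lm1 fitted above:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in one window
plot(lm1)              # residuals vs fitted, normal QQ, scale-location, residuals vs leverage
plot(lm1, which = 4)   # Cook's distance on its own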
                       Prediction and confidence bands
lm1 = lm(height~weight,data=women)
pp = predict(lm1,interval='p')
pc = predict(lm1,interval='c')

These commands produce two sets
of bands: narrow and wide. The
narrow band is the confidence band
and the wide band is the
prediction band.
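They can be drawn on one plot, for example (this relies on women being sorted by
weight, which it is):

plot(women$weight, women$height)               # observed data
matlines(women$weight, pc, lty = c(1, 2, 2))   # fit with confidence band
matlines(women$weight, pp, lty = c(1, 3, 3))   # fit with prediction band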
                             Bootstrap confidence lines

Similarly, bootstrap confidence lines can be calculated using the functions boot_lm and
flm0, which are available from the course’s website.
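Since boot_lm and flm0 are course-specific, here is a generic sketch of the same idea
using the standard boot package (a case-resampling bootstrap of the regression line):

library(boot)
bcoef = function(d, i) coef(lm(height~weight, data = d[i, ]))  # statistic: coefficients
b = boot(women, bcoef, R = 1000)                               # 1000 bootstrap refits
plot(women$weight, women$height)                               # observed data
apply(b$t[1:50, ], 1,
      function(cf) abline(cf[1], cf[2], col = 'grey'))         # 50 bootstrap lines
abline(lm1)                                                    # original fit on top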
Most of the above indicators show that a quadratic model (quadratic in the predictor,
   not in the parameters) may be better. One obvious way of “improving” the model
   is to assume that the dependence of height on weight is quadratic. It can be
   done within a linear model also: we can fit a model that is polynomial in the predictor

height = β0 + β1*weight + β2*weight^2 + …

We will use the quadratic model:
lm2 = lm(height~weight+I(weight^2),data=women)
Again, summary(lm2) should be inspected.

The default diagnostic plots now look better.
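An equivalent way to specify this model uses the standard poly function; with raw = TRUE
it reproduces the weight + I(weight^2) parameterisation exactly:

lm2 = lm(height ~ poly(weight, 2, raw = TRUE), data = women)   # same fitted values as above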
  Confidence bands computed with the following set of commands look narrower

lm2 = lm(height~weight+I(weight^2),data=women)
pp = predict(lm2,interval='p')
pc = predict(lm2,interval='c')
The spread of the bootstrap confidence lines is also much smaller.
                        Which model is better?
One of the ways of selecting a model is cross-validation. There is no command in R for
  cross-validation of lm models. However, there is the command cv.glm in the boot
  package for glm models (generalised linear models are the subject of the next lecture;
  for now we only need to know that lm and glm with family='gaussian' fit the same
  model). Let us use the default leave-one-out cross-validation:

library(boot)
lm1g = glm(height~weight,data=women,family='gaussian')
cv1.err = cv.glm(women,lm1g)
Results (cv1.err$delta, the raw and adjusted estimates of prediction error):
      0.2572698 0.2538942

women1 = data.frame(h=women$height,w1=women$weight,w2=women$weight^2)
lm2g = glm(h~w1+w2,data=women1,family='gaussian')
cv2.err = cv.glm(women1,lm2g)
Results (cv2.err$delta): 0.007272508 0.007148601

The second model has the much smaller prediction error, so the quadratic model is preferred.
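Note that cv.glm also supports K-fold cross-validation through its K argument, which is
cheaper than leave-one-out on larger datasets:

cv2.err10 = cv.glm(women1, lm2g, K = 10)   # 10-fold cross-validation
cv2.err10$delta                            # raw and adjusted prediction error estimates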
                             Exercise 3

Take the dataset city and analyse it using a linear model. Write a report.
