# Regression II: Analysis of diagnostics

•   Standard diagnostics
•   Bootstrap
•   Cross-validation
## Standard diagnostics

Before starting to model:

1) Visualise the data:
   - Plot predictors vs observations. These plots may give a clue about the relationship and reveal outliers.
   - Apply smoothers.

After modelling and fitting:

2) Fitted values vs residuals: helps to identify outliers and check the correctness of the model.
3) Normal Q-Q plot of residuals: helps to check distributional assumptions.
4) Cook's distance: reveals outliers and checks the correctness of the model.
5) Model assumptions: the t-tests given by the default print of `lm`.

Checking the model and designing tests:

6) Cross-validation: if you have a choice of models, cross-validation may help to choose the "best" one.
7) Bootstrap: the validity of the model can be checked if the distribution of the statistic of interest is available; otherwise such distributions can be generated using the bootstrap.
## Visualisation prior to modelling

Different types of datasets may require different visualisation tools. For simple visualisation either `plot(data)` or `pairs(data, panel = panel.smooth)` can be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.).

For example, consider the built-in dataset `women`, where weights and heights of 15 cases have been measured; the `plot` and `pairs` commands produce scatter plots of these data.
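These commands can be tried directly on the built-in dataset; a minimal sketch:

```r
data(women)                          # built-in dataset: 15 height/weight pairs
str(women)                           # two numeric columns: height (in), weight (lb)
plot(women)                          # simple scatter plot of the two variables
pairs(women, panel = panel.smooth)   # scatter-plot matrix with a smoother overlaid
```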
## After modelling: linear models

After modelling, the results should be analysed. For example:

```r
attach(women)
lm1 = lm(weight ~ height)
```

This means that we want a linear model (we believe that the dependence of weight on height is linear):

weight = β0 + β1·height

The results can be viewed using:

```r
lm1
summary(lm1)
```

The last command also produces significance tests for the individual coefficients. Significance levels produced by `summary` should be considered carefully: if there are many coefficients, the chance of observing at least one "significant" effect by chance alone is very high.
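To make the caution about significance concrete, the coefficient table can be extracted programmatically; a small sketch (refitting `lm1` with an explicit `data` argument):

```r
data(women)
lm1 <- lm(weight ~ height, data = women)
coef(lm1)                            # the estimates of beta0 and beta1
ctab <- summary(lm1)$coefficients    # estimate, std. error, t value, p-value per row
ctab[, "Pr(>|t|)"]                   # the per-coefficient p-values printed by summary
```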
It is a good idea to plot the data, the fitted model, and the differences between the fitted and observed values on the same graph. For linear models with one predictor this can be done using:

```r
plot(height, weight)
abline(lm1)
segments(height, fitted(lm1), height, weight)
```

This plot already shows some systematic differences, which is an indication that the model may need to be revised.
## Checking validity of the model: standard tools

Plotting fitted values vs residuals, the normal Q-Q plot and Cook's distance can give some insight into the model and how to improve it. Some of these plots can be produced with:

```r
plot(lm1)
```
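The same diagnostics can also be computed directly; a sketch (the 4/n cutoff for Cook's distance is a common rule of thumb, not part of the slides):

```r
data(women)
lm1 <- lm(weight ~ height, data = women)
res <- residuals(lm1)
qqnorm(res); qqline(res)             # normal Q-Q plot of the residuals
cd <- cooks.distance(lm1)            # one influence measure per observation
which(cd > 4 / nrow(women))          # flag unusually influential cases (rule of thumb)
```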
## Prediction and confidence bands

```r
lm1 = lm(height ~ weight)
pp = predict(lm1, interval = 'p')
pc = predict(lm1, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')
```

These commands produce two sets of bands, one narrow and one wide: the narrow band is the confidence band and the wide band is the prediction band.
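As a quick sanity check (a sketch using the same fit): the prediction interval includes the noise of a new observation, so it must be wider than the confidence interval at every point.

```r
data(women)
lm1 <- lm(height ~ weight, data = women)
pp <- predict(lm1, interval = "prediction")
pc <- predict(lm1, interval = "confidence")
mean(pp[, "upr"] - pp[, "lwr"])      # average width of the prediction band
mean(pc[, "upr"] - pc[, "lwr"])      # average width of the (narrower) confidence band
```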
## Bootstrap confidence lines

Similarly, bootstrap lines can be calculated using:

```r
boot_lm(women, flm0, 1000)
```

The functions `boot_lm` and `flm0` are available from the course's website.
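The code of `boot_lm` and `flm0` is not reproduced here; as a hedged sketch of what such a bootstrap can look like, case resampling gives a cloud of regression lines (the course's functions may work differently):

```r
set.seed(1)
data(women)
plot(women$weight, women$height)
slopes <- replicate(500, {
  idx <- sample(nrow(women), replace = TRUE)            # resample cases with replacement
  fb <- lm(height ~ weight, data = women[idx, ])
  abline(fb, col = "grey")                              # one bootstrap line
  coef(fb)[2]
})
abline(lm(height ~ weight, data = women), col = "red")  # original fit on top
quantile(slopes, c(0.025, 0.975))                       # bootstrap interval for the slope
```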
Most of the above indicators show that a quadratic model (quadratic in the predictor, not in the parameters) may be better. One obvious way of "improving" the model is to assume that the dependence of height on weight is quadratic. This can still be done within a linear model: we fit a model that is polynomial in the predictor,

height = β0 + β1·weight + β2·weight² + …

```r
lm2 = lm(height ~ weight + I(weight^2))
```

Again the summary of `lm2` should be inspected. The default diagnostic plots now look better.
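The two fits can also be compared formally; a sketch using a nested-model F-test (an addition, not from the slides):

```r
data(women)
lm1 <- lm(height ~ weight, data = women)
lm2 <- lm(height ~ weight + I(weight^2), data = women)
summary(lm2)$coefficients            # includes the t-test of the quadratic term
anova(lm1, lm2)                      # F-test: does the quadratic term improve the fit?
```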
The confidence bands produced by the following set of commands look narrower:

```r
lm2 = lm(height ~ weight + I(weight^2))
pp = predict(lm2, interval = 'p')
pc = predict(lm2, interval = 'c')
plot(weight, height, ylim = range(height, pp))
n1 = order(weight)
matlines(weight[n1], pp[n1, ], lty = c(1, 2, 2), col = 'red')
matlines(weight[n1], pc[n1, ], lty = c(1, 3, 3), col = 'red')
```

The spread of the bootstrap confidence lines is also much smaller.
## Which model is better?

One way of selecting a model is cross-validation. There is no command in R for cross-validation of `lm` models, but there is one for `glm` (generalised linear models, the subject of the next lecture; for now we only need to know that `lm` and `glm` with `family = 'gaussian'` fit the same model). Let us use the default leave-one-out cross-validation from the `boot` package:

```r
library(boot)
lm1g = glm(height ~ weight, data = women, family = 'gaussian')
cv1.err = cv.glm(women, lm1g)
cv1.err$delta
```

Results: 0.2572698 0.2538942

```r
women1 = data.frame(h = women$height, w1 = women$weight, w2 = women$weight^2)
lm2g = glm(h ~ w1 + w2, data = women1, family = 'gaussian')
cv2.err = cv.glm(women1, lm2g)
cv2.err$delta
```

Results: 0.007272508 0.007148601

The second model has the smaller estimated prediction error.
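Leave-one-out can be expensive on larger datasets; `cv.glm` also supports K-fold cross-validation via its `K` argument. A sketch (K = 5 is an arbitrary choice):

```r
library(boot)                        # provides cv.glm
data(women)
set.seed(1)                          # K-fold CV uses random splits
lm2g <- glm(height ~ weight + I(weight^2), data = women, family = "gaussian")
cv.glm(women, lm2g, K = 5)$delta     # raw and adjusted K-fold CV error estimates
```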
## Exercise 3

Take the dataset `city` and analyse it with a linear model. Write a report.
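A possible starting point, assuming `city` is the small dataset shipped with the `boot` package (1920 vs 1930 populations of ten US cities); check `?city` first:

```r
library(boot)                 # the city dataset comes with the boot package
data(city)                    # u = 1920 population, x = 1930 population
lmc <- lm(x ~ u, data = city)
summary(lmc)                  # coefficients and their t-tests
plot(city$u, city$x)
abline(lmc)                   # data with the fitted line
plot(lmc)                     # standard diagnostic plots
```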
