# Regression by 3t6txH

```									 Regression
Lecture 9

http://galton.org/
Aims for Today - Regression

0.   Repeated measures in SPSS
1.   Drawing lines on scatter plots
2.   The regression line: Predicting values
3.   Correlation
4.   Rank-based correlation
5.   Break/Handout
6.   How tos
7.   Examples by Dan
Chile and maybe being hit by a car
8.   Group R functions
How To
How to impute missing data is complex.
More next term.
Scatter Plot

• Plotting 2 continuous-ish variables
• Exploring their association
• One of the most used and most
useful techniques in science.
1. Decide the scale and draw the x axis
2. Decide the scale and draw the y axis
3. Add all the points
4. Label axes
[Figure: scatter plot of Estimated velocity (mph) against Actual velocity (mph), both axes 0 to 40]
Several ways to make one in SPSS.
Default shows what appears to be a negative relationship,
but the graphs can be improved.
Graphing 3 Variables (London et al., 2007)

4- to 9-year olds
2 week recall
10 month recall
Can you see the 8s?
Lines and Equations

F = 32 + 9/5 C

32 is the intercept
9/5 (or 1.8) is the slope
[Figure: line plot of Degrees Fahrenheit against Degrees Celsius, x axis -20 to 100]
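# A small sketch, not from the original slides: fitting lm() to points that
# lie exactly on the Celsius/Fahrenheit line recovers its intercept and slope.
# The variable names celsius, fahrenheit and line_fit are chosen here.
celsius <- seq(-20, 100, by = 20)
fahrenheit <- 32 + 9/5 * celsius          # the equation from the slide
line_fit <- lm(fahrenheit ~ celsius)
coef(line_fit)                            # intercept 32, slope 1.8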
Other graph formats
[Figure: Height in inches against Age in years]

Is this the right approach?
Fitting a straight line (a linear relationship)
$\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$
Finding the Regression Line

•   Very general procedure (easily expanded)
•   Simple linear regression
•   Easiest way is just to draw a straight line yourself
•   A more formal method has some value

$\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$
and finding the $\beta_0$ and $\beta_1$ which minimize $\sum e_i^2$

• Least Squares is also used in t test and mean
(least absolute value is used for the median)
$y_i = 19.25 - 1.92 x_i + e_i$

[Figure: scatter plot of y variable (0 to 20) against x variable (0 to 10) with the fitted line. Residuals shown with vertical dashed lines.]
More on teaching all of you about fashion
minimizing the squared residuals: min $\sum e_i^2$

Is least squares regression
better than “eyeballing” it?

Are there better formal methods?
Do you need to know the equations for β0 and β1?

Not really

Would they be worth seeing once?

Probably

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

(just look at, don't write)

$\hat{\beta}$ means an estimate of $\beta$
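# A minimal sketch, not from the original slides: the two estimates computed
# directly from the equations and checked against lm().
# The x and y values are made up for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(17.2, 15.9, 13.1, 12.4, 9.8, 7.5, 6.1, 3.6)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)   # by hand
coef(fit)             # from lm(); the two should agree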
Regressions sometimes used to predict values
[Figure: scatter plot with regression line, "from Plasma in n/mols" (y axis) against "from Saliva in n/mols" (x axis); data based on Tytherleigh, 2002]
Running a regression in R

lm is for Linear Model
$\text{plasma}_i = 512 + 1.01\,\text{saliva}_i + e_i$

$t(18) = 10.24,\ p < .001$

$t^2 = F(1, 18) = 105,\ r^2 = .85,\ R^2 = .85$

and shrunken $r^2$ or $R^2$:

$$1 - \frac{(1 - r^2)(n - 1)}{n - k - 1} = 1 - \frac{(1 - .8536)(20 - 1)}{20 - 1 - 1} = .8455$$

Shrunken estimate like $\omega^2$

$$r^2 = \frac{1201248}{1201248 + 206009} = .85$$
Assessing the Fit: The Correlation
Equation bit

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

• Top part determines whether r is positive or negative. If xi and yi are on the
same side of their means, the product is positive, otherwise negative.
• If as one goes up, the other goes up, positive.
Correlation: Strength of the linear relationship

• Can get to it in several ways.
• The correlation squared is the proportion of shared variance.
• The correlation can range only from -1 to +1.

• Does a correlation between x and y mean x caused y?
• Does a correlation between x and y mean that there is some causal
relationship in the network of hypotheses that include x and y?
• Are the most parsimonious ones x -> y and y -> x?
Significance Testing

H0: ρ (rho) = 0
Almost always use two tailed tests

You must know the sample size
r = 0.1 is significant with n=500 at 5%
r = 0.4 is not significant with n=20 at 5%

(Cohen sizes: .1 small, .3 medium, .5 large)
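# A sketch checking the two examples above with the usual two tailed t test
# for a correlation (df = n - 2). The helper name p_for_r is made up here.
p_for_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)        # two tailed p value
}
p_small_r_big_n <- p_for_r(0.1, 500)      # below .05
p_big_r_small_n <- p_for_r(0.4, 20)       # above .05
c(p_small_r_big_n, p_big_r_small_n)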
Significance Testing and Confidence Intervals
The equations

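$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}, \qquad F = \frac{r^2 (n - 2)}{1 - r^2}$$
with df = n - 2, and df = 1, n - 2

$$r' = .5 \ln\left(\frac{1 + r}{1 - r}\right)$$
$$95\%\ \text{CI}_{r'} = r' \pm 1.96 \frac{1}{\sqrt{n - 3}}$$
$$r = \frac{e^{2r'} - 1}{e^{2r'} + 1}$$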
Making Confidence Intervals
Several programs on web. http://faculty.vassar.edu/lowry/rho.html
Notice the Normal and Basic
Bootstraps give impossible
upper bounds.

BCa very similar to asymptotic methods
r = .03 (p=.82)
r = .38 (p = .01)
r = .63 (p < .001)
r = .95 (p < .001)
[Figure: scatter plot with an influential point and an outlier marked]
r = .64 (p = .05), w/o outlier r = .92, w/o influential point r = .38
Assumptions for Significance

• Random sampling
• It must make sense to talk about the response variable (the “DV”) as
being continuous.
• No weird patterns (or non-linearity in general) in residuals. Variance of
residuals homoscedastic (i.e., not varying with other variables, which would be
heteroscedastic).
• Examination of outliers
What to do if assumptions not met
(to get data, install and load mrt. data(crime) and attach)

[Figure, two panels, Regency labelled in both.
Left: Scatterplot on raw data, Drug offences against Theft offences.
Right: Scatterplot on logs of data, ln(Drugs + 1) against ln(Thefts + 1).]
Rank-based Correlation

• Spearman's rho
• Rank the data and use Pearson's + stuff for ties.
• r = .94 and Spearman's rS = .78.
In SPSS and R just tick a box or change the method
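# A sketch of "rank the data and use Pearson's" on made-up data: changing the
# method in cor() gives the same value as Pearson's r on the ranks.
# rank() gives tied values their average rank.
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
y <- c(2, 7, 1, 8, 2, 8, 1, 8)
rho_method <- cor(x, y, method = "spearman")
rho_ranks <- cor(rank(x), rank(y))        # Pearson's r on the ranks
c(rho_method, rho_ranks)                  # the two agree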

Doesn't print confidence interval
• Same correlation estimate.
• But the CI really does meet appropriate assumptions.
Break Time

• Short break.

In 4 groups (mix from different programs)

• Look at the handout that I am about to give you. Discuss how you
would report your findings in a scientific journal versus People
magazine. Are there any other statistics you would want to do?
"Since it is for looking at differences among means, why is it called an
Analysis of Variance?"
HOW TOs
Some Examples

• Chile Heat: To discuss re-expression and what to do with outliers.
• Automobile Accidents: To discuss using theory to guide your statistics.

Episode 425 - 2005 Nov. 9, 2008, "Dangerous Curves"
Are smaller chiles hotter?

• How to measure length and heat.
• Length skewed

[Figure, two histograms of Frequency.
Left: Raw data, Length in cm.
Right: Transformed data, ln (Length in cm + 2.54).]
Testing Normality
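par(mfrow = c(1, 2))                                     # two plots side by side
qqnorm(LENGTH); qqline(LENGTH)                           # raw lengths
qqnorm(log(LENGTH + 2.54)); qqline(log(LENGTH + 2.54))   # transformed lengths
par(mfrow = c(1, 1))                                     # back to one plot per page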

[Figure, two Normal Q-Q Plots of Sample Quantiles against Theoretical Quantiles: raw lengths (left) and transformed lengths (right)]
Measuring Heat: Scoville units or the number of chiles?
Raw data

[Figure: Heat in PJs against Length in cm, with Nu Mex Big Jim labelled.
Fitted lines: lm(HEAT~LENGTH) and lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30])]
Transformed data

[Figure: Heat in PJs against Transformed length, with Nu Mex Big Jim labelled.
Fitted lines: lm(HEAT~lnlength) and lm(HEAT[lnlength<3.4]~lnlength[lnlength<3.4])]
Command Summary

r1 <- lm(HEAT~LENGTH)
r2 <- lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30])
r3 <- lm(HEAT ~ log(LENGTH + 2.54))
plot(r1)

[Figure: the four panels from plot(r1): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance). Labelled points include 25, 43 and 52.]

Nu Mex is hotter than predicted for its length
What to do with

Genetically
engineered.

Depends on the
population and
purpose.
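What is a "linear model"

$$
\begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}
=
\begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
x_{11}x_{21} & x_{12}x_{22} & \cdots & x_{1n}x_{2n}
\end{bmatrix}
+
\begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}
$$

Y = βX + e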

Don't worry if you dislike matrix notation
Vehicle-Pedestrian Accidents

• What is the relationship between the impact velocity of a vehicle and
the throw of a pedestrian?
• A lot is known about how a body should move when hit by a car at a
certain velocity.
• Good reason to suggest: $\text{throw}_i = k v_i^2 + e_i$

Dan will glance around to see if anyone looks interested in "why" this
equation makes sense, and may skip the next two slides.
Why Theoretical Sense? (frictionless)

On impact the body takes on the velocity of the car, v, at an angle θ
above the horizontal.

Vertical velocity vy = v sin θ
Horizontal velocity vx = v cos θ

Time in air, t, is related only to vy: t = 2 vy / g, where g is the constant for
gravitational acceleration.
Without friction, vx is constant and thus throw should be:

$$\text{throw}_i = v_{xi} \frac{2 v_{yi}}{g} = v_i \cos\theta \, \frac{2 v_i \sin\theta}{g} = v_i^2 \, \frac{2 \cos\theta \sin\theta}{g}$$
and if θ is the same for all cars: $\text{throw} = v^2 k$, where k is a constant.

Thus, $\text{throw}_i = k v_i^2 + e_i$

Simpler: Only 1 unknown (k) to solve for AND it has some empirical meaning
Wood, Simms & Walsh (2005)
[Figure: scatter plot of Distance (m) against Velocity (kmph), both from 0 to 120; r = 0.82]

Otte's work with crash test dummies
reglin <- lm(distance ~ speed)
regpoly <- lm(distance ~ speed + spsq)
regmodel <- lm(distance ~ spsq - 1)
[Figure: Distance in yards against Speed in mph, with the fitted models]

This can be done in SPSS too. Tick no intercept/constant.
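# A sketch of the no-intercept fit on made-up data (the real data are not
# reproduced here). The speeds, the noise level and k = 0.004 are invented;
# the variable names follow the commands on this slide.
set.seed(1)
speed <- seq(10, 150, by = 10)
spsq <- speed^2                                        # speed squared
distance <- 0.004 * spsq + rnorm(length(speed), sd = 3)
regmodel <- lm(distance ~ spsq - 1)                    # "- 1" drops the intercept
coef(regmodel)                                         # one coefficient: the estimate of k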
Summary

• is p sig?
• is r large?
• is it robust?
• does the plot look right?
• does it make sense?

--------------
This week's journal

1. Try help(par)
2. Write an equation in Word
3. Access these data from web:
http://www2.fiu.edu/~dwright/qm4psych/fishstock.dat
http://www2.fiu.edu/~dwright/qm4psych/fishstock.sav

Variables (from EU Fishery Commission, 2007) are:
ocean (how much winter low temperature is above freezing in Celsius) and
fishstock (> 2cm in thousands per cubic kilometer).
What are the correlation and the regression equation?
Write a sentence about the results.
According to these data, who may have Luck next year?
Group Functions

•   Which groups (has anybody programmed before?).
•   I will help with functions.
•   Think what you would like to be able to do.
•   Think how, in words, you could do it.
•   Drawing diagrams sometimes helps.
Better graphics