									 Regression
     Lecture 9




http://galton.org/
                   Aims for Today - Regression

0.   Repeated measures in SPSS
1.   Drawing lines on scatter plots
2.   The regression line: Predicting values
3.   Correlation
4.   Rank-based correlation
5.   Break/Handout
6.   How tos
7.   Examples by Dan
        Chile and maybe being hit by a car
8.   Group R functions
How To
How to impute missing data is complex.
More next term.
                               Scatter Plot




• Plotting 2 continuous-ish variables
• Exploring their association
• One of the most used and most
  useful techniques in science.
  1. Decide the scale and draw the x axis
  2. Decide the scale and draw the y axis
  3. Add all the points
  4. Label axes

[Figure: scatter plot of Estimated velocity (mph) against Actual velocity (mph), both axes running from 0 to 40]
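
A minimal sketch of the four steps in R, which follows the same order: fix the axis scales, add the points, label the axes. The speeds are simulated and the names `actual` and `estimated` are made up for illustration; they are not the data behind the slide.

```r
# Simulated speeds; these numbers are invented for illustration only.
set.seed(1)
actual    <- runif(30, min = 5, max = 35)          # actual velocity (mph)
estimated <- actual + rnorm(30, mean = 0, sd = 4)  # estimated velocity (mph)

# xlim and ylim fix the scale of each axis, the first two arguments
# supply the points, and xlab and ylab label the axes.
plot(actual, estimated,
     xlim = c(0, 40), ylim = c(0, 40),
     xlab = "Actual velocity (mph)",
     ylab = "Estimated velocity (mph)")
abline(0, 1, lty = 2)  # dashed reference line where estimate equals actual
```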
Several ways to make one in SPSS.
Default shows what appears to be a negative relationship,
but the graphs can be improved.
       Graphing 3 Variables (London et al., 2007)




4- to 9-year olds
2 week recall
10 month recall
Can you see the 8s?
                                     Lines and Equations

F = 32 + 9/5 C

32 is the intercept
9/5 (or 1.8) is the slope

[Figure: straight line of Degrees Fahrenheit (0 to 200) against Degrees Celsius (-20 to 100)]
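
A short sketch showing that the intercept and slope can be read back from points that lie on the line. The function name `c_to_f` is made up for this example.

```r
# Convert Celsius to Fahrenheit: intercept 32, slope 9/5.
c_to_f <- function(celsius) 32 + (9 / 5) * celsius

celsius    <- seq(-20, 100, by = 20)
fahrenheit <- c_to_f(celsius)

# The points lie exactly on a line, so lm() recovers the
# intercept (32) and the slope (1.8).
fit <- lm(fahrenheit ~ celsius)
coef(fit)

plot(celsius, fahrenheit, type = "b",
     xlab = "Degrees Celsius", ylab = "Degrees Fahrenheit")
```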
Other graph formats

Is this the right approach?

[Figure: Height in inches (20 to 70) plotted against Age in years (5 to 35)]
                   Fitting a straight line (a linear relationship)
                              $\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$
          Finding the Regression Line


•   Very general procedure (easily expanded)
•   Simple linear regression
•   Easiest way is just to draw a straight line yourself
•   A more formal method has some value

           $\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$
    and finding the $\beta_0$ and $\beta_1$ which minimize $\sum e_i^2$

• Least Squares is also used in t test and mean
  (least absolute value is used for the median)
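
A small sketch of the last bullet: over a grid of candidate values, the sum of squared deviations is smallest at the mean and the sum of absolute deviations is smallest at the median. The five values of `y` are invented, and a grid search is used only because it is easy to read.

```r
y <- c(2, 3, 5, 7, 30)  # made-up values with one large observation

sum_sq  <- function(m) sum((y - m)^2)   # least squares criterion
sum_abs <- function(m) sum(abs(y - m))  # least absolute value criterion

# Try every candidate from 0 to 30 in steps of 0.1.
grid     <- seq(0, 30, by = 0.1)
best_sq  <- grid[which.min(sapply(grid, sum_sq))]
best_abs <- grid[which.min(sapply(grid, sum_abs))]

c(least_squares = best_sq,  mean   = mean(y))
c(least_abs     = best_abs, median = median(y))
```

The large value pulls the least squares answer upwards but leaves the least absolute value answer where it was.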
                            $y_i = 19.25 - 1.92 x_i + e_i$

[Figure: scatter plot of y variable (0 to 20) against x variable (0 to 10) with the fitted line; residuals shown with vertical dashed lines]
More on teaching all of you about fashion
minimizing the squared residuals: min $\sum e_i^2$

       Is least squares regression
       better than “eyeballing” it?

    Are there better formal methods?
Do you need to know the equations for β0 and β1?

        Not really

Would they be worth seeing once?

        Probably



        $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$        (just look at, don't write)

        $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

        $\hat{\beta}$ means an estimate of $\beta$
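
A sketch checking the two estimates against `lm`, under the assumption that the usual least squares formulas are the ones meant here (slope = sum of cross-products over sum of squared x deviations, intercept = mean of y minus slope times mean of x). The data are simulated.

```r
set.seed(2)
x <- runif(25, 0, 10)                 # simulated predictor
y <- 3 + 1.5 * x + rnorm(25, sd = 2)  # simulated response

# Slope: cross-products of deviations over squared x deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: the line passes through the point of means.
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = coef(fit))
```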
Regressions sometimes used to predict values
[Figure: scatter plot with regression line; y axis "from Plasma in n/mols" (600 to 1600), x axis "from Saliva in n/mols" (200 to 800); data based on Tytherleigh, 2002]
Running a regression in R




          lm is for Linear Model
    $\text{plasma}_i = 512 + 1.01\,\text{saliva}_i + e_i$

                   $t(18) = 10.24, p < .001$

$t^2 = F(1,18) = 105$, $r^2 = .85$, $r^2_{adj} = .85$
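
A sketch of where these numbers come from in R output. The data are simulated with n = 20 (so the residual df is 18, as on the slide); they are not the Tytherleigh data and the estimates will differ from those above.

```r
set.seed(3)
n      <- 20
saliva <- runif(n, 200, 800)
plasma <- 500 + saliva + rnorm(n, sd = 100)  # simulated, for illustration only

fit <- lm(plasma ~ saliva)  # lm is for Linear Model
s   <- summary(fit)

t_slope <- s$coefficients["saliva", "t value"]  # t for the slope
f_stat  <- unname(s$fstatistic["value"])        # overall F

# With one predictor, the slope's t squared equals the overall F.
c(t = t_slope, t_squared = t_slope^2, F = f_stat,
  r2 = s$r.squared, r2_adj = s$adj.r.squared)
```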
               $r^2$ or adjusted $r^2$
                   and r or R

$r^2_{adj} = 1 - \frac{(1 - r^2)(n - 1)}{n - k - 1}$

$r^2_{adj} = 1 - \frac{(1 - .8536)(20 - 1)}{20 - 1 - 1} = .8455$

        Shrunken estimate like $\omega^2$

$r^2 = \eta^2 = \frac{1201248}{1201248 + 206009} = .85$
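
The arithmetic can be checked in a few lines. The function name `adj_r2` is made up, and the names `ss_model` and `ss_resid` rest on the assumption that the two large numbers are the model and residual sums of squares.

```r
# Adjusted r squared from r squared, sample size n and k predictors.
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_r2(0.8536, n = 20, k = 1)  # about .8455

# Assumed labels: model and residual sums of squares.
ss_model <- 1201248
ss_resid <- 206009
r2 <- ss_model / (ss_model + ss_resid)
round(r2, 4)                   # about .8536
```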
Assessing the Fit: The Correlation
                            Equation bit


               $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$




• Top part determines whether r is positive or negative. If $x_i$ and $y_i$ are on the
  same side of their means, that term is positive; otherwise it is negative.
• If the other goes up as one goes up, r is positive.
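
A sketch of the calculation by hand, using six invented pairs. Two of the pairs fall on opposite sides of their means, so their cross-products are negative, but the positive ones dominate.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

dx <- x - mean(x)  # deviations of x from its mean
dy <- y - mean(y)  # deviations of y from its mean

dx * dy            # cross-products: the sign of each term

r_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
c(by_hand = r_hand, cor = cor(x, y))
```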
       Correlation: Strength of the linear relationship

• Can get to it in several ways.
• The correlation squared is the proportion of shared variance.
• The correlation can range only from -1 to +1.

• Does a correlation between x and y mean x caused y?
• Does a correlation between x and y mean that there is some causal
  relationship in the network of hypotheses that include x and y?
• Are the most parsimonious ones x -> y and y -> x?
                          Significance Testing


H0: ρ (rho) = 0
Almost always use two tailed tests

You must know the sample size
r = 0.1 is significant with n=500 at 5%
r = 0.4 is not significant with n=20 at 5%

(Cohen sizes: .1 small, .3 medium, .5 large)
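
The two examples can be checked with the usual t statistic for a correlation, t = r sqrt(n - 2) / sqrt(1 - r^2) on n - 2 df. The function name `r_p_value` is made up for this sketch.

```r
# Two-tailed p value for H0: rho = 0, given r and the sample size n.
r_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

r_p_value(0.1, 500)  # below .05
r_p_value(0.4, 20)   # above .05
```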
       Significance Testing and Confidence Intervals
                       The equations

$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$,   $F = \frac{r^2 (n-2)}{1-r^2}$
                      with df = n - 2, and df = 1, n - 2

$r' = .5 \ln\left(\frac{1+r}{1-r}\right)$

95% CI for $r'$: $r' \pm 1.96\sqrt{\frac{1}{n-3}}$

$r = \frac{e^{2r'} - 1}{e^{2r'} + 1}$
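
A sketch of the confidence interval steps: transform r to r', add and subtract 1.96 standard errors, and transform the two limits back. The function name `fisher_ci` is made up; `atanh` and `tanh` in base R compute the same two transformations.

```r
fisher_ci <- function(r, n, z = 1.96) {
  r_prime <- 0.5 * log((1 + r) / (1 - r))  # same as atanh(r)
  half    <- z * sqrt(1 / (n - 3))         # half-width on the r' scale
  # Back-transform a value on the r' scale to the r scale (same as tanh).
  back <- function(rp) (exp(2 * rp) - 1) / (exp(2 * rp) + 1)
  c(lower = back(r_prime - half), upper = back(r_prime + half))
}

ci <- fisher_ci(0.4, 20)
ci
```

For r = .4 and n = 20 the interval includes 0, which agrees with the test not being significant at 5%.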
               Making Confidence Intervals
Several programs on web. http://faculty.vassar.edu/lowry/rho.html
                        Notice the Normal and Basic
                        Bootstraps give impossible
                        upper bounds.




BCa very similar to asymptotic methods
r = .03 (p=.82)
r = .38 (p = .01)
r = .63 (p < .001)
r = .95 (p < .001)
[Figure: scatter plot with one point labelled "outlier" and another labelled "influential"]

r = .64 (p = .05), w/o outlier r = .92, w/o influential point r = .38
                    Assumptions for Significance

• Random sampling
• It must make sense to talk about the response variable (the “DV”) as
  being continuous.
• No weird patterns (or non-linearity in general) in the residuals. The variance of
  the residuals should be homoscedastic (i.e., not varying with other variables,
  which would be heteroscedastic).
• Examination of outliers
                                 What to do if assumptions are not met
                      (to get the data, install and load mrt, then data(crime) and attach)

[Figure, left panel: Scatterplot on Raw data, Drug offences (0 to 300) against Theft offences (0 to 3000), with Regency labelled]
[Figure, right panel: Scatterplot on logs of data, ln(Drugs + 1) against ln(Thefts + 1), with Regency labelled]
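
A sketch of the re-expression used in the right-hand panel, ln(count + 1). The counts are simulated because the column names in the crime data are not shown here, so `thefts` and `drugs` are hypothetical names.

```r
set.seed(4)
# Simulated, skewed counts; not the crime data.
thefts <- round(exp(rnorm(40, mean = 5.5, sd = 1)))
drugs  <- round(exp(0.7 * log(thefts + 1) + rnorm(40, sd = 0.5)))

# Adding 1 before taking logs keeps zero counts defined.
ln_thefts <- log(thefts + 1)  # log1p(thefts) gives the same values
ln_drugs  <- log(drugs + 1)

par(mfrow = c(1, 2))
plot(thefts, drugs, xlab = "Theft offences", ylab = "Drug offences",
     main = "Raw data")
plot(ln_thefts, ln_drugs, xlab = "ln(Thefts + 1)", ylab = "ln(Drugs + 1)",
     main = "Logs of data")
par(mfrow = c(1, 1))
```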
                    Rank-based Correlation

• Spearman's rho
• Rank the data and use Pearson's + stuff for ties.
• r = .94 and Spearman's rS = .78.
In SPSS and R just tick a box or change the method




           Doesn't print confidence interval
• Same correlation estimate.
• But the CI really does meet appropriate assumptions.
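
A sketch of "rank the data and use Pearson's" in R. The seven pairs are invented, with one extreme point that inflates Pearson's r but counts only as the largest rank for Spearman's.

```r
x <- c(1, 2, 3, 4, 5, 6, 20)
y <- c(1, 3, 2, 5, 4, 6, 60)

pearson  <- cor(x, y)                       # default method
spearman <- cor(x, y, method = "spearman")  # change the method
ranked   <- cor(rank(x), rank(y))           # Pearson's on the ranks

c(pearson = pearson, spearman = spearman, pearson_on_ranks = ranked)
```

`cor.test(x, y, method = "spearman")` gives the test as well, without a confidence interval.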
                              Break Time

• Short break.

                 In 4 groups (mix from different programs)

• Look at the handout that I am about to give you. Discuss how you
  would report your findings in a scientific journal versus People
  magazine. Are there any other statistics you would want to do?
• Talk about what you wrote for: Suppose an undergraduate said:
  "Since it is for looking at differences among means, why is it called an
  Analysis of Variance?"
HOW TOs
                           Some Examples

• Chile Heat: To discuss re-expression and what to do with outliers.
• Automobile Accidents: To discuss using theory to guide your statistics.





                        Episode 425 - 2005 Nov. 9, 2008, "Dangerous Curves"
                               Are smaller chiles hotter?

     • How to measure length and heat.
     • Length skewed


[Figure: two histograms of Frequency. Raw data: Length in cm (0 to 35). Transformed data: ln (Length in cm + 2.54) (1.5 to 3.5)]
Testing Normality
         par(mfrow=c(1,2))   # two plots side by side
         qqnorm(LENGTH); qqline(LENGTH)   # raw lengths
         qqnorm(log(LENGTH+2.54)); qqline(log(LENGTH+2.54))   # transformed lengths
         par(mfrow=c(1,1))   # back to one plot per page


[Figure: two Normal Q-Q Plots of Sample Quantiles against Theoretical Quantiles, left for LENGTH, right for log(LENGTH+2.54)]
Measuring Heat: Scoville units or the number of chiles?
                               Raw data

[Figure: Heat in PJs against Length in cm, with the lines from lm(HEAT~LENGTH) and lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30]); Nu Mex Big Jim labelled]

                           Transformed data

[Figure: Heat in PJs against Transformed length, with the lines from lm(HEAT~lnlength) and lm(HEAT[lnlength<3.4]~lnlength[lnlength<3.4]); Nu Mex Big Jim labelled]
             Command Summary

r1 <- lm(HEAT~LENGTH)
r2 <- lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30])
r3 <- lm(HEAT ~ log(LENGTH + 2.54))
plot(r1)

[Figure: the four diagnostic plots from plot(r1): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); a few points are labelled by row number]

Nu Mex is hotter than predicted for its length
                 What to do with




Genetically
engineered.

Depends on the
population and
purpose.
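                          What is a "linear model"

$$
[\,y_1 \; y_2 \; \cdots \; y_n\,] = [\,\beta_0 \; \beta_1 \; \beta_2 \; \beta_3\,]
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
x_{11}x_{21} & x_{12}x_{22} & \cdots & x_{1n}x_{2n}
\end{bmatrix}
+ [\,e_1 \; e_2 \; \cdots \; e_n\,]
$$

                                  Y = βX + e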


                Don't worry if you dislike matrix notation
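
For anyone who does like it, a sketch of the same model in R with two predictors and their product. The data are simulated. R stores one row per case, so its X is the transpose of the layout in Y = βX + e, and the estimates come from solving the normal equations.

```r
set.seed(5)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(n)  # simulated response

# Design matrix with columns: intercept, x1, x2 and the product x1:x2.
X <- model.matrix(~ x1 * x2)

# Least squares estimates from the normal equations (X'X) b = X'y.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x1 * x2)
cbind(matrix = drop(beta_hat), lm = coef(fit))
```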
                    Vehicle-Pedestrian Accidents

• What is the relationship between the impact velocity of a vehicle and
  the throw of a pedestrian?
• A lot is known about how a body should move when hit by a car at a
  certain velocity.
• Good reason to suggest: $\text{throw}_i = k v_i^2 + e_i$




   Dan will glance around to see if anyone looks interested in "why" this
   equation makes sense, and may skip the next two slides.
                  Why Theoretical Sense? (frictionless)

On impact, the body takes on the car's horizontal
velocity, v, and is launched at an angle θ
above the horizontal.

Vertical velocity vy = v sin θ
Horizontal velocity vx = v cos θ


   Time in air, t, is related only to $v_y$: $t = 2 v_y / g$, where g is the constant for
   gravity on Earth, about 10 m/s^2.
Without friction, $v_x$ is constant and thus throw should be:

$\text{throw}_i = v_{xi} \frac{2 v_{yi}}{g} = v \cos\theta \, \frac{2 v \sin\theta}{g} = v^2 \, \frac{2 \cos\theta \sin\theta}{g}$

and if θ is the same for all cars: $\text{throw} = v^2 k$, where k is a constant.

Thus, $\text{throw}_i = k v_i^2 + e_i$

Simpler: Only 1 unknown (k) to solve for AND it has some empirical meaning
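
A numerical check of the frictionless argument: with a fixed launch angle, throw divided by v squared is the same constant k at every speed. The angle and speeds are hypothetical and the function name `throw_distance` is made up.

```r
g <- 10  # gravity, about 10 m/s^2

# Throw = horizontal velocity times time in the air.
throw_distance <- function(v, theta) {
  vx    <- v * cos(theta)  # horizontal velocity
  vy    <- v * sin(theta)  # vertical velocity
  t_air <- 2 * vy / g      # time in air
  vx * t_air
}

theta <- pi / 6       # hypothetical launch angle (30 degrees)
v     <- c(5, 10, 20) # hypothetical impact speeds in m/s

k <- 2 * cos(theta) * sin(theta) / g
cbind(v, throw = throw_distance(v, theta), k_v_squared = k * v^2)
```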
Wood, Simms & Walsh (2005)
[Figure: scatter plot of Distance (m) against Velocity (kmph), both axes 0 to 120; r = 0.82]

                 Otte's work with crash test dummies
                    reglin <- lm(distance ~ speed)
                    regpoly <- lm(distance ~ speed + spsq)
                    regmodel <- lm(distance ~ spsq - 1)
[Figure: Distance in yards (0 to 100) against Speed in mph (0 to 150), with the fitted models]

This can be done in SPSS too. Tick no intercept/constant.
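
A runnable sketch of the three models on simulated data, assuming `spsq` is speed squared (it is not defined in the commands above). The `- 1` in the last formula removes the intercept, leaving the single unknown k.

```r
set.seed(6)
# Simulated speeds and throw distances; not the crash test data.
speed    <- runif(40, 10, 120)
distance <- 0.008 * speed^2 + rnorm(40, sd = 5)

spsq <- speed^2  # assumed definition: speed squared

reglin   <- lm(distance ~ speed)         # straight line
regpoly  <- lm(distance ~ speed + spsq)  # quadratic with intercept
regmodel <- lm(distance ~ spsq - 1)      # theory: distance = k * speed^2

coef(regmodel)  # the estimate of k

# Number of coefficients estimated by each model.
sapply(list(reglin = reglin, regpoly = regpoly, regmodel = regmodel),
       function(m) length(coef(m)))
```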
                         Summary


• is p sig?
• is r large?
• is it robust?
• does the plot look right?
• does it make sense?

--------------
                        This week's journal

1. Try help(par)
2. Write an equation in Word
3. Access these data from web:
     http://www2.fiu.edu/~dwright/qm4psych/fishstock.dat
     http://www2.fiu.edu/~dwright/qm4psych/fishstock.sav

Variables (from EU Fishery Commission, 2007) are:
ocean (how much winter low temperature is above freezing in Celsius) and
fishstock (> 2cm in thousands per cubic kilometer).
      What are the correlation and the regression equation?
      Write a sentence about the results.
According to these data, who may have Luck next year?
                   Group Functions

•   Which groups (has anybody programmed before?).
•   I will help with functions.
•   Think about what you would like to be able to do.
•   Think about how, in words, you could do it.
•   Drawing diagrams sometimes helps.
Better graphics
on google image
with past life
regression than
other sorts.

								