Posted on 7/30/2012. Public Domain.
Regression
Lecture 9
http://galton.org/

Aims for Today - Regression
0. Repeated measures in SPSS
1. Drawing lines on scatter plots
2. The regression line: Predicting values
3. Correlation
4. Rank-based correlation
5. Break/Handout
6. How tos
7. Examples by Dan: Chile and maybe being hit by a car
8. Group R functions

How To
How to impute missing data is complex. More next term.

Scatter Plot
• Plotting 2 continuous-ish variables
• Exploring their association
• One of the most used and most useful techniques in science.

1. Decide the scale and draw the x axis
2. Decide the scale and draw the y axis
3. Add all the points
4. Label axes
[Figure: scatter plot of Estimated velocity (mph) against Actual velocity (mph), both axes 0 to 40]

There are several ways to make a scatter plot in SPSS. The default shows what appears to be a negative relationship, but the graphs can be improved.

Graphing 3 Variables (London et al., 2007)
[Figure: 4- to 9-year olds; 2 week recall and 10 month recall]
Can you see the 8s?

Lines and Equations
[Figure: Degrees Fahrenheit against Degrees Celsius]
$F = 32 + \frac{9}{5} C$
32 is the intercept
9/5 (or 1.8) is the slope

Other graph formats
[Figure: Height in inches against Age in years]
Is this the right approach?

Fitting a straight line (a linear relationship)
$\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$

Finding the Regression Line
• Very general procedure (easily expanded)
• Simple linear regression
• Easiest way is just to draw a straight line yourself
• A more formal method has some value: $\text{Height}_i = \beta_0 + \beta_1 \text{Age}_i + e_i$ and finding the β0 and β1 which minimize $\sum e_i^2$
• Least squares is also used in the t test and the mean (least absolute value is used for the median)

$y_i = 19.25 - 1.92 x_i + e_i$
[Figure: y variable against x variable with the fitted line; residuals shown with vertical dashed lines]

More on minimizing the squared residuals: $\min \sum e_i^2$
Is least squares regression better than "eyeballing" it?
Are there better formal methods?
Do you need to know the equations for β0 and β1?
Not really.
Would they be worth seeing once? Probably (just look at, don't write):
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
The hat (ˆ) means "an estimate of".

Regressions sometimes used to predict values
[Figure: scatter plot with y axis "from Plasma in n/mols" and x axis "from Saliva in n/mols" (data based on Tytherleigh, 2002)]

Running a regression in R
lm is for Linear Model
$\text{plasma}_i = 512 + 1.01\,\text{saliva}_i + e_i$
$t(18) = 10.24,\ p < .001$
$t^2 = F(1,18) = 105,\ r^2 = .85,\ r^2_{adj} = .85$

r2 or adjusted r2 and r or R
$r^2_{adj} = 1 - \frac{(1 - r^2)(n - 1)}{n - k - 1}$
$r^2_{adj} = 1 - \frac{(1 - .8536)(20 - 1)}{20 - 1 - 1} = .8455$
Shrunken estimate like ω2
$r^2 = \frac{1201248}{1201248 + 206009} = .85$

Assessing the Fit: The Correlation
Equation bit
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
• The top part determines whether r is positive or negative. If xi and yi are on the same side of their means, the product is positive, otherwise negative.
• If as one goes up, the other goes up, positive.

Correlation: Strength of the linear relationship
• Can get to it in several ways.
• The correlation squared is the proportion of shared variance.
• The correlation can range only from -1 to +1.
• Does a correlation between x and y mean x caused y?
• Does a correlation between x and y mean that there is some causal relationship in the network of hypotheses that include x and y?
• Are the most parsimonious ones x -> y and y -> x?

Significance Testing
H0: ρ (rho) = 0
Almost always use two-tailed tests.
You must know the sample size:
r = 0.1 is significant with n = 500 at 5%
r = 0.4 is not significant with n = 20 at 5%
(Cohen sizes: .1 small, .3 medium, .5 large)

Significance Testing and Confidence Intervals
The equations
$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}, \quad F = \frac{r^2 (n - 2)}{1 - r^2}$
with df = n - 2, and df = 1, n - 2
$r' = .5 \ln\left(\frac{1 + r}{1 - r}\right)$
$95\%\,CI_{r'} = r' \pm 1.96 \frac{1}{\sqrt{n - 3}}$
$r = \frac{e^{2r'} - 1}{e^{2r'} + 1}$

Making Confidence Intervals
Several programs on web. http://faculty.vassar.edu/lowry/rho.html
Notice the Normal and Basic Bootstraps give impossible upper bounds.
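The significance-test and Fisher-transformation equations above can be tried out in a few lines of R. This is only a minimal sketch: `r_test()` and `r_ci()` are hypothetical helper names made up for illustration, not functions from the lecture or from any package. The interval is back-transformed with `tanh()`, which is the same as the $(e^{2r'} - 1)/(e^{2r'} + 1)$ step, so its bounds always stay inside -1 and +1.

```r
# Hypothetical helpers (names are assumptions, not from the lecture).

# t test of H0: rho = 0, using t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
r_test <- function(r, n) {
  t_val <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p_val <- 2 * pt(-abs(t_val), df = n - 2)   # two-tailed p value
  c(t = t_val, df = n - 2, p = p_val)
}

# Confidence interval via Fisher's r' = .5 * ln((1 + r) / (1 - r)) = atanh(r)
r_ci <- function(r, n, conf = 0.95) {
  z_crit  <- qnorm(1 - (1 - conf) / 2)       # about 1.96 for a 95% interval
  r_prime <- atanh(r)
  half    <- z_crit / sqrt(n - 3)
  tanh(c(lower = r_prime - half, upper = r_prime + half))  # back to the r scale
}

r_test(0.1, 500)          # small r, large n
r_test(0.4, 20)           # larger r, small n
r_ci(sqrt(0.8536), 20)    # r from the plasma/saliva example, n = 20
```

With a real data set, `cor.test(x, y)` reports the same t test and a Fisher-based interval, so the helpers are only useful for seeing how the equations fit together.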
BCa is very similar to asymptotic methods.

[Figures: scatter plots with r = .03 (p = .82); r = .38 (p = .01); r = .63 (p < .001); r = .95 (p < .001)]
[Figure: scatter plot with an outlier and an influential point] r = .64 (p = .05); without the outlier r = .92; without the influential point r = .38

Assumptions for Significance
• Random sampling
• It must make sense to talk about the response variable (the "DV") as being continuous.
• No weird patterns (or non-linear patterns in general) in the residuals. Variance of the residuals is homoscedastic (i.e., not varying by other variables, which would be heteroscedastic).
• Examination of outliers

What to do if assumptions are not met
(to get the data, install and load mrt, then data(crime) and attach)
[Figure: scatterplot on raw data, Drug offences against Theft offences; labelled point: Regency]
[Figure: scatterplot on logs of data, ln(Drugs + 1) against ln(Thefts + 1); labelled point: Regency]

Rank-based Correlation
• Spearman's rho
• Rank the data and use Pearson's + stuff for ties.
• r = .94 and Spearman's rS = .78.
In SPSS and R just tick a box or change the method.
Doesn't print a confidence interval.
• Same correlation estimate.
• But the CI really does meet appropriate assumptions.

Break Time
• Short break. In 4 groups (mix from different programs)
• Look at the handout that I am about to give you. Discuss how you would report your findings in a scientific journal versus People magazine. Are there any other statistics you would want to do?
• Talk about what you wrote for: Suppose an undergraduate said: "Since it is for looking at differences among means, why is it called an Analysis of Variance?"

HOW TOs

Some Examples
• Chile Heat: To discuss re-expression and what to do with outliers.
• Automobile Accidents: To discuss using theory to guide your statistics.

Episode 425 - 2005 Nov. 9, 2008, "Dangerous Curves"

Are smaller chiles hotter?
• How to measure length and heat.
• Length skewed
[Figure: histograms of Frequency for the raw data (Length in cm) and the transformed data (ln(Length in cm + 2.54))]

Testing Normality
    par(mfrow=c(1,2))
    qqnorm(LENGTH); qqline(LENGTH)
    qqnorm(log(LENGTH+2.54)); qqline(log(LENGTH+2.54))
    par(mfrow=c(1,1))
[Figure: two Normal Q-Q plots, Sample Quantiles against Theoretical Quantiles]

Measuring Heat: Scoville units or the number of chiles?

Raw data
[Figure: Heat in PJs against Length in cm, with fitted lines from lm(HEAT~LENGTH) and lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30]); labelled point: Nu Mex Big Jim]

Transformed data
[Figure: Heat in PJs against Transformed length, with fitted lines from lm(HEAT~lnlength) and lm(HEAT[lnlength<3.4]~lnlength[lnlength<3.4]); labelled point: Nu Mex Big Jim]

Command Summary
    r1 <- lm(HEAT~LENGTH)
    r2 <- lm(HEAT[LENGTH<30]~LENGTH[LENGTH<30])
    r3 <- lm(HEAT ~ log(LENGTH + 2.54))
    plot(r1)
[Figure: the four plot(r1) diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
Nu Mex is hotter than predicted for its length.

What to do with the genetically engineered chile? It depends on the population and purpose.

What is a "linear model"
$[\,y_1\ y_2\ \dots\ y_n\,] = [\,\beta_0\ \beta_1\ \beta_2\ \beta_3\,] \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ x_{11}x_{21} & x_{12}x_{22} & \dots & x_{1n}x_{2n} \end{bmatrix} + [\,e_1\ e_2\ \dots\ e_n\,]$
Y = βX + e
Don't worry if you dislike matrix notation.

Vehicle-Pedestrian Accidents
• What is the relationship between the impact velocity of a vehicle and the throw of a pedestrian?
• A lot is known about how a body should move when hit by a car at a certain velocity.
• Good reason to suggest: $\text{throw}_i = k v_i^2 + e_i$
Dan will glance around to see if anyone looks interested in "why" this equation makes sense, and may skip the next two slides.
Why Theoretical Sense? (frictionless)
On impact the body takes on the velocity of the car, v, at an angle θ above the horizontal.
Vertical velocity: vy = v sin θ
Horizontal velocity: vx = v cos θ
Time in air, t, is related only to vy: t = 2 vy / g, where g is the constant for gravity on Earth, about 10 m/s².
Without friction, vx is constant and thus throw should be:
$\text{throw}_i = v_{xi} \frac{2 v_{yi}}{g} = v \cos\theta \, \frac{2 v \sin\theta}{g} = v^2 \, \frac{2 \cos\theta \sin\theta}{g}$
and if θ is the same for all cars: throw = v² k, where k is a constant.
Thus, $\text{throw}_i = k v_i^2 + e_i$
Simpler: Only 1 unknown (k) to solve for AND it has some empirical meaning.

Wood, Simms & Walsh (2005)
[Figure: Distance (m) against Velocity (kmph); r = 0.82]

Otte's work with crash test dummies
    reglin <- lm(distance ~ speed)
    regpoly <- lm(distance ~ speed + spsq)
    regmodel <- lm(distance ~ spsq - 1)
[Figure: Distance in yards against Speed in mph]
This can be done in SPSS too. Tick no intercept/constant.

Summary
• is p sig?
• is r large?
• is it robust?
• does the plot look right?
• does it make sense?

This week's journal
1. Try help(par)
2. Write an equation in Word
3. Access these data from web:
http://www2.fiu.edu/~dwright/qm4psych/fishstock.dat
http://www2.fiu.edu/~dwright/qm4psych/fishstock.sav
Variables (from EU Fishery Commission, 2007) are: ocean (how much winter low temperature is above freezing in Celsius) and fishstock (> 2cm in thousands per cubic kilometer). What are the correlation and the regression equation? Write a sentence about the results. According to these data, who may have Luck next year?

Group Functions
• Which groups (anybody program before?).
• I will help with functions.
• Think what you would like to be able to do.
• Think how, in words, you could do it.
• Drawing diagrams sometimes helps.

Better graphics on google image with past life regression than other sorts.
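Going back to the three pedestrian-throw models (reglin, regpoly and regmodel), the sketch below shows how they could be run and compared. The crash data are not included in these slides, so the code simulates data from throw = k v² + e. The sample size, the value of `k_true` and the error standard deviation are all arbitrary assumptions, and `spsq` is assumed to be the squared speed, which the slides use but do not define.

```r
# Simulated data only: these are NOT the Wood et al. or Otte data.
set.seed(1)
speed    <- runif(40, min = 10, max = 120)    # assumed range of impact speeds
k_true   <- 0.005                             # arbitrary illustrative constant
distance <- k_true * speed^2 + rnorm(40, sd = 3)

spsq <- speed^2                               # assumption: spsq is speed squared

reglin   <- lm(distance ~ speed)              # straight line with intercept
regpoly  <- lm(distance ~ speed + spsq)       # quadratic with intercept
regmodel <- lm(distance ~ spsq - 1)           # theory-driven: only k, no intercept

coef(regmodel)                    # the single coefficient is the estimate of k
AIC(reglin, regpoly, regmodel)    # one way to compare the three fits
```

The `- 1` in the formula is what removes the intercept, the R equivalent of ticking "no intercept/constant" in SPSS. The R² that `summary()` reports for a no-intercept model is computed differently from the usual one, so it is safer to compare the models with something like AIC or residual plots.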