Simple Linear Regression
Statistical Reasoning 2, Lecture 4

Section A
Review: The Equation of a Line

The Equation of a Line
Recall, from algebra, there are two values which uniquely define any line:
- y-intercept: where the line crosses the y-axis (when x = 0)
- Slope: the "rise over the run", i.e. how much y changes for each one-unit change in x

$y = mx + b$
- b = y-intercept
- m = slope

The Equation of a Line
Of course statisticians must have their own notation!
$y = b_0 + b_1 x$, with $b_0$ = y-intercept and $b_1$ = slope
$y = \beta_0 + \beta_1 x$, with $\beta_0$ = y-intercept and $\beta_1$ = slope

The Intercept, $\beta_0$
The intercept $\beta_0$ is the value of y when x is 0.
It is the point on the graph where the line crosses the y (vertical) axis, at the coordinate $(0, \beta_0)$.
[Figure: the line $y = \beta_0 + \beta_1 x$ with $\beta_0$ marked on the y-axis]

The Slope, $\beta_1$
The slope $\beta_1$ is the change in y corresponding to a unit increase in x.
Another interpretation: $\beta_1$ is the difference in y-values for x+1 compared to x.
This change/difference is the same across the entire line.
[Figure: the line $y = \beta_0 + \beta_1 x$ with a rise of $\beta_1$ marked for each one-unit step in x]
All information about the difference in the y-value for two differing values of x is contained in the slope!
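Why the slope carries all of this information follows directly from the equation of the line. As a short worked step, take any two x-values that are d units apart (d is only a label introduced here, not notation from the slides):

```latex
\begin{aligned}
y(x + d) - y(x) &= \bigl(\beta_0 + \beta_1 (x + d)\bigr) - \bigl(\beta_0 + \beta_1 x\bigr) \\
                &= \beta_1 d
\end{aligned}
```

The intercept cancels and the starting value x drops out, so the difference in y depends only on the slope and on how far apart the two x-values are.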
For example: two values of x three units apart will have a difference in y-values of $3 \times \beta_1 = 3\beta_1$.
[Figure: three successive one-unit steps in x, each with a rise of $\beta_1$, for a total rise of $3\beta_1$]

The Slope, $\beta_1$
The slope $\beta_1$ is the change in y corresponding to a unit increase in x: $\beta_1$ is the difference in y-values for x+1 compared to x.
- If $\beta_1 = 0$, there is no association (i.e., the values of y are the same regardless of the values of x)
- If $\beta_1 > 0$, there is a positive association (i.e., the values of y increase with increasing values of x)
- If $\beta_1 < 0$, there is a negative association (i.e., the values of y decrease with increasing values of x)

The Equation of a Line
In linear regression situations, points don't fit exactly on a line.
We estimate a line that relates the mean of an outcome y to a predictor x:
$E[y] = \hat{\beta}_0 + \hat{\beta}_1 x$
- $E[y]$ = estimated "expected" (mean) value of y
- $\hat{\beta}_0$ = estimated y-intercept
- $\hat{\beta}_1$ = estimated slope

The Equation of a Line
$\hat{\beta}_0$ and $\hat{\beta}_1$ are called estimated regression coefficients.
These two quantities are estimated using the data.
The line estimated is the line that "fits the data best".
Many times the equation is just written as:
$y = \hat{\beta}_0 + \hat{\beta}_1 x$ or $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
We will see that in a regression context, $\hat{\beta}_1$ is nothing more than the estimated mean difference in y between two groups who differ by one unit in x, i.e. how much the mean of y changes for a one-unit increase in x.

Section B
Linear Regression: Motivating Example

Example: Arm Circumference and Height
Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old
Question: what is the relationship between average arm circumference and height?
Data:
- Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm – 15.6 cm
- Height: mean 61.6 cm, SD 6.3 cm, range 40.9 cm – 73.3 cm

Approach 1: Arm Circumference and Height
Dichotomize height at the median, and compare mean arm circumference with a t-test and 95% CI.
Potential Advantages:
- We know how to do it!
- Gives a single summary measure (sample mean difference) for quantifying the arm circumference/height association
Potential Disadvantages:
- Throws away a lot of information in the height data, which was originally measured as continuous
- Only allows for a single comparison between two crudely defined height categories

Approach 2: Arm Circumference and Height
Categorize height into 4 categories by quartile, and compare mean arm circumference with ANOVA and 95% CIs.
Potential Advantages:
- We know how to do it!
- Uses a less crude categorization of height than the previous approach of dichotomizing
Potential Disadvantages:
- Still throws away a lot of information in the height data, which was originally measured as continuous
- Requires multiple summary measures (6 sample mean differences, one for each unique pair of height categories) to quantify the arm circumference/height relationship
- Does not exploit the structure we see in the previous boxplot: as height increases, so does arm circumference

Approach 3: Arm Circumference and Height
What about treating height as continuous when estimating the arm circumference/height relationship?
Linear regression is a potential option: it allows us to associate a continuous outcome with a continuous predictor via a line.
The line estimates the mean value of the outcome for each continuous value of height in the sample used.
This makes a lot of sense, but only if a line reasonably describes the outcome/predictor relationship.
Linear regression can also use binary or categorical predictors (will
show later in this set of lectures)

Visualizing Arm Circumference and Height Relationship
A useful visual display for assessing the nature of the association between two continuous variables: a scatterplot
[Figure: scatterplot of arm circumference against height]

Visualizing Arm Circumference and Height Relationship
Question: does a line reasonably describe the general shape of the relationship between arm circumference and height?
We can estimate a line, using the computer (details to come in a subsequent lecture section).
The line we estimate will be of the form: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Here, $\hat{y}$ is the average arm circumference for a group of children all of the same height, x.

Example: Arm Circumference and Height
Equation of the regression line relating estimated mean arm circumference (cm) to height (cm), from Stata:
$\hat{y} = 2.7 + 0.16x$
Here, $\hat{y}$ = estimated average arm circumference (like what we previously would call $\bar{y}$), x = height, $\hat{\beta}_0 = 2.7$ and $\hat{\beta}_1 = 0.16$.
This is the estimated line from the sample of 150 Nepali children.

Example: Arm Circumference and Height
[Figure: scatterplot with the regression line $\hat{y} = 2.7 + 0.16x$ superimposed]

Example: Arm Circumference and Height
Estimated mean arm circumference for children 60 cm in height:
for x = 60 cm, $\hat{y} = 2.7 + 0.16 \times 60 = 12.3$ cm
Notice that most points don't fall directly on the line: we are estimating the mean arm circumference of children 60 cm tall, and observed points vary about the estimated mean.

Example: Arm Circumference and Height
How to interpret estimated slope?
$\hat{y} = 2.7 + 0.16x$
Here, $\hat{\beta}_1 = 0.16$
Two ways to say the same thing:
- $\hat{\beta}_1$ is the average change in arm circumference for a one-unit (1 cm) increase in height
- $\hat{\beta}_1$ is the mean difference in arm circumference for two groups of children who differ by one unit (1 cm) in height, taller to shorter
This result estimates that the mean difference in arm circumference for a one cm difference in height is 0.16 cm, with taller children having greater average arm circumference.

Example: Arm Circumference and Height
This mean difference estimate is constant across the entire height range in the sample: this is the definition of the slope of a line.
$\hat{y} = 2.7 + 0.16x$

Example: Arm Circumference and Height
What is the estimated mean difference in arm circumference for:
- Children 60 cm tall versus children 59 cm tall?
- Children 25 cm tall versus children 24 cm tall?
- Children 72 cm tall versus children 71 cm tall?
- Etc.?
The answer is the same for all of the above: 0.16 cm

Example: Arm Circumference and Height
What is the estimated mean difference in arm circumference for children 60 cm tall versus children 50 cm tall?
$\hat{y}_{x=60} - \hat{y}_{x=50} = 10 \times \hat{\beta}_1 = 10 \times 0.16 \text{ cm} = 1.6 \text{ cm}$

Example: Arm Circumference and Height
What is the estimated mean difference in arm circumference for:
- Children 90 cm tall versus children 89 cm tall?
- Children 34 cm tall versus children 33 cm tall?
- Children 110 cm tall versus children 109 cm tall?
- Etc.?
This is a trick question!
The range of observed heights in the sample is 40.9 cm – 73.3 cm: our regression results only apply to the relationship between arm circumference and height within this height range.
$\hat{y} = 2.7 + 0.16x$

Example: Arm Circumference and Height
How to interpret the estimated intercept?
$\hat{y} = 2.7 + 0.16x$
Here, $\hat{\beta}_0 = 2.7$ cm
This is the estimated $\hat{y}$ when x = 0: the estimated mean arm circumference for children 0 cm tall.
Does this make sense given our sample?
As we noted before, estimates of mean arm circumference only apply to the observed height range.
Frequently, the intercept has no meaningful scientific interpretation, but it is necessary to fully specify the equation of the line and to make estimates of mean arm circumference for groups of children with heights in the sample range.

Example: Arm Circumference and Height
Notice that x = 0 is not even on this graph (the vertical axis is at x = 39).
[Figure: scatterplot with the regression line $\hat{y} = 2.7 + 0.16x$]

Section C
Simple Linear Regression: More Examples

Example: Hb and PCV
Linear regressions performed with a single predictor (one x) are called simple linear regressions.
Linear regressions performed with more than one predictor (multiple x's) are called multiple linear regressions.
In this set of lectures we are dealing with simple linear regression: in this section we will give three more examples.

Example: Hb and PCV
Data on laboratory measurements from a random sample of 21 clinic patients 20-67 years old
Question: what is the relationship between hemoglobin levels (g/dL) and packed cell volume (percent of packed cells)?
Data:
- Hemoglobin (Hb): mean 14.1 g/dL, SD 2.3 g/dL, range 9.6 g/dL – 17.1 g/dL
- Packed Cell Volume (PCV): mean 41.1%, SD 8.1%, range 25% – 55%

Visualizing Hb and PCV Relationship
[Figure: scatterplot display]

Example: Hb and PCV
Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume, from Stata:
$\hat{y} = 5.77 + 0.20x$
Here, $\hat{y}$ = estimated average hemoglobin (like what we previously would call $\bar{y}$), x = PCV, $\hat{\beta}_0 = 5.77$ and $\hat{\beta}_1 = 0.20$.
This is the estimated line from the sample of 21 subjects.

Example: Hb and PCV
$\hat{y} = 5.77 + 0.20x$
$\hat{\beta}_1 = 0.20$: what are the units?
Well, $\hat{y}$ is in g/dL and x is in percent, so $\hat{\beta}_1$ is in units of g/dL per percent.
This result estimates that the mean difference in hemoglobin levels for two groups of subjects who differ by 1% in PCV is 0.20 g/dL: subjects with greater PCV have greater Hb levels on average.

Visualizing Hb and PCV Relationship
[Figure: scatterplot display with the regression line $\hat{y} = 5.77 + 0.20x$]

Example: Hb and PCV
What is the average difference in Hb levels for subjects with PCV of 40% compared to subjects with 32%?
$\hat{\beta}_1 = 0.20$ compares groups of subjects who differ in PCV by 1% (it is positive, so those with the greater PCV have hemoglobin levels 0.20 g/dL greater on average).
To compare subjects with PCV of 40% versus subjects with 32%, which is an 8-unit difference in x, take
$8 \times \hat{\beta}_1 = 8 \times 0.20 = 1.6$ g/dL

Example: Hb and PCV
What is the estimated Hb level for subjects with PCV of 41%?
$\hat{y} = 5.77 + 0.20x$
Plugging 41% into the equation, $\hat{y} = 5.77 + 0.20 \times 41 = 13.97$ g/dL

Example: Wages and Education Level
Data on hourly wages from a random sample of 534 U.S. workers in 1985
Question: what is the relationship between hourly wage (US$) and years of formal education?
Data:
- Hourly wages: mean $9.04/hr, SD $5.13/hr, range $1.00/hr – $44.50/hr
- Years of formal education: mean 13.0 years, SD 2.6 years, range 2 years – 18 years

Visualizing Wages and Education Level Relationship
[Figure: scatterplot display]

Example: Wages and Education Level
Equation of the regression line relating estimated mean hourly wages (US$) to years of education, from Stata:
$\hat{y} = -0.75 + 0.75x$
Here, $\hat{y}$ = estimated average hourly wage (like what we previously would call $\bar{y}$), x = years of formal education, $\hat{\beta}_0 = -0.75$ and $\hat{\beta}_1 = 0.75$.
This is the estimated line from the sample of 534 subjects.

Visualizing Wages and Education Level Relationship
[Figure: scatterplot display with regression line]

Wages and Education Level
What is the interpretation of the slope estimate?
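One way to work through this question is to read the wage slope the same way as in the earlier examples: as a mean difference per one-unit difference in x. The sketch below is illustrative only. The function name and the education levels compared (13 versus 12 years, 16 versus 12 years) are assumptions made for the example; the slope of 0.75 and the observed range of 2 to 18 years are taken from the slides.

```python
# Estimated slope from the wages example (US$ per hour, per year of education).
SLOPE_WAGE_PER_YEAR = 0.75
# Observed range of years of education in the sample of 534 workers.
EDU_MIN, EDU_MAX = 2, 18


def mean_wage_difference(years_a: float, years_b: float) -> float:
    """Estimated mean hourly wage difference, group a minus group b.

    Hypothetical helper: it refuses education levels outside the observed
    range, because the fitted line only describes that range.
    """
    for years in (years_a, years_b):
        if not EDU_MIN <= years <= EDU_MAX:
            raise ValueError(f"{years} years is outside the observed range")
    return (years_a - years_b) * SLOPE_WAGE_PER_YEAR


# Groups one year apart differ by the slope itself.
print(mean_wage_difference(13, 12))
# Groups four years apart differ by four times the slope.
print(mean_wage_difference(16, 12))
```

Under these assumptions, two groups of workers who differ by one year of education differ by an estimated $0.75 per hour in mean wage, and groups four years apart differ by an estimated $3.00 per hour, with the more educated group earning more on average.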
Example: Arm Circumference and Sex
Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old
Question: what is the relationship between average arm circumference and sex of a child?
Data:
- Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm – 15.6 cm
- Sex: 51% female

Visualizing Arm Circumference and Sex Relationship
[Figure: scatterplot display]
[Figure: boxplot display]

Example: Arm Circumference and Sex
Here, y is arm circumference, a continuous measure; x is not continuous, but binary: male or female.
How to handle sex as an "x" in regression?
One possibility: x = 0 for male children, x = 1 for female children
The equation we will estimate: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
How to interpret the regression coefficients?

Example: Arm Circumference and Sex
Notice: this equation is only estimating two values, the mean arm circumference for male children and the mean for female children.
For female children: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 1 = \hat{\beta}_0 + \hat{\beta}_1$
For male children: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 0 = \hat{\beta}_0$
So $\hat{\beta}_1$ is still a slope estimating the mean difference in y for a one-unit difference in x, but the only possible one-unit difference is 1 (females) to 0 (males).
$\hat{\beta}_0$ actually has substantive meaning in this example: it is the average arm circumference for male children.

Example: Arm Circumference and Sex
The resulting equation: $\hat{y} = 12.5 - 0.13x$
$\hat{\beta}_1 = -0.13$: the estimated mean difference in arm circumference for female children compared to male children is -0.13 cm; female children have lower arm circumference by 0.13 cm on average.
$\hat{\beta}_0 = 12.5$: the mean arm circumference for male children is 12.5 cm.

Visualizing Arm Circumference and Sex Relationship
[Figure: scatterplot display with regression line]

Section D
Simple Linear Regression Model: Estimating the Regression Equation, Accounting for Uncertainty in the Estimates

Example: Hemoglobin and Packed Cell Volume
In the last section, we showed the results from several simple linear regression models.
For example, when
relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, I told you that the resulting regression equation was:
$\hat{y} = 2.7 + 0.16x$
I told you this came from Stata, and I will show you how to do regression with Stata shortly: but how does Stata estimate this equation?

Example: Arm Circumference and Height
There must be some algorithm that will always yield the same results for the same data set.

Example: Arm Circumference and Height
The algorithm to estimate the equation of the line is called "least squares" estimation.
The idea is to find the line that gets "closest" to all of the points in the sample.
How to define closeness to multiple points?
In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of $\hat{y}$ for that point's x: in other words, the squared distance between an observed y-value and the estimated y-value for all points with the same value of x.

Example: Arm Circumference and Height
Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample.

Example: Arm Circumference and Height
The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are the values that minimize the cumulative squared distances, i.e.
$\min \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2$

Example: Arm Circumference and Height
The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are just estimates based on a single sample.
If we were to have a different random sample of 150 Nepali children from the same population of <12 month olds, the resulting estimate would likely be different: i.e.
the values that minimized the cumulative squared distance from this second sample of points would likely be different.
As such, all regression coefficients have an associated standard error that can be used to make statements about the true relationship between mean y and x (for example, the true slope $\beta_1$) based on a single sample.

Example: Arm Circumference and Height
For the regression relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, the resulting regression equation was:
$\hat{y} = 2.7 + 0.16x$
$\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$
$\hat{\beta}_0 = 2.70$ and $\widehat{SE}(\hat{\beta}_0) = 0.88$

Example: Arm Circumference and Height
The random sampling behavior of estimated regression coefficients is normal for large samples (n > 60), and centered at the true values.
As such, we can use the same ideas as before to create 95% CIs and get p-values.

Example: Arm Circumference and Height
$\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$
95% CI for $\beta_1$:
$\hat{\beta}_1 \pm 2\,\widehat{SE}(\hat{\beta}_1) = 0.16 \pm 2 \times 0.014 = (0.13, 0.19)$

Example: Arm Circumference and Height
p-value for testing $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$
Assume the null is true, and calculate the standardized "distance" of $\hat{\beta}_1$ from 0:
$t = \dfrac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{0.16}{0.014} = 11.4$
The p-value is the probability of being 11.4 or more standard errors away from a mean of 0 on a normal curve: very low in this example, p < .001

Summarizing findings: Arm Circumference and Height
This research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150.
A statistically significant positive association was found (p<.001).
The results estimate that two groups of such children who differ by 1 cm in height will differ on average by 0.16 cm in arm circumference (95% CI 0.13 cm to 0.19 cm).

Summarizing findings: Arm Circumference and Height
Finally: Stata!
If you have your "y" and "x" values entered in Stata, then to do linear regression use the regress command:
regress y x
[Figure: data snippet from Stata]

Using Stata: Arm Circumference and Height
regress armcirc height
[Figure: Stata regression output, with $\hat{\beta}_0$ and $\hat{\beta}_1$ highlighted]
$\hat{y} = 2.7 + 0.16x$

Example 2: Arm Circumference and Height
Give an estimate and 95% CI for the mean difference in arm circumference for children 60 cm tall compared to children 50 cm tall.
From the previous section we know this estimated mean difference is
$(60 - 50) \times \hat{\beta}_1 = 10\hat{\beta}_1 = 10 \times 0.16 = 1.6$ cm
How to get standard error?
Well, as it turns out:
$\widehat{SE}(10\hat{\beta}_1) = 10 \times \widehat{SE}(\hat{\beta}_1)$
$\widehat{SE}(10\hat{\beta}_1) = 10 \times 0.014 = 0.14$
95% CI for the mean difference:
$10\hat{\beta}_1 \pm 2\,\widehat{SE}(10\hat{\beta}_1) = 1.6 \pm 2 \times 0.14 = (1.32, 1.88)$ cm

Example 2: Hemoglobin and "Packed Cell Volume"
Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume, from Stata:
$\hat{y} = 5.77 + 0.20x$
[Figure: snippet of data in Stata]

Example 2: Hemoglobin and "Packed Cell Volume"
regress Hb PCV
[Figure: Stata regression output]

Example 2: Hemoglobin and "Packed Cell Volume"
Same idea with computation of the 95% CI and p-value as we saw before.
However, with small (n < 60) samples there is a slight change, analogous to what we did with means and differences in means before.
The sampling distribution of regression coefficients is not quite normal, but follows a t-distribution with n-2 degrees of freedom.
95% CI for $\beta_1$:
$\hat{\beta}_1 \pm t_{.95,\,n-2} \times \widehat{SE}(\hat{\beta}_1)$
In this example:
$\hat{\beta}_1 \pm t_{.95,\,19} \times \widehat{SE}(\hat{\beta}_1) = 0.20 \pm 2.09 \times 0.046 = (0.10, 0.30)$

Example: Hemoglobin and "Packed Cell Volume"
p-value for testing $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$
Assume the null is true, and calculate the standardized "distance" of $\hat{\beta}_1$ from 0:
$t = \dfrac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{0.20}{0.046} = 4.35$
The p-value is the probability of being 4.35 or more standard errors away from a mean of 0 on a t curve with 19 degrees of freedom: very low in this example, p < .001

Interpreting Result of 95% CI
So, the estimated slope is 0.2 with 95% CI 0.10 to 0.30.
How to interpret results?
- Based on a sample of 21 subjects, we estimated that PCV (%) is positively associated with hemoglobin levels
- We estimated that a one-percent increase in PCV is associated with a 0.2 g/dL increase in hemoglobin on average
- Accounting for sampling variability, this mean increase could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects

Interpreting Result of 95% CI
In other words:
- We estimated the average difference in hemoglobin levels for two groups of subjects who differ by one percent in PCV to be 0.2 g/dL (higher PCV group relative to lower)
- Accounting for sampling variability, the mean difference could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects

What about Intercepts?
In this section, all examples have confidence intervals for the slope, and for multiples of the slope.
We can also create confidence intervals/p-values for the intercept in the same manner (and Stata presents them in the output).
However, as we noted in the previous section, many times the intercept is just a placeholder and does not describe a useful quantity: as such, 95% CIs and p-values for it are not always relevant.

Section E
Measuring the Strength of a Linear Association

Strength of Association
The slope of a regression line estimates the magnitude and direction of the relationship between y and x: it encapsulates how much y differs on average with differences in x.
The slope estimate and standard error can be used to address the uncertainty in this estimate with regard to the true magnitude and direction of the association in the population from which the sample was taken.
Slopes do not impart any information about how well the regression line fits the data in the sample; the slope gives no indication of how close the points get to the estimated regression line.

Strength of Association
Another quantity that can be estimated via linear regression is the coefficient of determination, $R^2$: this is a
number that ranges from 0 to 1, with larger values indicating "closer fits" of the data points and regression line.
$R^2$ measures strength of association by comparing the variability of points around the regression line to the variability in y-values ignoring x.

Example: Arm Circumference and Height
How close do the points get to the line: can we quantify this?

Example: Arm Circumference and Height (SR1 Flashback)
The sample standard deviation of the y-values, ignoring the corresponding potential information in x, is
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
This measures how far on average each of the sample y-values falls from the overall mean of all y-values.
In this example s = 1.48 cm

Example: Arm Circumference and Height
[Figure: "visualization" on the scatterplot]

Example: Arm Circumference and Height
The standard deviation of regression, referred to as root mean square error, is the "average" distance of points from the line: how far on average each y falls from its mean as predicted by its corresponding x-value.
$s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

Example: Arm Circumference and Height
Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample.

Using Stata: Arm Circumference and Height
The regress command in Stata gives $s_{y|x}$.
[Figure: Stata regression output]

Example: Arm Circumference and Height
If $s = s_{y|x}$, then knowing x does not yield a better guess for the mean of y than using the overall mean $\bar{y}$ (flat regression line).
The smaller $s_{y|x}$ is relative to s, the closer the points are to the regression line.
$R^2$ functionally measures how much smaller $s_{y|x}$ is than s: as such it is an estimate of the amount of variability in y explained by taking x into account.

Using Stata: Arm Circumference and Height
The regress command in Stata gives $R^2$: child's height explains (an estimated) 46% of the variation in arm circumference.

Example: Arm Circumference and Height
$R^2$ and r
r = the properly signed square root of $R^2$; the proper sign is the same sign as the slope in the regression
r is called the correlation coefficient
(not to be confused with the "regression coefficients": great names, huh)
Allowable values:
- $0 \le R^2 \le 1$
- If the relationship between y and x is positive, $0 \le r \le 1$
- If the relationship between y and x is negative, $-1 \le r \le 0$
In this example, $r = \sqrt{R^2} = \sqrt{0.46} = 0.68$

Example: Arm Circumference and Height
So from the example: child height explains (an estimated) 46% of the variation in arm circumference.
This is just an estimate based on the sample; a 95% CI can be computed, but it's not easy to do and is not given readily by the computer; also, the procedure for estimating the 95% CI is not so good.
So this means an estimated 54% of the variability in arm circumference is not explained by child's height.
Some of this unexplained variability may be explained by factors other than height.
Multiple linear regression will allow us to estimate the relationship between arm circumference, height and other child characteristics in one analysis.

Example 2: Hemoglobin and "Packed Cell Volume"
The regress command in Stata gives $R^2$: PCV explains (an estimated) 51% of the variation in hemoglobin levels.

Example: Hemoglobin and PCV
The regress command in Stata gives $R^2$ of 0.51; the slope is positive, so $r = \sqrt{R^2} = \sqrt{0.51} = 0.71$

Example 3: Wages and Years of Education
The regress command in Stata gives $R^2$: years of education explains (an estimated) 15% of the variation in hourly wages.
Here $r = \sqrt{R^2} = \sqrt{0.15} = 0.39$

Example 4: Arm Circumference and Child Sex
The regress command in Stata gives $R^2$: sex (female = 1) explains (an estimated) 0.2% of the variation in arm circumference.
Here $r = -\sqrt{R^2} = -\sqrt{0.002} = -0.045$. In this sample of data, female sex is negatively correlated with arm circumference.
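The rule "r is the properly signed square root of $R^2$" can be checked against the four examples. The following is a minimal sketch, not part of the original slides: the helper name `correlation_from_r2` is an assumption, and the $R^2$ values and slopes are the rounded figures quoted in the slides.

```python
import math


def correlation_from_r2(r_squared: float, slope: float) -> float:
    """Return r: the square root of R^2, carrying the sign of the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# (R^2, estimated slope) as quoted in the slides for each example.
examples = {
    "arm circumference vs height": (0.46, 0.16),
    "hemoglobin vs PCV": (0.51, 0.20),
    "hourly wage vs years of education": (0.15, 0.75),
    "arm circumference vs sex (female = 1)": (0.002, -0.13),
}

for label, (r_squared, slope) in examples.items():
    print(f"{label}: r = {correlation_from_r2(r_squared, slope):.3f}")
```

Only the sign of the slope is used, never its size, which is consistent with r and $R^2$ measuring strength of association rather than magnitude.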
What's a "Good" $R^2$?
There are a couple of important things to keep in mind about $R^2$ and r:
- These quantities are both estimates based on the sample of data; they are frequently reported without any recognition of sampling variability, for example a 95% confidence interval
- Low $R^2$ and r are not necessarily "bad": many outcomes cannot/will not be fully or close to fully explained, in terms of variability, by any one single predictor

What's a "Good" $R^2$?
The higher the $R^2$ value, the better x predicts y for individuals in a sample/population, as individual y-values vary less about their estimated means based on x.
However, there may be important overall associations between the mean of y and x even though there is still a lot of individual variability in y-values about their means estimated by x.
In the wages example, years of education explained an estimated 15% of the variability in hourly wages.
The association was statistically significant, showing that average wages were greater for persons with more years of education.
However, for any single education level (year), there is still a lot of variation in wages for individual workers.

Slope versus $R^2$
Slope estimates the magnitude and direction of the relationship between y and x:
- Estimates a mean difference in y for two groups who differ by one unit in x
- The slope will change if the units change for y and/or for x
- Larger slopes are not indicative of stronger linear association; smaller slopes are not indicative of weaker linear association
$R^2$ measures strength of linear association; r measures strength and direction:
- Neither $R^2$ nor r measures magnitude
- Neither $R^2$ nor r changes with changes in units

Using Stata: Arm Circumference and Height
Regression of arm circumference (cm) on height in centimeters
$\hat{y} = 2.7 + 0.16x$
$R^2$ = 0.46 or 46%; $\hat{\beta}_1 = 0.16$

Using Stata: Arm Circumference and Height
Regression of arm circumference on height in inches
regress armcirc height_inch

      Source |       SS       df       MS              Number of obs =     150
-------------+------------------------------           F(  1,   148) =  124.30
       Model |  148.874589     1  148.874589           Prob > F      =  0.0000
    Residual |  177.263343   148  1.19772529           R-squared     =  0.4565
-------------+------------------------------           Adj R-squared =  0.4528
       Total |  326.137932   149  2.18884518           Root MSE      =  1.0944

------------------------------------------------------------------------------
     armcirc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 height_inch |   .4008806    .035957    11.15   0.000     .3298251     .471936
       _cons |   2.695906   .8774225     3.07   0.003     .9620119    4.429801
------------------------------------------------------------------------------

$\hat{y} = 2.7 + 0.40x$
$R^2$ = 0.46 or 46%; $\hat{\beta}_1 = 0.40$

Section F
Optional: Some FYIs about SLR

Standard Error of Slopes
Just FYI: the standard error of the estimated slope is a combination of the variation in y-values around the regression line and the spread of the x-values.
Definition: the standard deviation of regression, called "root mean squared error", is functionally the average distance of any single point from the estimated mean of all y-values with the same x, i.e. the corresponding value on the regression line.
For simple linear regression, d.f. = n-2 and
$s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$
Estimated standard error of the slope estimate:
$\widehat{SE}(\hat{\beta}_1) = \dfrac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

Standard Error of Slopes
Notice that $\widehat{SE}(\hat{\beta}_1)$ will be larger:
- The more variable the y-values are around their corresponding mean estimates on the regression line (i.e. the greater $s_{y|x}$ is)
- The less variable the x-values are around the mean of x: hmm…

Actual Computation of $R^2$
How do we actually compute $R^2$?
Recall the interpretation: percent of variability in y explained by x
Total variability in y?
Actually, for the $R^2$ computation:
total variability in y $= (n - 1)s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Example: Arm Circumference and Height
[Figure: "visualization" on the scatterplot: the distance of each point from the flat line at $\bar{y}$, squared and added together]

$R^2$: Arm Circumference and Height
Regression of arm circumference on height in centimeters: total variability in y
[Figure: Stata regression output with the total sum of squares highlighted]

Actual Computation of $R^2$
Total variability in y not explained by x?
For the $R^2$ computation:
total variability in y not explained by x $= (n - 2)s_{y|x}^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Example: Arm Circumference and Height
Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample, squared and summed.

$R^2$: Arm Circumference and Height
Regression of arm circumference on height in centimeters: total variability in y not explained by x
[Figure: Stata regression output with the residual sum of squares highlighted]

Actual Computation of $R^2$
Percentage of variability in y NOT explained by x:
$\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
$R^2$ is the percentage of variability in y explained by x:
$R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \dfrac{177.26}{326.14} = 1 - 0.54 = 0.46$
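The formulas in this lecture (least squares slope and intercept, root mean square error, standard error of the slope, and $R^2$) can be pulled together in a few lines. This is a minimal sketch rather than a description of how Stata computes its output: the names `SimpleFit` and `fit_simple_regression` and the five-point toy data set are assumptions made for illustration. The closed-form slope used here, the sum of cross-products divided by the sum of squared x deviations, is the standard least squares solution and is not written out in the slides. The only figures taken from the lecture are the residual and total sums of squares from the Stata output for the regression on height in inches.

```python
import math
from dataclasses import dataclass


@dataclass
class SimpleFit:
    intercept: float
    slope: float
    root_mse: float   # s_{y|x}
    se_slope: float   # estimated standard error of the slope
    r_squared: float


def fit_simple_regression(x: list[float], y: list[float]) -> SimpleFit:
    """Least squares fit of y on a single predictor x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or total_ss == 0:
        raise ValueError("x and y must each vary")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variability in y not explained by x: squared distances from the line.
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2
                      for xi, yi in zip(x, y))
    root_mse = math.sqrt(residual_ss / (n - 2))
    return SimpleFit(
        intercept=intercept,
        slope=slope,
        root_mse=root_mse,
        se_slope=root_mse / math.sqrt(sxx),
        r_squared=1 - residual_ss / total_ss,
    )


# Hypothetical toy data, only to show the function running.
toy_fit = fit_simple_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(toy_fit)

# Sums of squares from the Stata output (Residual and Total rows), n = 150.
residual_ss, total_ss, n = 177.263343, 326.137932, 150
r_squared = 1 - residual_ss / total_ss
root_mse = math.sqrt(residual_ss / (n - 2))
sd_y = math.sqrt(total_ss / (n - 1))
print(round(r_squared, 4), round(root_mse, 4), round(sd_y, 2))
```

The last three values should agree with the R-squared and Root MSE reported in the Stata output and with the sample standard deviation s = 1.48 cm quoted in Section E.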
