10-1  COMPLETE BUSINESS STATISTICS by AMIR D. ACZEL & JAYAVEL SOUNDERPANDIAN, 6th edition (SIE)

10-2  Chapter 10: Simple Linear Regression and Correlation

10-3  Chapter Outline
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good Is the Regression?
• Analysis-of-Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression

10-4  LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable

10-5  LEARNING OBJECTIVES (continued)
After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check whether the assumptions about the regression model are valid
• Solve regression problems using spreadsheet templates
• Apply the covariance concept to linear composites of random variables
• Use the LINEST function to carry out a regression

10-6  10-1 Using Statistics
• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable; the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y is a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.

10-7  10-1 Using Statistics
[Figure: scatter plot of Advertising Expenditures (X, x-axis) versus Sales (Y, y-axis).]
This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.

10-8  Examples of Other Scatterplots
[Figure: six scatter plots of Y versus X illustrating various patterns of association.]

10-9  Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component:
    Statistical model = Systematic component + Random errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

10-10  10-2 The Simple Linear Regression Model
The population simple linear regression model:
    Y = β₀ + β₁X + ε
(nonrandom or systematic component: β₀ + β₁X; random component: ε)
where
• Y is the dependent variable, the variable we wish to explain or predict;
• X is the independent variable, also called the predictor variable;
• ε is the error term, the only random component in the model and thus the only source of randomness in Y;
• β₀ is the intercept of the systematic component of the regression relationship;
• β₁ is the slope of the systematic component.
The conditional mean of Y:  E[Y | X] = β₀ + β₁X

10-11  Picturing the Simple Linear Regression Model
[Figure: regression plot showing the line E[Y] = β₀ + β₁X, the intercept β₀, the slope β₁, an observed point (Xᵢ, Yᵢ), and its error εᵢ.]
The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
    E[Yᵢ] = β₀ + β₁Xᵢ
Actual observed values of Y differ from the expected value by an unexplained or random error:
    Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ

10-12  Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εᵢ.
• The errors εᵢ are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).
[Figure: identical normal distributions of errors, all centered on the regression line.]

10-13  10-3 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
    Y = b₀ + b₁X + e
where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.
The estimated regression line:
    Ŷ = b₀ + b₁X
where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
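A quick way to internalize the model and its assumptions is to simulate data from it. A minimal Python sketch (the parameter values below are our own, purely illustrative, not from the text):

```python
import random

random.seed(0)

# Hypothetical population parameters (illustrative only)
beta0, beta1, sigma = 5.0, 2.0, 1.0

# Fixed values of the independent variable X (assumed nonrandom)
xs = [i / 10 for i in range(100)]

# Y = beta0 + beta1 * X + epsilon, with epsilon ~ N(0, sigma^2),
# independent across observations
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# The conditional mean E[Y | X = x] is the systematic part alone
expected = [beta0 + beta1 * x for x in xs]
```

Plotting `ys` against `xs` reproduces the kind of scatter shown on slide 10-7: points distributed around, but not exactly on, a straight line.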
10-14  Fitting a Regression Line
[Figure: four panels showing data points, the errors from an arbitrary fitted line, and the errors from the least squares regression line, which are minimized.]

10-15  Errors in Regression
[Figure: an observed data point Yᵢ, the fitted regression line Ŷ = b₀ + b₁X, the predicted value Ŷᵢ for Xᵢ, and the error eᵢ = Yᵢ − Ŷᵢ.]

10-16  Least Squares Regression
The sum of squared errors in regression is:
    SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²    (sums over i = 1, …, n)
The least squares regression line is that which minimizes the SSE with respect to the estimates b₀ and b₁.
The normal equations:
    Σyᵢ = nb₀ + b₁Σxᵢ
    Σxᵢyᵢ = b₀Σxᵢ + b₁Σxᵢ²
[Figure: SSE as a surface over (b₀, b₁); at the least squares estimates, SSE is minimized with respect to b₀ and b₁.]

10-17  Sums of Squares, Cross Products, and Least Squares Estimators
Sums of squares and cross products:
    SSₓ = Σ(x − x̄)² = Σx² − (Σx)²/n
    SS_Y = Σ(y − ȳ)² = Σy² − (Σy)²/n
    SS_XY = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
Least squares regression estimators:
    b₁ = SS_XY / SSₓ
    b₀ = ȳ − b₁x̄

10-18  Example 10-1
[Table: 25 observations of Miles (x) and Dollars (y), with columns for x², y², and x·y. Column totals: Σx = 79,448; Σy = 106,605; Σx² = 293,426,946; Σxy = 390,185,014.]
    SSₓ = Σx² − (Σx)²/n = 293,426,946 − 79,448²/25 = 40,947,557.84
    SS_XY = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4
    b₁ = SS_XY / SSₓ = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26
    b₀ = ȳ − b₁x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85
10-19  Template (partial output) that can be used to carry out a Simple Regression

10-20  Template (continued) that can be used to carry out a Simple Regression

10-21  Template (continued) that can be used to carry out a Simple Regression
Residual analysis: the plot shows the absence of a relationship between the residuals and the X-values (miles).

10-22  Template (continued) that can be used to carry out a Simple Regression
Note: The normal probability plot is approximately linear. This would indicate that the normality assumption for the errors has not been violated.

10-23  [Figure: scatter plot of Y versus X with the fitted regression line.]

10-24  10-4 Error Variance and the Standard Errors of Regression Estimators
Degrees of freedom in regression: df = n − 2 (n total observations less one degree of freedom for each parameter estimated, b₀ and b₁).
Square and sum all regression errors to find SSE:
    SSE = Σ(Y − Ŷ)² = SS_Y − (SS_XY)²/SSₓ = SS_Y − b₁SS_XY
Example 10-1:
    SSE = SS_Y − b₁SS_XY = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2
An unbiased estimator of σ², denoted by s²:
    MSE = SSE/(n − 2) = 2,328,161.2/23 = 101,224.4
    s = √MSE = √101,224.4 = 318.158

10-25  Standard Errors of Estimates in Regression
The standard error of b₀ (intercept):
    s(b₀) = s·√(Σx² / (n·SSₓ)),  where s = √MSE
Example 10-1:
    s(b₀) = 318.158·√(293,426,946 / ((25)(40,947,557.84))) = 170.338
The standard error of b₁ (slope):
    s(b₁) = s/√SSₓ
Example 10-1:
    s(b₁) = 318.158/√40,947,557.84 = 0.04972

10-26  Confidence Intervals for the Regression Parameters
A (1 − α)100% confidence interval for β₀:  b₀ ± t_{α/2, (n−2)} s(b₀)
A (1 − α)100% confidence interval for β₁:  b₁ ± t_{α/2, (n−2)} s(b₁)
Example 10-1, 95% confidence intervals:
    β₀: 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]
    β₁: 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]
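The Example 10-1 computations can be reproduced from the column totals alone. A minimal Python sketch (not part of the textbook templates; the variable names are our own), applying the chapter's formulas for the least squares estimates, the error variance, the standard errors, and the 95% interval for the slope:

```python
import math

# Summary sums for Example 10-1 (n = 25 trips: x = miles, y = dollars)
n = 25
sum_x, sum_y = 79_448, 106_605
sum_x2, sum_xy = 293_426_946, 390_185_014
ss_y = 66_855_898            # SS_Y, as given in the text

# Sums of squares and cross products
ss_x = sum_x2 - sum_x**2 / n            # 40,947,557.84
ss_xy = sum_xy - sum_x * sum_y / n      # 51,402,852.4

# Least squares estimators
b1 = ss_xy / ss_x                       # slope, about 1.2553
b0 = sum_y / n - b1 * (sum_x / n)       # intercept, about 274.85

# Error variance and standard errors
sse = ss_y - b1 * ss_xy                 # about 2,328,161.2
mse = sse / (n - 2)                     # about 101,224.4
s = math.sqrt(mse)                      # about 318.16
s_b1 = s / math.sqrt(ss_x)              # about 0.0497
s_b0 = s * math.sqrt(sum_x2 / (n * ss_x))   # about 170.34

# 95% confidence interval for the slope (t_{0.025, 23} = 2.069 from a t table)
t = 2.069
ci_b1 = (b1 - t * s_b1, b1 + t * s_b1)  # about (1.1525, 1.3582)
```

The results match the slide values to rounding: b₁ ≈ 1.2553, b₀ ≈ 274.85, s ≈ 318.16, and a slope interval of about [1.1525, 1.3582].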
(0 is not a possible value of the regression slope at 95% confidence.)

10-27  Template (partial output) that can be used to obtain Confidence Intervals for β₀ and β₁

10-28  10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1:
    ρ = −1       indicates a perfect negative linear relationship
    −1 < ρ < 0   indicates a negative linear relationship
    ρ = 0        indicates no linear relationship
    0 < ρ < 1    indicates a positive linear relationship
    ρ = 1        indicates a perfect positive linear relationship
The absolute value of ρ indicates the strength or exactness of the relationship.

10-29  Illustrations of Correlation
[Figure: six scatter plots of Y versus X illustrating ρ = −1, ρ = 0, ρ = 1, ρ = −0.8, ρ = 0, and ρ = 0.8.]

10-30  Covariance and Correlation
The covariance of two random variables X and Y:
    Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
where μ_X and μ_Y are the population means of X and Y, respectively.
The population correlation coefficient:
    ρ = Cov(X, Y)/(σ_X σ_Y)
The sample correlation coefficient:
    r = SS_XY/√(SSₓ·SS_Y)
Example 10-1:
    r = 51,402,852.4/√((40,947,557.84)(66,855,898)) = 51,402,852.4/52,321,943.29 = 0.9824
Note: If ρ < 0, then b₁ < 0; if ρ = 0, then b₁ = 0; if ρ > 0, then b₁ > 0.

10-31  Hypothesis Tests for the Correlation Coefficient
H₀: ρ = 0 (no linear relationship)
H₁: ρ ≠ 0 (some linear relationship)
Test statistic:
    t_(n−2) = r/√((1 − r²)/(n − 2))
Example 10-1:
    t = 0.9824/√((1 − 0.9651)/23) = 0.9824/0.0389 = 25.25
    25.25 > t_{0.005, 23} = 2.807, so H₀ is rejected at the 1% level.

10-32  10-6 Hypothesis Tests about the Regression Relationship
[Figure: three scatter plots — constant Y, unsystematic variation, and a nonlinear relationship — all cases in which a linear regression slope would be zero.]
A hypothesis test for the existence of a linear relationship between X and Y:
    H₀: β₁ = 0
    H₁: β₁ ≠ 0
Test statistic for the existence of a linear relationship between X and Y:
    t_(n−2) = b₁/s(b₁)
where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁.
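The sample correlation and its t test can be checked numerically from the Example 10-1 sums of squares. A short Python sketch (our own, for illustration):

```python
import math

# Sums of squares for Example 10-1, as given in the text
n = 25
ss_x = 40_947_557.84
ss_y = 66_855_898
ss_xy = 51_402_852.4

# Sample correlation coefficient: r = SS_XY / sqrt(SS_X * SS_Y)
r = ss_xy / math.sqrt(ss_x * ss_y)             # about 0.9824

# Test H0: rho = 0 against H1: rho != 0
t_stat = r / math.sqrt((1 - r**2) / (n - 2))   # about 25.25

# Compare with the critical value t_{0.005, 23} = 2.807 (two-tailed 1% test)
reject_h0 = abs(t_stat) > 2.807                # True: reject H0 at the 1% level
```

Note that this t statistic equals the slope t statistic b₁/s(b₁) on the next slide; in simple regression the two tests are equivalent.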
When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

10-33  Hypothesis Tests for the Regression Slope
Example 10-1:
    H₀: β₁ = 0,  H₁: β₁ ≠ 0
    t = b₁/s(b₁) = 1.25533/0.04972 = 25.25
    25.25 > t_{0.005, 23} = 2.807, so H₀ is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.
Example 10-4:
    H₀: β₁ = 1,  H₁: β₁ ≠ 1
    t = (b₁ − 1)/s(b₁) = (1.24 − 1)/0.21 = 1.14
    1.14 < t_{0.05, 58} = 1.671, so H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.

10-34  10-7 How Good is the Regression?
The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
For each observation, the total deviation splits into an unexplained part (error) and an explained part (regression):
    (y − ȳ) = (y − ŷ) + (ŷ − ȳ)
    Total deviation = Unexplained deviation (error) + Explained deviation (regression)
Summing the squared deviations over all observations:
    SST = SSE + SSR,  where SST = Σ(y − ȳ)², SSE = Σ(y − ŷ)², SSR = Σ(ŷ − ȳ)²
    r² = SSR/SST = 1 − SSE/SST
r² is the percentage of total variation explained by the regression.
10-35  The Coefficient of Determination
[Figure: three scatter plots illustrating r² = 0 (SST all error), r² = 0.50 (SST split between SSE and SSR), and r² = 0.90 (SSR dominates); plus the Example 10-1 scatter of Dollars versus Miles with the fitted line.]
Example 10-1:
    r² = SSR/SST = 64,527,736.8/66,855,898 = 0.96518

10-36  10-8 Analysis-of-Variance Table and an F Test of the Regression Model
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST

Example 10-1:
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F Ratio   p-Value
Regression            64,527,736.8     1                    64,527,736.8   637.47    0.000
Error                 2,328,161.2      23                   101,224.4
Total                 66,855,898.0     24

10-37  Template (partial output) that displays the Analysis of Variance and an F Test of the Regression Model

10-38  10-9 Residual Analysis and Checking for Model Inadequacies
[Figure: four plots of residuals against x or ŷ (or time).]
• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes.
• A curved pattern in the residuals results from an underlying nonlinear relationship.
• Residuals that exhibit a linear trend with time indicate time dependence.

10-39  Normal Probability Plot of the Residuals: Flatter than Normal
10-40  Normal Probability Plot of the Residuals: More Peaked than Normal
10-41  Normal Probability Plot of the Residuals: Positively Skewed
10-42  Normal Probability Plot of the Residuals: Negatively Skewed

10-43  10-10 Use of the Regression Model for Prediction
• Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X into the estimated regression equation.
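The ANOVA decomposition and the F ratio can be verified from the Example 10-1 sums of squares. A brief Python sketch (ours, not the spreadsheet template):

```python
# Example 10-1 sums of squares from the text
n = 25
sst = 66_855_898.0      # total sum of squares
ssr = 64_527_736.8      # regression (explained) sum of squares
sse = sst - ssr         # error sum of squares: 2,328,161.2

# Coefficient of determination
r_squared = ssr / sst   # about 0.96518

# ANOVA mean squares and F ratio
msr = ssr / 1           # regression degrees of freedom = 1
mse = sse / (n - 2)     # error degrees of freedom = n - 2 = 23
f_ratio = msr / mse     # about 637.47

# In simple regression, the F ratio equals the square of the slope's
# t statistic (637.47 is 25.25 squared, up to rounding).
```

This reproduces the Example 10-1 ANOVA table: F = 637.47 with 1 and 23 degrees of freedom, far beyond any conventional critical value, consistent with the reported p-value of 0.000.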
• Prediction interval:
  – For a value of Y given a value of X: accounts for the variation in the regression line estimate and the variation of points around the regression line.
  – For an average value of Y given a value of X: accounts only for the variation in the regression line estimate.

10-44  Errors in Predicting E[Y|X]
[Figure: two panels showing upper and lower limits on the slope and on the intercept of the regression line.]
1) Uncertainty about the slope of the regression line.
2) Uncertainty about the intercept of the regression line.

10-45  Prediction Interval for E[Y|X]
[Figure: regression line with the prediction band for E[Y|X].]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.

10-46  Additional Error in Predicting an Individual Value of Y
[Figure: regression line with the prediction band for E[Y|X] and the wider prediction band for Y.]
3) Variation around the regression line.

10-47  Prediction Interval for a Value of Y
A (1 − α)100% prediction interval for Y:
    ŷ ± t_{α/2} s √(1 + 1/n + (x − x̄)²/SSₓ)
Example 10-1 (X = 4,000):
    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)√(1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84)
    = 5,296.05 ± 676.62 = [4,619.43, 5,972.67]

10-48  Prediction Interval for the Average Value of Y
A (1 − α)100% prediction interval for E[Y | X]:
    ŷ ± t_{α/2} s √(1/n + (x − x̄)²/SSₓ)
Example 10-1 (X = 4,000):
    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)√(1/25 + (4,000 − 3,177.92)²/40,947,557.84)
    = 5,296.05 ± 156.48 = [5,139.57, 5,452.53]

10-49  Template Output with Prediction Intervals

10-50  10-11 The Solver Method for Regression
The Solver macro available in Excel can also be used to conduct a simple linear regression. See the text for instructions.
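The two intervals differ only in the extra "1 +" under the square root, which adds the variation of individual points around the line. A Python sketch of the Example 10-1 computation at X = 4,000 (the helper function is our own):

```python
import math

# Example 10-1 quantities from earlier slides
n = 25
b0, b1 = 274.85, 1.2553     # least squares estimates
s = 318.16                  # sqrt(MSE)
x_bar = 3_177.92            # mean of the miles values
ss_x = 40_947_557.84
t = 2.069                   # t_{0.025, 23}

def interval(x, individual):
    """Prediction interval at x: for an individual value of Y when
    individual is True, for the mean E[Y|X] when False."""
    y_hat = b0 + b1 * x
    extra = 1.0 if individual else 0.0          # the "1 +" term
    half = t * s * math.sqrt(extra + 1 / n + (x - x_bar) ** 2 / ss_x)
    return y_hat - half, y_hat + half

lo_y, hi_y = interval(4_000, individual=True)   # about [4619.43, 5972.67]
lo_m, hi_m = interval(4_000, individual=False)  # about [5139.57, 5452.53]
```

As the slides note, the individual-Y interval (±676.62) is much wider than the interval for the mean (±156.48), because it must cover the scatter of single observations, not just the uncertainty in the line.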
10-51  10-12 Linear Composites of Dependent Random Variables
• The case of independent random variables: for independent random variables X₁, X₂, …, Xₙ, the expected value of the sum is given by:
    E(X₁ + X₂ + … + Xₙ) = E(X₁) + E(X₂) + … + E(Xₙ)
• For independent random variables X₁, X₂, …, Xₙ, the variance of the sum is given by:
    V(X₁ + X₂ + … + Xₙ) = V(X₁) + V(X₂) + … + V(Xₙ)

10-52  10-12 Linear Composites of Dependent Random Variables
• The case of independent random variables with weights: for independent random variables X₁, X₂, …, Xₙ with respective weights α₁, α₂, …, αₙ, the expected value of the weighted sum is given by:
    E(α₁X₁ + α₂X₂ + … + αₙXₙ) = α₁E(X₁) + α₂E(X₂) + … + αₙE(Xₙ)
• For independent random variables X₁, X₂, …, Xₙ with respective weights α₁, α₂, …, αₙ, the variance of the weighted sum is given by:
    V(α₁X₁ + α₂X₂ + … + αₙXₙ) = α₁²V(X₁) + α₂²V(X₂) + … + αₙ²V(Xₙ)

10-53  Covariance of Two Random Variables X₁ and X₂
• The covariance between two random variables X₁ and X₂ is given by:
    Cov(X₁, X₂) = E{[X₁ − E(X₁)][X₂ − E(X₂)]}
• A simpler expression for the covariance is:
    Cov(X₁, X₂) = ρ·SD(X₁)·SD(X₂)
where ρ is the correlation between X₁ and X₂.

10-54  10-12 Linear Composites of Dependent Random Variables
• The case of dependent random variables with weights: for dependent random variables X₁, X₂, …, Xₙ with respective weights α₁, α₂, …, αₙ, the variance of the weighted sum is given by:
    V(α₁X₁ + α₂X₂ + … + αₙXₙ) = α₁²V(X₁) + α₂²V(X₂) + … + αₙ²V(Xₙ) + 2α₁α₂Cov(X₁, X₂) + … + 2αₙ₋₁αₙCov(Xₙ₋₁, Xₙ)
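The variance formula for a weighted sum of dependent variables can be sanity-checked by simulation. A Python sketch for two variables (all numeric values below are our own illustrative choices, not from the text), comparing the formula with the sample variance of a simulated composite:

```python
import random
import statistics

random.seed(42)

# Illustrative (hypothetical) parameters for two dependent variables
v1, v2 = 4.0, 9.0                # V(X1), V(X2)
rho = 0.5                        # correlation between X1 and X2
a1, a2 = 2.0, 3.0                # weights
cov = rho * v1**0.5 * v2**0.5    # Cov(X1, X2) = rho * SD(X1) * SD(X2) = 3.0

# Formula: V(a1 X1 + a2 X2) = a1^2 V(X1) + a2^2 V(X2) + 2 a1 a2 Cov(X1, X2)
var_formula = a1**2 * v1 + a2**2 * v2 + 2 * a1 * a2 * cov   # 16 + 81 + 36 = 133

# Simulate correlated (X1, X2) via a shared standard normal factor:
# each Xi loads sqrt(rho) on the common factor, so Corr(X1, X2) = rho
samples = []
for _ in range(200_000):
    z_common = random.gauss(0, 1)
    x1 = v1**0.5 * (rho**0.5 * z_common + (1 - rho)**0.5 * random.gauss(0, 1))
    x2 = v2**0.5 * (rho**0.5 * z_common + (1 - rho)**0.5 * random.gauss(0, 1))
    samples.append(a1 * x1 + a2 * x2)

var_sim = statistics.pvariance(samples)   # should be close to var_formula
```

With these numbers the formula gives 133; the simulated variance agrees to within sampling error, and setting `rho = 0` recovers the independent-variables rule of slide 10-52.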