
Regression Analysis (overview)

Regression analysis is the idea of
• analyzing a set of sample data,
• establishing a relationship between two variables,
• explaining how one variable depends upon the other, and
• using this dependency to explain the population or to predict future data.
(Life expectancy example: see Excel.)

Regression Analysis (overview, cont'd)

Regarding the second item, we will study linear relationships. Regarding the third item, the analysis can be extended to show how one variable depends upon many variables.

Linear Relationships

When two variables have a linear relationship, we call them
X = independent variable (horizontal axis)
Y = dependent variable (vertical axis)
and describe the relationship as
Y = a + b*X
where a is the intercept and b is the slope.

For example (see Excel, life expectancy):
X = Year, Y = Male Life Expectancy, a = -445.95, b = 0.2608

Linear Relationships (cont'd)

Y = a + b*X, where a is the intercept and b is the slope.
• The intercept a is the value of Y when X = 0, i.e., the place where the line crosses the vertical axis.
• The slope b describes the pitch of the line.
• If b > 0, the line goes up from left to right; the variables have a positive relationship: "when X goes up, Y goes up."
• If b < 0, the line goes down from left to right; the variables have a negative relationship: "when X goes up, Y goes down."

In regression, we take two variables X and Y and determine the a and b that best describe the relationship between them. Our starting assumption is that Y is dependent on X.

Doing Regression Analysis

1. Make a scatter diagram of the relevant data.
2. Add a trendline, displaying the equation as well as R².
3.
Do the full regression analysis using Excel's Tools > Data Analysis > Regression command, saving and plotting the residuals.
4. Examine the residuals for individual information (and other trends).

Method of Least Squares

Think about trying to fit the line Y = a + b*X to the data, for whatever values of a and b you wish. How good is the fit? The smaller the residuals (errors), the better. For example, trying the line Y = 5 + 2*X on two data points:

X   Y actual   Y predicted     Residual        Residual²
2   3          5 + 2*2 = 9     3 - 9 = -6      36
5   -1         5 + 2*5 = 15    -1 - 15 = -16   256
                                               36 + 256 = 292

SSE = Sum of Squares Error = amount of variability "not explained" by the line.

Method of Least Squares (cont'd)

The main goal of the least squares method is to choose a and b so that SSE is minimized.

Method of Least Squares (cont'd)

SSR = Sum of Squares due to Regression = amount of variability explained by the regression
SST = Total Sum of Squares = total variability of Y about its mean

SST = SSR + SSE
total variability = explained variability + unexplained variability

R² = SSR / SST
   = percentage of variability in Y that is explained by X
   = "sample coefficient of determination"

Because of the above equations and the definition of SST, minimizing SSE is the same as maximizing SSR and/or R². However, R² is not the best measure of fit.

Method of Least Squares (cont'd)

A better measure of fit is the standard error of the estimate, denoted Se. Se is the standard deviation of the residuals about their mean. The smaller SSE, the smaller Se.

Practical Trends with R² and Se

When working with regression, you will often have to compare two separate regressions and determine which is the better model. In this case, look for higher R² and lower Se.

Kalamazoo Widget Company

Kalamazoo produces its widgets in batches or lots, and wants to see how batch size is related to the labor hours required to produce a lot.
The following data are provided:

Observation   Batch size   Labor hours
1             30           73
2             20           50
3             60           128
4             80           170
5             40           87
6             50           108
7             60           135
8             30           69
9             70           148
10            60           132

To use the regression package provided in Excel:
• Enter the data from the last two columns of the Kalamazoo table into an Excel spreadsheet.
• Choose Data Analysis from the Tools menu (available only if the Analysis ToolPak Add-In has been installed).
• Once the analysis tools are listed, choose Regression.

The regression tool pops up a window asking for a range for your y variable and a range for your x variable. Before entering the ranges, remember that the y variable is dependent and the x variable is independent. For the Kalamazoo example, batch size is the independent (x) variable, and labor hours is the dependent (y) variable. See the "KLMZOO" spreadsheet.

[Figure: scatter plot of labor hours versus batch size, with the fitted linear trendline.]

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.99780139
R Square            0.995607613
Adjusted R Square   0.995058565
Standard Error      2.738612788
Observations        10

ANOVA
             df   SS      MS      F          Significance F
Regression   1    13600   13600   1813.333   1.02E-10
Residual     8    60      7.5
Total        9    13660

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept      10             2.502939448      3.995302   0.003976   4.228208    15.77179
X Variable 1   2              0.046966822      42.58325   1.02E-10   1.891694    2.108306

The equation for the best straight-line fit to the Kalamazoo data, as determined by Excel, is:

y = 10 + 2x, or labor hours = 10 + 2 * batch size

Interpretation: there is a 10-hour setup that is independent of the batch size; once production starts, each additional widget requires 2 hours of labor.
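Excel's summary output above can also be reproduced by hand from the closed-form least-squares formulas b = Sxy/Sxx and a = ybar - b*xbar. The following is a pure-Python sketch (not part of the original slides); the t critical value 2.306 for 8 degrees of freedom is hardcoded, since the standard library has no inverse-t function:

```python
import math

# Kalamazoo data from the table above
batch = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]        # x: batch size
hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]  # y: labor hours

n = len(batch)
xbar = sum(batch) / n
ybar = sum(hours) / n
sxx = sum((x - xbar) ** 2 for x in batch)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(batch, hours))

b = sxy / sxx            # slope: 2.0
a = ybar - b * xbar      # intercept: 10.0

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(batch, hours))  # SSE = 60
sst = sum((y - ybar) ** 2 for y in hours)                        # SST = 13660
r2 = 1 - sse / sst                 # R Square: ~0.9956
se = math.sqrt(sse / (n - 2))      # Standard Error of the estimate: ~2.7386

# 95% confidence interval for the slope, using t(0.025, 8 df) = 2.306
se_b = se / math.sqrt(sxx)                   # standard error of the slope
ci = (b - 2.306 * se_b, b + 2.306 * se_b)    # ~(1.8917, 2.1083)

print(a, b)                       # 10.0 2.0 -> labor hours = 10 + 2 * batch size
print(round(r2, 6), round(se, 4))
print(round(ci[0], 4), round(ci[1], 4))
```

The computed intercept, slope, R Square, Standard Error, and Lower/Upper 95% values match the Excel summary output above, so nothing about the ToolPak is magic: it is these formulas.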
Take a look at the "regression example" in the Regression Excel file.

Point Estimates from Regression

Point predictions are very easy to produce: substitute the value of x for which we want a prediction into the regression equation. To estimate the number of labor hours required to produce a batch of 55 widgets at Kalamazoo (see the KLMZOO spreadsheet), compute:

labor hours = 10 + 2 * 55 = 120

More About the Regression Output

When you use the least squares method to perform a regression analysis, here is a very important fact: the mean of all residuals equals 0.

More About the Regression Output (cont'd)

In a regression, what does a slope coefficient of 0 mean? It means that X has no effect on Y at all. The regression output gives a 95% confidence interval for the slope coefficient:
• if 0 is outside the interval, then we are confident that the slope is not 0
• if 0 is inside the interval, then the slope may be 0

The P-value (or prob-value) is "the probability that 0 is the real slope coefficient." Generally speaking, to make inferences from the regression, you want the P-value to be small (less than 0.05).

A good regression model has…
• a high R² value
• a low Se value
• a slope coefficient that is not likely to equal 0
  – 0 is outside the 95% confidence interval
  – the P-value is less than 0.05

So then what makes a bad model?

A bad regression model has…
• Residuals that are not well distributed
  – overall: the residual histogram should look normal
  – for each section of values of X: the residual plot should not have any strong patterns; e.g., the mean of each section's residuals should equal 0 (like the entire model)
  – for each subcollection of non-biased sample data: the mean of the subcollection's residuals should equal 0 (like the entire model)
  – in other words, we want no systematic misprediction
• A linear form that just doesn't fit the situation
  – Do you have strong reason to believe that Y and X are not related by a line?

A bad regression model has… (cont'd)
• Is the distribution of residuals non-normal?
• Is the linear model inappropriate?

When one of these is "yes" and the difference from the ideal model is striking, then linear regression is probably not a good idea. When one of these is "yes" but the difference from the ideal model is moderate, then maybe the linear regression can be improved… (see Excel: temperature, air passenger miles)

Excel Temperature Model: Examining Residuals

• The sum/average of residuals over subcollections should be close to zero; use a pivot table and grouping to examine this.
• Residuals should be normally distributed; use a pivot table/chart to create a histogram of the residuals.

Air Passenger Miles Model

• Does the histogram of residuals look like a normal distribution?
• What about the averages over subcollections of residuals?

Multiple Linear Regression

Multiple linear regression is very similar to simple linear regression, except that the dependent variable Y is described by m independent variables X1, …, Xm:

Y = a + b1*X1 + … + bm*Xm

(see Excel, artsy_regression.xls)

Multiple Linear Regression (cont'd)

Y = a + b1*X1 + … + bm*Xm
• The intercept a is the same.
• The slope bi is the change in Y given a unit change in Xi, holding all other variables constant.
• SST, SSE, SSR, and R² are the same.
• Se is the same, except Se = sqrt( SSE / (n - m - 1) ).
• The slope coefficient confidence intervals (one for each Xi) are the same.
• The P-values (one for each Xi) are the same.

(see Excel, artsy_regression.xls)

Dummy Variables

Dummy variables are variables that take on only the values 0 or 1. They can be used as a tool to put variables that do not have a natural numerical interpretation into a regression. For example, a sex variable might be 0 for female and 1 for male. Dummy variables can also be used to fix the problem we saw with GRADE in the Artsy example….
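As a concrete sketch of this idea (assuming, as in the Artsy example, that GRADE takes the values 1 through 8), each grade value expands into a row of 0/1 indicator columns G1 through G8:

```python
# Expand a GRADE value (1..8, as in the Artsy example) into the eight
# 0/1 dummy variables G1..G8: Gk = 1 if the observation is in grade k.
def grade_dummies(grade, levels=range(1, 9)):
    return [1 if grade == g else 0 for g in levels]

print(grade_dummies(3))  # [0, 0, 1, 0, 0, 0, 0, 0]  -- only G3 is 1
print(grade_dummies(8))  # [0, 0, 0, 0, 0, 0, 0, 1]  -- only G8 is 1
```

Each row has exactly one 1, which is what the dummy-variable tables on the following slides show observation by observation.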
Dummy Variables (cont'd)

Artsy example: introduce a variable Gk for each grade, with Gk = 1 if the student is in grade k and 0 if not.

Grade     G1  G2  G3  G4  G5  G6  G7  G8
Grade 1   1   0   0   0   0   0   0   0
Grade 2   0   1   0   0   0   0   0   0
Grade 3   0   0   1   0   0   0   0   0
Grade 4   0   0   0   1   0   0   0   0
Grade 5   0   0   0   0   1   0   0   0
Grade 6   0   0   0   0   0   1   0   0
Grade 7   0   0   0   0   0   0   1   0
Grade 8   0   0   0   0   0   0   0   1

Dummy Variables (cont'd)

You must choose a "base case"; here Grade 1 is a good base case, which means we should delete the variable G1.

Grade     G2  G3  G4  G5  G6  G7  G8
Grade 1   0   0   0   0   0   0   0
Grade 2   1   0   0   0   0   0   0
Grade 3   0   1   0   0   0   0   0
Grade 4   0   0   1   0   0   0   0
Grade 5   0   0   0   1   0   0   0
Grade 6   0   0   0   0   1   0   0
Grade 7   0   0   0   0   0   1   0
Grade 8   0   0   0   0   0   0   1

Dummy Variables (cont'd)

• Add these variables to your regression in place of GRADE.
• Rerun the regression.
• The new regression eliminates the earlier problems.
(see Excel, artsy_regression.xls)

Why not add ALL dummy variables? Doing so creates linear dependency problems. See the Salary Data spreadsheet: what happens when you include all the dummy variables?

A Basic Technique for Multiple Linear Regression

Regression analysis is all about trying to determine which variables have an effect on a single variable; in other words, you are trying to describe Y in terms of X1, …, Xm. Some of the variables may not have a significant effect, which corresponds to a high P-value for the corresponding slope coefficient. Efficiency: adding in variables is okay, but remove the ones that don't have a significant effect on Y.

Rough Flow Chart for Regression

1. Do the regression.
2. If some slope coefficient has a high P-value, remove that variable and repeat.
3. If the residuals are very bad, or you feel "linear" is not right, consider alternatives to linear regression.
4. If the residuals are okay but not great, examine them for evidence of systematic misprediction; if evidence is found, consider how to improve the model (e.g., add variables) and repeat.

When comparing different models, also take into account changes in R² and Se.
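The base-case rule and the linear dependency problem above can both be seen in a small sketch (the helper name and the `base` parameter are illustrative, not from the slides): with all eight dummies included, every row sums to 1, which makes the dummy columns collectively identical to the intercept column.

```python
# 0/1 dummy columns for one observation; pass `base` to drop that
# level and use it as the base case (illustrative helper).
def dummies(grade, levels=range(1, 9), base=None):
    return [1 if grade == g else 0 for g in levels if g != base]

# With ALL eight dummies, every row sums to 1 -- the dummy columns add up
# to the intercept's column of ones, which is the linear dependency problem:
assert all(sum(dummies(g)) == 1 for g in range(1, 9))

# Dropping Grade 1 as the base case removes the dependency; a Grade-1
# observation is now the all-zeros row, matching the second table above:
print(dummies(1, base=1))  # [0, 0, 0, 0, 0, 0, 0]
print(dummies(3, base=1))  # [0, 1, 0, 0, 0, 0, 0]
```

This is why the slides say to delete G1: with the base case dropped, no combination of the remaining columns reproduces the intercept column, and the regression can be solved.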
