
Dear Stat II students,

We have started learning an important topic called regression. This is a vast and very useful topic. The aim of the term paper assignment is to teach, with a first-hand example, how to develop a statistical study of some important public issue.

To get started, here are some fun sites you can visit to learn statistics:
http://www.stat.sc.edu/~west/javahtml/Regression.html
http://www.stat.sc.edu/~west/javahtml/ConfidenceInterval.html

A software package called R, similar to S-plus, is freely available at http://www.r-project.org/

TERM-PAPER ASSIGNMENT:

How are we supposed to get the data? The short answer is: use the Internet. The R software also comes with all kinds of data. There are economic data in the R package called Ecdat. Almost all R packages (there are some 2000 of them) have some illustrative data. This should not be hard for your generation of Internet-savvy students. Go to the web and see what is available by way of data. I have created convenient links to several data sources for you on my web page. You can click on the sites referred to on the right side of my web page for undergraduates.

Remember you will need numerical data for Y, X and Z. The length of the data series (or sample size, n) must be at least 25 and no more than 125. The data can be cross-sectional or time series. The time points must match; that is, an observation labeled 1970 (year, quarter or month) must refer to the same period for Y, X and Z. If you get data from http://www.economagic.com, I understand that they have a way of converting monthly data to quarterly or annual, etc. Of course, your data will have to be consistent.

Do not use categorical data (0, 1; male, female; high school, college, post-grad; etc.). Such data require special handling, which is beyond our scope at this time. Pick three variables that are related in the sense that X somehow causes Y and Z also causes Y. Choose one of them (Y) as the dependent variable (representing the effect of something); X and Z are the two causes.
The causes have to be distinct.

PART 1 is due immediately after the spring break, if not before. It consists of:
The title (describes the study)
Name of author
Abstract (at least 100 words; state the names of y, x and z, the data source and n = sample size)

PART 2, the final term paper with the self-grading sheet filled in, is due before Easter Break. Also see http://www.fordham.edu/economics/vinod/announcements.doc

The complete paper should be typed and the pages should be stapled together.

1) The paper should have a title (e.g., Effect of Horsepower and Weight on Fuel Economy of a Car). This should be in all capital letters or, as shown here, with the first letter of each word capitalized, except for words like “on” and “a”. Center it on the page as the first line. Use a larger font (say size 14 or 16) and boldface it. You just highlight the text and press the button for centering and B for boldface.

2) Author’s name, affiliation and E-mail address: this should be in small font, centered (e.g., John Doe, CBA, Fordham University, Bronx, NY 10458, E-mail doe@fordham.edu). Please insert a footnote on the word CBA (or Fordham College) by clicking on the Insert menu and then on Footnote. Then type: This term paper was written on March 13, 2002 in partial fulfillment of Statistical Decision Making course by Prof. Vinod.

The following line should be centered:

ABSTRACT

A short abstract (less than 100 words) should describe the issue and state your choice of y, X and Z variables. Name the data source. (I want you to keep it simple and have no more than the y, X and Z variables.) I want you to use data from business and economics only. There should be sentences explaining why it is an interesting problem, inviting the reader to read on. Make the Abstract attractive for the reader. Do not use informal expressions like “I used your website.” This is not a letter from a student to the teacher. It should be addressed to the general readership. In the above example, the issue is the fuel economy of cars and why it is important for reducing dependence on foreign oil.
(Y = mpg, X = horsepower, Z = weight)

1. Introduction (Number the sections and use boldface.)

This section introduces the subject matter to the reader. It can repeat some material from the Abstract, perhaps with greater expansion. Explain what problem you are addressing. Explain why regression can help resolve the issue. Why should anyone care about predicting Y from X and Z? Why is it interesting? What is the problem? For example, you could say that one suspects that y is related to x and z but does not know exactly how. This research explains the problem and offers a solution. If there is literature related to your project, you may include a literature review here. A list of references will then be needed at the end of the paper, in alphabetical order. Title, journal name, volume, pages, etc. are all needed there. For books, the publisher name, city, year, etc. are needed.

2. The Model and the Data

In this section note that you are proposing a linear regression model to address the problem stated above. The population relation is denoted by

Y = β0 + β1 X + β2 Z + ε

You can get the beta symbol from the Insert menu by clicking on Symbol. Type the above line and then make 0, 1 and 2 subscripts by highlighting them and then pressing Ctrl and the = sign at the same time:

Y = β₀ + β₁ X + β₂ Z + ε

This formula represents the true population regression equation, where ε denotes the unknown errors in the model. After we run the regression (in R) the estimated model is denoted by:

Y = b0 + b1 X + b2 Z + residual

Note that the coefficients in Greek notation are replaced by b to suggest that these are estimated values. True errors are replaced by “residuals”, which are output by R.
reg1=lm(y~x+z) #creates an object “reg1” for regression of y on x and z
summary(reg1) #prints basic results of regression
resid(reg1) #prints all residuals
plot(reg1) #gives sophisticated plots of residuals
anova(reg1) #does analysis of variance
confint(reg1) #prints confidence intervals for coefficients

Now describe the data carefully and fully. Data sources should be mentioned here (e.g., URLs of web sites. Bring the data itself as a machine-readable file on a diskette if you can!). Describe your Y, X and Z in order.

Units of measurement: When numbers measured in dollars are converted into millions of dollars (divide by 1000000), this is called a change in the units of measurement. Excel is not known to have excellent numerical accuracy. Compute the averages for all variables. For example, if X̄ = 299993.12, Ȳ = 0.00456 and Z̄ = 20.3556, the orders of magnitude of these numbers are too disparate. We need to change the units of measurement as follows. Multiply all Y values by 10000 and divide all X values by 10000. This will change the averages to X̄ = 29.999312 and Ȳ = 45.6. When you change the units this way you must remember to note the fact of changed units when you discuss the results.

Basic descriptive statistics: In R use the following two commands:
library(fBasics)
basicStats(cbind(y,x,z))

Outlier Detection for Y, X and Z: Whether all observations are essentially similar to each other or not is an important question. Detection of outliers in the raw data can sometimes reveal data errors. We use a detection method that does not use the standard deviation, since the standard deviation itself is notoriously sensitive to outliers. Denote the first quartile by Q1 and the third quartile by Q3. Now compute the inter-quartile range, IQR = Q3 − Q1, or the difference between the quartiles.
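As a minimal sketch of the quartile-based check, the R code below computes Q1, Q3 and the IQR for one variable and flags values outside Q1 − 1.5*IQR and Q3 + 1.5*IQR. The helper name iqr_outliers and the sample numbers are made up for illustration, and this is not the function described in class. Note also that R's default quantile() rule may give slightly different quartiles than a hand calculation from a textbook.

```r
# Hypothetical helper for illustration; the name and the sample data are assumptions.
iqr_outliers <- function(v) {
  # quantile() uses R's default (type 7) rule for the quartiles
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]        # inter-quartile range Q3 - Q1
  low <- q[1] - 1.5 * iqr   # lower detection limit
  upper <- q[2] + 1.5 * iqr # upper detection limit
  list(low = low, upper = upper, outliers = v[v < low | v > upper])
}

y <- c(10, 12, 11, 13, 12, 14, 11, 40)  # made-up data with one suspicious value
res <- iqr_outliers(y)
print(res)
```

The same call would be repeated separately for Y, X and Z.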
The two outlier detection limits are defined as:

LOW = Q1 − 1.5*IQR
UPPER = Q3 + 1.5*IQR

If any observation is below the lower limit or above the upper limit (separately for the variables Y, X and Z) then there is an outlier. I have written an R function called get.outlier to do this. This was described in class.

I expect your term paper to have a total of four figures describing the basic relations in your data as follows. Figure 1 plots the dependent variable Y on the vertical axis against the regressor variable X on the horizontal axis. A figure should have a title, e.g.:

Figure 1: Scatter plot of Fuel Economy of a Car versus Horsepower

Read the web page on how to do scatter plots. You will lose points if y is not on the vertical axis. In the discussion of scatter plots you should mention correlation coefficients and note, for example, that the correlation is positive as confirmed by the direction of the scatter plot in Figure 1 (say). Also draw Figure 2, which plots the dependent variable Y on the vertical axis against the second regressor variable Z on the horizontal axis.

In this (second) section, after describing the model, describe the fitted regression equation with numbers as follows. For example, you might say my fitted equation is:

Y = 0.0123 + 2.3345 X − 0.3567 Z + residual

or that the equation for prediction of Y values from X and Z is:

Ŷ = 0.0123 + 2.3345 X − 0.3567 Z

This should be centered. The minus sign “-” in MS-Word is too small. Use “Insert” in MS-Word and insert a symbol which looks like “−”. This formula represents the fitted regression equation. It says that for any reasonable values of X and Z the fitted value of Y is given by Ŷ = 0.0123 + 2.3345 X − 0.3567 Z. When we write this formula for the fitted equation, the symbols X and Z are just symbols for the names of the variables. Do not try to look further for X and Z values. They are generic values given by the user of the equation.
In writing the above equation I have given numerical values of the estimated coefficients: b0 = 0.0123 for the intercept, b1 = 2.3345 for the first slope and b2 = −0.3567 for the second slope (i.e., the regression coefficients). These come from the R output of the command summary(reg1).

The t Stat values in this table are going to be used later. They are simply the ratio of each coefficient to its corresponding standard error. Roughly speaking, if the absolute value of the t Stat is larger than 2, it is a statistically significant regressor. Later we make a more formal test by consulting t tables or the P-values. The Coefficients column gives the regression intercept b0 and the slope coefficients b1 and b2. The slope b1 measures the change in Y from a unit change in X. The slope b2 measures the change in Y from a unit change in Z. R-square tells about the overall fit. All such information comes from the R output of the summary(reg1) command.

Regression Statistics
R Square            0.712956
Adjusted R Square   0.690876
Standard Error      1.094835
Observations        29

In your paper, please write the numbers from your computer output in place of b0, b1 and b2. Report and discuss the corresponding Student’s t values. Report the absolute values of t (i.e., you can ignore the sign). List the R-square and other output. What is the meaning of R-square in your context? The Multiple R when squared gives the R Square.

Adjusted R Square = 1 − [(n − 1)/(n − k)](1 − R²)

where k = number of regression coefficients estimated (in your case k = 3, since we estimate the intercept b0 and the slopes b1 and b2). With the sample output above, n = 29 and R² = 0.712956 give 1 − (28/26)(0.287044) = 0.690876. Adj-R² is always a bit smaller than R² and is designed to penalize for having too many right-hand-side variables.

In R software, reg1=lm(y~x+z); summary(reg1); plot(reg1) create and print regression results and give all kinds of sophisticated plots for residuals.

3. Statistical Testing of the Model

In this third section you should discuss three issues:
Is X a good regressor?
Is Z a good regressor?
Are X and Z together good for the model?
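The three questions above map onto quantities stored in the summary() object. The sketch below is an illustration only: it assumes R's built-in mtcars data (mpg, hp, wt) as a stand-in for the fuel-economy example, which is a convenience and not a required data set. It also recomputes the adjusted R-square by hand, counting k = 3 estimated coefficients (the intercept plus two slopes).

```r
# Illustration only: mtcars ships with R; using it here is an assumption.
y <- mtcars$mpg   # fuel economy (dependent variable)
x <- mtcars$hp    # horsepower
z <- mtcars$wt    # weight

reg1 <- lm(y ~ x + z)
s <- summary(reg1)

coef_table <- coef(s)                  # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_abs <- abs(coef_table[, "t value"])  # absolute t values to report
p_vals <- coef_table[, "Pr(>|t|)"]     # P-values for the individual t tests

f_stat <- s$fstatistic                 # observed F with numerator and denominator df

# Adjusted R-square by hand: 1 - [(n - 1)/(n - k)](1 - R^2), with k = 3 coefficients
n <- length(y)
k <- 3
adj_r2 <- 1 - ((n - 1) / (n - k)) * (1 - s$r.squared)

print(round(coef_table, 4))
print(f_stat)
print(c(r2 = s$r.squared, adj_r2_by_hand = adj_r2, adj_r2_from_R = s$adj.r.squared))
```

The rows labeled x and z in the coefficient table speak to the first two questions, and the F statistic speaks to the third.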
We answer these issues with the help of the formal statistical tools learned in this course. You have to use two t-tests on the two regression slope coefficients b1 and b2 and also report the results of one F test to earn full credit.

The first t-test has the null hypothesis H0: β1 = 0 (one slope coefficient at a time). What is the two-sided alternative? (Ans: Ha: β1 ≠ 0.) Look up the tabulated critical value from the t-table at the 95% level and check whether the observed (calculated) t value is closer to zero or in the tail area. If it is in the tail area we reject the null and conclude that variable X has a statistically significant effect on Y. Look up the t table for the 2-sided case such that each of the two tails contains area = 0.025 = α/2. Note that α is the probability of Type I error, 5% or 0.05, hence half of it is 0.025. If you reject the null it means the true unknown β1 is not zero, or that the slope is NOT flat, i.e., X is an important independent (i.e., regressor) variable.

The second t-test is similar, with the null H0: β2 = 0 against Ha: β2 ≠ 0.

Your discussion will consist of checking whether the selected variables are important in a statistical sense and how important. You will discuss what the results of the regression tell about the power to explain Y from X and Z, and why they are important or not important.

The F test is in the output of Excel under Analysis of Variance (ANOVA), but you are not allowed to use Excel for your term paper. In R the command is anova(reg1) if reg1 is the name of the regression object.

ANOVA
             df   SS            MS            F            Significance F
Regression   2    77.40785058   38.70392529   32.2892313   8.98109E-08
Residual     26   31.16525287   1.198663572
Total        28   108.5731034

This gives the “calculated” or “observed” F in the column marked F. We need to compare this with the tabled value of F from the F probability distribution (F for Fisher). To look up this tabled value you need to know the column and row where to look.
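In place of printed tables, the tabulated critical values can be sketched with R's qt() and qf() functions. The numbers below are copied from the sample ANOVA table above (df of 2 and 26, MS of 38.70392529 and 1.198663572); the variable names are only illustrative.

```r
# Degrees of freedom and mean squares copied from the sample ANOVA table
df_num <- 2             # numerator df (Regression row)
df_den <- 26            # denominator df (Residual row)
msr <- 38.70392529      # MS for Regression
mse <- 1.198663572      # MS for Residual

f_obs <- msr / mse                                      # observed F = MSR / MSE
f_crit <- qf(0.95, df_num, df_den)                      # 5 percent tabulated F
p_val <- pf(f_obs, df_num, df_den, lower.tail = FALSE)  # tail area beyond the observed F

t_crit <- qt(0.975, df_den)   # two-sided 5 percent critical t (0.025 in each tail)

cat("observed F:", f_obs, " critical F:", f_crit, " P-value:", p_val, "\n")
cat("critical t:", t_crit, "\n")
cat("reject H0 of the F test:", f_obs > f_crit, "\n")
```

An observed absolute t value from summary(reg1) would be compared with t_crit in the same way that f_obs is compared with f_crit.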
The column is indexed by the numerator degrees of freedom (df) associated with the Regression sum of squares (i.e., 2 in the above table) and the row by the denominator df associated with Residual (i.e., 26 in the above table). First, which F table to use? This will be the 5 percent table, where the area in the tail of the F distribution is 0.05. In the above example the tabulated value from page 536 of your textbook under the column for 2 (numerator df along the row labeled Regression) and the row for 26 (denominator df along the row labeled Residual) is 3.37. If the tabled F value is smaller than the observed or calculated value in the ANOVA table, the overall model is good. In my example 3.37 is much smaller than 32.2892313 in the column entitled F. Hence the calculated F falls in the tail area. So we reject the null hypothesis that all slopes are zero.

(Your k = 2 since you have two regressors X and Z.)
H0: β1 = β2 = … = βk = 0 (All slopes are zero or flat at the same time)
Ha: At least some βi is nonzero. (The model is worthwhile)

Compute F(k, n−k−1) = [SSregression / k] / [SSerrors / (n − (k + 1))] = MSR / MSE

If observed F > tabulated F, reject the null H0.

Even if the tests on the individual β1 or β2 lead to the conclusion that one is significant by the t test, we are not looking at individual tests here. The F test is for the model as a whole. It may well happen that we accept the null of the F test and at the same time accept one regressor as significant. This is an apparent contradiction between the results of the F test and the individual t tests. This is why theorists ask us to look at the F test as well as the individual t tests to get a complete picture. The MSR and MSE refer to the numbers from the above ANOVA table. Look in the column entitled MS. The numbers in MS are simply the ratio of SS to df. In R the anova command gives the P-values, so we need not look up F tables.

In short, SPECIFICALLY: In the discussion of the paper I will look for the value of adjusted R-square. The larger the R-square, the better the model.
It is never larger than unity. The discussion should mention the F statistic value and its P-value. The discussion should also include the individual regression coefficients, their t-values, P-values and 95 percent confidence intervals. If the P-value is smaller than 0.05 the regression coefficient is said to be statistically significant, that is, the particular variable has a significant effect on the dependent variable. You should discuss this for both variables.

Figure 3 is a plot of regression residuals on the vertical axis against X and Figure 4 is a similar plot of regression residuals against Z. Where do you get the “residuals” from? In R, if reg1=lm(y~x+z) creates the regression object, then resid(reg1) will contain the residuals. You can use plot(x,resid(reg1)). These plots can reveal “lack of fit”. Residuals scattered all over without any particular pattern are good. If they show a parabolic pattern, it suggests that a nonlinear equation may have been a better model. If one has time series data and the adjacent residuals are close to each other (autocorrelated), this violates one of the assumptions of the regression model. Similarly, if the residuals systematically fan out they may be heteroscedastic and fail to satisfy another assumption. Advanced statistical testing for such properties is beyond the scope of undergraduate students. However, you can do outlier detection for the residuals. If there are outliers in the residuals it means that the model fits very poorly at those observations. In R software, anova(reg1) prints the analysis of variance table.

Figure 5 is an optional plot of ŷ (the predicted value of Y) on the vertical axis against the original Y data itself on the horizontal axis. In R the command is yhat=fitted(reg1). If this plot looks like a 45-degree line (assuming that the scale is comparable along the vertical and horizontal axes) you have a good model.

The Conclusion section should have a punch line saying what one learns from the study.
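The residual plots for Figures 3 and 4 and the optional Figure 5 can be sketched as follows. This is an illustration only: it assumes R's built-in mtcars data (mpg, hp, wt) as a stand-in for the fuel-economy example, and the axis labels and titles are placeholders to be replaced with your own variable names.

```r
# Illustration only: mtcars ships with R; using it here is an assumption.
y <- mtcars$mpg   # fuel economy (dependent variable)
x <- mtcars$hp    # horsepower
z <- mtcars$wt    # weight

reg1 <- lm(y ~ x + z)
res <- resid(reg1)     # regression residuals
yhat <- fitted(reg1)   # predicted values of Y

# Figure 3: residuals (vertical axis) against X (horizontal axis)
plot(x, res, xlab = "Horsepower (X)", ylab = "Residuals",
     main = "Figure 3: Residuals versus Horsepower")
abline(h = 0, lty = 2)   # reference line at zero

# Figure 4: residuals (vertical axis) against Z (horizontal axis)
plot(z, res, xlab = "Weight (Z)", ylab = "Residuals",
     main = "Figure 4: Residuals versus Weight")
abline(h = 0, lty = 2)

# Figure 5 (optional): predicted Y (vertical) against actual Y (horizontal)
plot(y, yhat, xlab = "Actual Y", ylab = "Predicted Y",
     main = "Figure 5: Predicted versus Actual Fuel Economy")
abline(0, 1, lty = 2)    # 45-degree line for comparison
```

The dashed lines are only visual aids: residuals should scatter around zero without a pattern, and the Figure 5 points should lie near the 45-degree line when the model fits well.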
Do not attach entire computer printouts. I need to see only the output of summary(reg1), anova(reg1), etc. Discuss your output in a few words. Shrink the figures on a Xerox machine if necessary. All figures must have their axes labeled. Be sure to use the scatter plot figures. Do read the class notes posted on the web in light of this paper. You will get some ideas and learn the material better. Illustrative term papers can be followed for guidance.

4. Conclusion and Final Remarks

The final section has this title and discusses your conclusion; your third figure belongs here. You indicate future work here. For example, if there are outliers you can say that in future work you will try to get data without outliers, or delete the entire observation containing the outlier and fit the regression over again. Similarly, if the plots suggest that a non-linear model might have been better, say so here. You also give the punch line here, stating what is interesting about the study and what evidence you found.

EXAMPLE: The effects of gasoline prices and average insurance rates on the purchasing of sport automobiles by consumers over a five-year period. X1 is going to be the gasoline price and X2 is going to be the insurance rate in the same year as the gasoline price. The obvious hypothesis is that when gasoline prices and insurance prices rise, the number of cars purchased will drop.
X1 = Gasoline Price, X2 = Insurance Rates, Y = Number of Cars Purchased

EXAMPLE: The experience of one of your classmates might help you. Mr. D wanted to study recessions and wanted to relate recession to unemployment and interest rates. I told him that a recession is either present or absent. This means that it is a categorical (0 or 1) type of variable. Your y variable cannot be a categorical variable; it has to be a bunch of numbers. I told him that the growth rate of GDP can be a proxy real variable for recession.
Growth rate = proportionate change:
Y = (GDP in quarter #2 minus GDP in the previous quarter) / GDP in the previous quarter
X or X1 = interest rate
Z or X2 = unemployment rate

We went to www.economagic.com and clicked on the most frequently asked series. Whoa! Annualized growth in GDP is actually already calculated there. You just need at least 25 data points. The data can be quarters, months, days, whatever.

EXAMPLE: Another student used the corruption data link on my web page to pick Y = GNP per capita, X = adult literacy rate, Z = index of economic freedom. Not all countries have data on these things. She has to select countries that have data on all three things.

EXAMPLE: From the “finance.Yahoo.com” website pick 30 ticker symbols for any companies of interest to you. Let y = earnings per share (EPS), x = market capitalization, z = Wall Street beta as a measure of the risk of the stock.

To get credit for all your work, attach this to the paper and claim all you have done. It will be audited by one of you and then I will verify it.

3-part Checklist to be attached at the front of your paper and self-grading

For exact due dates see http://www.fordham.edu/economics/vinod/announcements.doc

PART 1: your name and title of the paper.

1) Is the title sufficiently descriptive? Is it typed centered, bold, properly capitalized, in large font and along the top line? 0.5 points. I claim full credit for this

2) Is there your Name, affiliation and E-mail along the second and third lines? 0.5 points. I claim full credit for this

3) Is there the (*) Footnote with the date, the phrase “partial fulfillment,” etc.? 0.5 points. I claim full credit for this

4) Is the Abstract LESS THAN 125 words and descriptive of the model? (Please do give data sources in the initial version due a couple of weeks before the Easter break, but delete them from the final version of the abstract attached to the paper due just before the break, where the data information goes in the data appendix.) 2.5 points.
I claim full credit for this

5) Are there exactly four numbered sections with bold titles present? How many pages? (Cut a point if raw data are NOT attached in the program and data appendix.) 0.5 points. I claim full credit for this

Points earned in PART 1 so far are: ____ out of 4.5.

PART 2

6) Are data sources mentioned? 0.5 points. I claim full credit for this

7) Are descriptive stats including correlation coefficients mentioned? 0.5 points. I claim full credit for this

8) Is outlier detection done for the Y, X and Z variables? 1 point. I claim full credit for this

9) Are 4 figures present? Fig. 1 has Y vertical and X horizontal (Y vs. X), Fig. 2 has Y vs. Z, Fig. 3 has residuals vs. X and Fig. 4 has residuals vs. Z. Note that if the residuals do not show any particular pattern, the model as a whole is good. 0.5 points. I claim full credit for this

10) Are the figures labeled correctly? 0.5 points. I claim full credit for this

11) Are the correct variables represented on the vertical axis? 0.5 points. I claim full credit for this

12) Are all four figures explicitly discussed in the text? 0.5 points. I claim full credit for this

Points earned in PART 2 so far are: ____ out of 4.

PART 3

13) Is the regression equation with equation numbers reported? e.g.: Y = 0.0123 + 2.3345 X − 0.3567 Z + residual (numbers will be different for each paper)

14) Is it centered? It should be centered. 0.5 points. I claim full credit for this

15) Are there two t tests? Did the student correctly look up critical t values from the t table in the textbook or from R functions?

16) The NULL and ALTERNATIVE hypotheses should be correctly stated for each test. 1.5 points. I claim full credit for this

17) Is there one F test? Did the student correctly look up critical values for F from the F table in the textbook? 1 point. I claim full credit for this

18) Are the t tests and F tests discussed somewhere in the text?
0.5 points. I claim full credit for this

19) You should mention possible future work to improve the model in the conclusion section. Is there a punch line for the conclusion? Usually this will comprise the gist of the following 3 things:
The X (spell out the name) variable has (has not) a statistically significant effect on Y (spell out the name).
The Z (spell out the name) variable has (has not) a statistically significant effect on Y (spell out the name).
The overall model is (is not) statistically significant.
Note that possible future work is not called the punch line. 1 point. I claim full credit for this

Total points earned in PART 3 are: ____ out of 4.5.

Grand total of all points claimed is: ____ out of 13.

Auditor to state his itemized scores in red pen and sign. Grand total of points given by the auditor is: ____ out of 13.

If you claim more points than you deserve after I audit your term paper, you will lose at least twice as many points as the task is worth.

Audited by: write the name in all caps and the signature of the auditor student. A careful audit (finding errors or omissions) gets the auditor extra credit points. I believe that critical reading of other people’s work is an important part of your training. So go through the checklist carefully.

