Dear Stat II students

We have started learning an important topic called regression. It is a vast topic and a
very useful one. The aim of the term paper assignment is to teach, with a first-hand
example, how to develop a statistical study of some important public issue.

To get started, here are some fun sites you can visit to learn stats

A software called R is freely available at similar to S-plus


How are we supposed to get the data?
The short answer is: use the Internet. The R software also comes with all kinds of data.
There are economic data in an R package called Ecdat. Almost all R packages (there are
some 2000 of them) have some illustrative data. This should not be hard for your
generation of Internet-savvy students.
Go to the web and see what is available by way of data. I have created convenient
links to several data sources for you on my web page. You can click on the sites listed
on the right side of my web page for undergraduates. Remember that you will need numerical
data for Y, X and Z. The length of the data series (or sample size, n) must be at least 25 and
no more than 125. The data can be cross-sectional or time series. For time series, the time
points must match: the observations for Y, X and Z must refer to the same period (for
example, the same year, quarter or month of 1970). If you get data
from I understand that they have a way of converting
monthly to quarterly or annual etc. Of course, your data will have to be consistent. Do
not use categorical data (0, 1; male, female; high school, college, post-grad; etc.). Such
data require special handling, which is beyond our scope at this time.

Pick three variables that are related in the sense that X somehow causes Y and Z
also causes Y. Choose one of them (Y) as the dependent variable (representing the effect
of something); X and Z are the two causes. The causes have to be distinct.

PART 1 due date immediately after the spring break, if not before.

The title (describes the study)
Name of author
Abstract (at least 100 words; state the names of Y, X and Z, the data source and n = sample size)

PART 2 Final term paper with self grading sheet filled is Due before Easter Break
Also see
The complete paper should be typed and the pages should be stapled together.
1) The paper should have a title, e.g.:
       Effect of Horsepower and Weight on Fuel Economy of a Car
This should be in all capital letters or, as shown here, with the first letter of each word
capitalized, except for words like “on” and “a”. Center it on the page as the first line. Use a
larger font (say size 14 or 16) and boldface it. You just highlight the text and press the
button for centering and the B button for boldface.
2) Author’s name, affiliation and E-mail address: This should be in small font, centered.
                 John Doe, CBA, Fordham University, Bronx, NY 10458,

Please insert a footnote on the word CBA (or Fordham College) by clicking on insert
menu and then on footnote. Then type: This term paper was written on March 13, 2002
in partial fulfillment of Statistical Decision Making course by Prof. Vinod.

The following line should be Centered
A short abstract (less than 100 words) should describe the issue and state your choice of
Y, X and Z variables. Name the data source. (I want you to keep it simple and have no
more than the Y, X and Z variables.) I want you to use data from business and economics
only. There should be sentences explaining why it is an interesting problem, inviting the
reader to read on. Make the Abstract attractive for the reader. Do not use informal
expressions like “I used your website.” This is not a letter from a student to the teacher. It
should be addressed to the general readership.

In the above example, the issue is the fuel economy of cars and why it is important for
reducing dependence on foreign oil. (Y = mpg, X = horsepower, Z = weight)

1. Introduction (Number the sections and use bold face)
This section introduces the subject matter to the reader. It can repeat some material from
the Abstract, perhaps with greater expansion. Explain what problem you are addressing.
Explain why regression can help resolve the issue. Why should anyone care about
predicting Y from X and Z? Why is it interesting? What is the problem? For example,
you could say that one suspects that Y is related to X and Z but does not know exactly
how. This research explains the problem and offers a solution.

If there is literature related to your project, you may include a literature review here. The
list of references will then be needed at the end of the paper with list in alphabetical
order. Title, journal name, volume, pages etc. all are needed there. For books, the
publisher name, city, year etc. are needed.

2. The Model and the Data:
In this section note that you are proposing a linear regression model to address the
problem stated above. The population relation is denoted by
Y = β0 + β1 X + β2 Z + ε
You can get the beta symbol from the insert menu by clicking on symbol.
Type the above line and then make 0, 1 and 2 subscripts by highlighting them and then
pressing Ctrl and the = sign at the same time.
This formula represents the true population regression equation,
where ε denotes the unknown errors in the model.
After we run the regression (in R) the estimated model is denoted by:
Y = b0 + b1 X + b2 Z + residual
Note that the coefficients in Greek notation (β) are replaced by b to suggest that these are
estimated values. True errors are replaced by “residuals”, which are output by R.
reg1=lm(y~x+z)   #creates an object “reg1” for regression of y on x and z
summary(reg1)    #prints basic results of regression
resid(reg1)      #prints all residuals
plot(reg1)       #gives sophisticated plots of residuals
anova(reg1)      #does analysis of variance
confint(reg1)    #prints confidence intervals for coefficients

Now describe the data carefully and fully. Data Sources should be mentioned here. (e.g.
URL web sites. Bring the data itself as a machine readable file on a diskette if you can!).
Describe your Y, X and Z in order.
Units of measurement: When numbers measured in dollars are converted into millions
of dollars (divide by 1000000) this is called a change in the units of measurement. Excel
is not known to have excellent numerical accuracy. Compute the averages for all
variables. For example, if the averages are X̄ = 299993.12, Ȳ = 0.00456 and Z̄ = 20.3556, the
orders of magnitude of these numbers are too disparate. We need to change the units of
measurement as follows. Multiply all Y values by 10000 and divide all X values by
10000. This will change the averages to X̄ = 29.999312 and Ȳ = 45.6. When you change
the units this way you must remember to note the fact of changed units when you
discuss the results.
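
The change of units described above can be sketched in R as follows. The three short vectors are made-up numbers chosen only so that their averages differ widely in magnitude; they are not data from this handout, and the names x_scaled and y_scaled are my own.

```r
# Made-up illustrative values (an assumption, not real data)
x <- c(250000, 300000, 350000)   # e.g. dollars
y <- c(0.004, 0.005, 0.006)      # e.g. a small ratio
z <- c(18, 20, 22)

# Compare the averages to see how disparate the magnitudes are
c(mean_x = mean(x), mean_y = mean(y), mean_z = mean(z))

# Change the units as in the text: divide X by 10000, multiply Y by 10000
x_scaled <- x / 10000
y_scaled <- y * 10000

# The averages are now of comparable magnitude
c(mean_x = mean(x_scaled), mean_y = mean(y_scaled), mean_z = mean(z))
```

If you run the regression on the rescaled variables, remember that the slope coefficients are then expressed in the new units.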

Basic Descriptive statistics: In R use the following two commands
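
One plausible pair of commands, offered here as an assumption rather than as the commands shown in class, is summary() for the basic statistics and cor() for the correlation coefficients. The sketch below uses R's built-in mtcars data (my choice for illustration, since it matches the fuel economy example) with y = miles per gallon, x = horsepower and z = weight.

```r
# Illustration with the built-in mtcars data (an assumption, not your data)
y <- mtcars$mpg   # fuel economy
x <- mtcars$hp    # horsepower
z <- mtcars$wt    # weight

dat <- data.frame(y, x, z)

summary(dat)           # min, quartiles, median, mean and max of each variable
desc_cor <- cor(dat)   # matrix of pairwise correlation coefficients
desc_cor
```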

Outlier Detection for Y, X and Z:
Whether all observations are essentially similar to each other or not is an important
question. Detection of outliers in the raw data can sometimes reveal data errors. We use
a detection method, which does not use the standard deviation, since the standard
deviation itself is notoriously sensitive to outliers.
Denote the first quartile by Q1 and the third quartile by Q3. Now compute the interquartile range,
IQR = Q3 − Q1, the difference between the quartiles. Next, define the two outlier detection
limits as:
LOWER = Q1 − 1.5*IQR
UPPER = Q3 + 1.5*IQR
If any observation is below the lower limit or above the upper limit (separately for the
variables Y, X and Z) then it is an outlier.
I have written an R function called get.outlier to do this. This was described in class.
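
The get.outlier function itself is not reproduced here. As a rough stand-in, the rule above can be sketched with a small helper of my own, hypothetically named iqr_outliers. It relies on R's default quantile() definition, so the quartiles may differ slightly from a textbook's hand calculation. The data vector is made up for illustration.

```r
# Hypothetical helper (NOT the get.outlier function described in class):
# flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  low <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(low = low, upper = upper, outliers = v[v < low | v > upper])
}

# Made-up numbers; the value 100 is an obvious outlier
v <- c(10, 12, 11, 13, 12, 14, 11, 100)
res <- iqr_outliers(v)
res
```

Apply the same function separately to Y, X and Z.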

I expect your term paper to have a total of four figures describing two basic relations in
your data as follows.
Figure 1 plots the dependent variable Y on vertical axis against regressor variable X on
horizontal axis. A figure should have a title.
e.g.: Figure 1: Scatter plot of Fuel Economy of a Car versus Horsepower
Read the web page on how to do scatter plots. You will lose points if y is not on the
vertical axis. In the discussion of scatter plots you should mention correlation
coefficients and note for example that the correlation is positive as confirmed by the
direction of the scatter plot in Figure 1 (say).

Also draw
Figure 2 plots the dependent variable Y on vertical axis against the second regressor
variable Z on horizontal axis.
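
A minimal sketch of Figures 1 and 2 with base R graphics follows. It assumes the built-in mtcars data as a stand-in for your own Y, X and Z; the titles and axis labels are only examples.

```r
# Stand-in data (an assumption): fuel economy, horsepower and weight from mtcars
y <- mtcars$mpg
x <- mtcars$hp
z <- mtcars$wt

# Figure 1: Y on the vertical axis, X on the horizontal axis
plot(x, y,
     main = "Figure 1: Scatter plot of Fuel Economy of a Car versus Horsepower",
     xlab = "Horsepower", ylab = "Fuel economy (miles per gallon)")
r_yx <- cor(y, x)   # correlation coefficient to mention with Figure 1

# Figure 2: Y on the vertical axis, Z on the horizontal axis
plot(z, y,
     main = "Figure 2: Scatter plot of Fuel Economy of a Car versus Weight",
     xlab = "Weight (1000 lbs)", ylab = "Fuel economy (miles per gallon)")
r_yz <- cor(y, z)   # correlation coefficient to mention with Figure 2

c(r_yx = r_yx, r_yz = r_yz)
```

Note that plot() takes the horizontal-axis variable first, so y must be the second argument.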
In this (second) section, after describing the model, describe the fitted regression
equation with numbers as follows. For example, you might say my fitted equation is:
                       Y = 0.0123 + 2.3345 X − 0.3567 Z + residual
or that the equation for prediction of Y values from X and Z is:
                             Y = 0.0123 + 2.3345 X − 0.3567 Z

This should be centered. The minus sign “-” in MS-Word is too small. Use “insert” in
MS-Word and insert a symbol which looks like “−”. This formula represents the fitted
regression equation. It says that for any reasonable values of X and Z the fitted value of
Y is given by Y = 0.0123 + 2.3345 X − 0.3567 Z. When we write this formula for the
fitted equation, the symbols X and Z are just symbols for the names of the variables. Do
not try to look further for X and Z values. They are generic values supplied by the user of
the equation. In writing the above equation I have given numerical values of the estimated
coefficients: b0 = 0.0123 for the intercept, b1 = 2.3345 for the first slope and b2 = −0.3567 for
the second slope (i.e., the regression coefficients). These come from the
R output of the command: summary(reg1)

The t Stat values in the coefficient table of this output (labeled “t value” by R) are going to be
used later. They are simply the ratio of each coefficient to its corresponding standard error.
Roughly speaking, if the absolute value of the t Stat is larger than 2, the regressor is
statistically significant. Later we make a more formal test by consulting t tables
or the P-values.

The Coefficients column gives the regression intercept b0 and the slope coefficients b1 and b2.
The slope b1 measures the change in Y from a unit change in X, holding Z fixed. The slope b2
measures the change in Y from a unit change in Z, holding X fixed.
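
To see that the t values really are the ratio of coefficient to standard error, here is a small sketch. It again assumes the built-in mtcars data as a stand-in for your own variables; the names ctab and t_by_hand are my own.

```r
# Stand-in data (an assumption): mtcars
y <- mtcars$mpg
x <- mtcars$hp
z <- mtcars$wt

reg1 <- lm(y ~ x + z)
ctab <- coef(summary(reg1))   # coefficient table as a matrix
ctab

# t value = estimate divided by its standard error
t_by_hand <- ctab[, "Estimate"] / ctab[, "Std. Error"]
all.equal(t_by_hand, ctab[, "t value"])

# Rough rule of thumb from the text: |t| larger than 2
abs(t_by_hand) > 2
```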

R-square tells us about the overall fit. All such information comes from the R output of the
summary(reg1) command.

Regression Statistics
R Square            0.712956
Adjusted R Square   0.690876
Standard Error      1.094835
Observations        29

In your paper, please write the numbers from your computer output in place of b0, b1 and
b2. Report and discuss the corresponding Student’s t values. Report the absolute values
of t (i.e., you can ignore the sign). List the R-square and other output. What is the
meaning of R-square in your context? The Multiple R when squared gives the R Square.
Adjusted R Square = 1 − [(n−1)/(n−k)](1−R²), where k = number of coefficients
estimated (in your case k = 3, since we estimate the intercept b0 and the slopes b1 and b2;
this k counts the intercept, unlike the k = 2 regressors used for the F test in Section 3).
Adj-R² is always a bit smaller than R² and is designed to penalize for having too many right
hand side variables.
In R software reg1=lm(y~x+z); summary(reg1); plot(reg1) create and print the
regression results and give all kinds of sophisticated plots for residuals.
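
As a check on the adjusted R-square formula, the following sketch plugs in the numbers from the Regression Statistics table above, taking k = 3 to count the intercept and the two slopes. The variable names are my own.

```r
# Numbers from the Regression Statistics table above
n  <- 29         # observations
k  <- 3          # estimated coefficients: b0, b1 and b2
r2 <- 0.712956   # R Square

adj_r2 <- 1 - ((n - 1) / (n - k)) * (1 - r2)
adj_r2   # about 0.6909, matching Adjusted R Square in the table
```

For your own model, the same quantities are available as summary(reg1)$r.squared and summary(reg1)$adj.r.squared.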
3. Statistical Testing of the Model:
In this third section you should discuss three issues: Is X a good regressor? Is Z a good
regressor? Are X and Z together good for the model? We answer these issues with the
help of the formal statistical tools learned in this course. You have to use two t-tests on
the two regression slope coefficients b1 and b2 and also report results of one F test to
earn full credit.

The first t-test has the null hypothesis H0: β1 = 0 (one slope coefficient at a time). What is the
two-sided alternative? (Ans: Ha: β1 ≠ 0.) Look up the tabulated critical value from the t-table at the
95% level and check if the observed (calculated) t value is closer to zero or in the tail
area. If it is in the tail area we reject the null and conclude that variable X has a
statistically significant effect on Y.
Look up the t table for the 2-sided case such that each of the two tails contains area = 0.025 = α/2.
Note that α is the probability of Type I error, 5% or 0.05, hence half of it is 0.025.
If you reject the null it means the true unknown β1 is not zero, or that the slope is NOT flat,
i.e. X is an important independent (i.e. regressor) variable.

The second t-test is similar, with the null H0: β2 = 0 against Ha: β2 ≠ 0.
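
Instead of the printed t table, the critical value can be obtained from R's qt() function. The sketch below assumes the built-in mtcars data as a stand-in for your own variables and carries out both two-sided t-tests at the 5% level; the names alpha, t_crit, t_obs and reject_null are my own.

```r
# Stand-in data (an assumption): mtcars
y <- mtcars$mpg
x <- mtcars$hp
z <- mtcars$wt
reg1 <- lm(y ~ x + z)

alpha  <- 0.05
# Two-sided test: area alpha/2 in each tail, residual degrees of freedom n - 3
t_crit <- qt(1 - alpha / 2, df = df.residual(reg1))

# Observed t values for the two slopes
t_obs <- coef(summary(reg1))[c("x", "z"), "t value"]

# Reject H0 (slope = 0) when the observed |t| falls in the tail area
reject_null <- abs(t_obs) > t_crit
reject_null

# 95 percent confidence intervals for the coefficients
confint(reg1, level = 0.95)
```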

Your discussion will consist of checking if the selected variables are important in
statistical sense and how important. You will discuss what the results of regression tell
about the power to explain Y from X and Z, why they are important or not important.

The F test is in the output of Excel under Analysis of Variance (ANOVA), but you are
not allowed to use Excel for your term paper.
In R the command is: anova(reg1) if reg1 is the name of the regression object.
             df   SS            MS            F            Significance F
Regression    2   77.40785058   38.70392529   32.2892313   8.98109E-08
Residual     26   31.16525287   1.198663572
Total        28   108.5731034
This gives the “calculated” or “observed” F in the column marked F. We need to compare this
with the tabled value of F from the F probability distribution (F for Fisher). To look up
this tabled value you need to know the column and row where to look. The column is given by
the numerator degrees of freedom (df) associated with the Regression sum of
squares (i.e., 2 in the above table) and the row by the denominator df associated with Residual
(i.e., 26 in the above table). First, which F table to use? This will be the 5 percent table, where
the area in the tail of the F distribution is 0.05. In the above example the tabulated value from
page 536 of your textbook under the column for 2 (numerator df along the row labeled
Regression) and the row for 26 (denominator df along the row labeled Residual) is 3.37.

If the tabled F value is smaller than the observed or calculated value in the ANOVA table, the
overall model is good. In my example 3.37 is much smaller than 32.2892313 in the
column entitled F. Hence the calculated F falls in the tail area. So we reject the null
hypothesis that all slopes are zero. (Your k = 2 since you have two regressors X and Z.)
H0: β1 = β2 = … = βk = 0 (All slopes are zero or flat at the same time)
Ha: At least some βi is nonzero. (The model is worthwhile)
Compute F(k, n−k−1) = [SSregression / k] / [SSerrors / (n − (k + 1))] = MSR / MSE
If observed F > tabulated F, reject the null H0.
Even if the tests on individual β1 or β2 lead to the conclusion that one is significant by the
t test, we are not looking at individual tests here. The F test is for the model as a whole. It
may well happen that we accept the null of the F test and at the same time accept
one regressor as significant. This is an apparent contradiction between the results of the F
test and the individual t tests. This is why theorists ask us to look at the F test as well as the
individual t tests to get a complete picture. The MSR and MSE refer to the numbers from
the above ANOVA table. Look in the column entitled MS. The numbers in MS are simply the
ratio of SS and df.
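
The F calculation can be reproduced from the sums of squares in the ANOVA table above. This sketch uses only those printed numbers; qf() and pf() stand in for the printed F table, and the variable names are my own.

```r
# Numbers from the ANOVA table above
ss_reg <- 77.40785058   # Regression sum of squares
ss_res <- 31.16525287   # Residual sum of squares
k <- 2                  # number of regressors (X and Z)
n <- 29                 # observations

msr <- ss_reg / k               # mean square for regression
mse <- ss_res / (n - (k + 1))   # mean square for error
f_obs <- msr / mse              # observed F

# 5 percent critical value, in place of the printed F table
f_crit <- qf(0.95, df1 = k, df2 = n - k - 1)

# P-value (called Significance F in the table)
p_val <- pf(f_obs, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(f_obs = f_obs, f_crit = f_crit, p_val = p_val)
f_obs > f_crit   # TRUE means reject H0 that all slopes are zero
```

Be aware that anova(reg1) in R lists one row per regressor plus a Residuals row rather than a single Regression row; the overall F statistic and its P-value appear on the last line of summary(reg1).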

In R the anova command gives the P values, so we need not look up F tables.
In the discussion of the paper I will look for the value of adjusted R square. The larger
the R-square, the better the model. It is never larger than unity.

The discussion should mention the F statistic value and its P value.
The discussion should also include the individual regression coefficients, their t-values,
P-values and 95 percent confidence intervals. If the P-value is smaller than 0.05 the
regression coefficient is said to be statistically significant, that is, the particular variable
has a significant effect on the dependent variable. You should discuss this for both X and Z.

Figure 3 is a plot of regression residuals on the vertical axis against X and
Figure 4 is a similar plot of regression residuals against Z.
Where do you get the “residuals” from? In R, if reg1=lm(y~x+z) creates the regression
object, then resid(reg1) will contain the residuals. You can use plot(x,resid(reg1)).
These plots can reveal “lack of fit”. Residuals scattered all over without any particular
pattern are good. If they show a parabolic pattern, it suggests that a nonlinear equation
may have been a better model. If one has time series data and if the adjacent residuals
are close to each other, (autocorrelated) this violates one of the assumptions of the
regression model. Similarly if the residuals systematically fan out they may be
heteroscedastic and fail to satisfy another assumption. Advanced statistical testing for
such properties is beyond the scope of undergraduate students. However, you can do
outlier detection for the residuals. If there are outliers in residuals it means that the
model fits very poorly at those observations.
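
A minimal sketch of Figures 3 and 4 follows, again assuming the built-in mtcars data as a stand-in for your own variables. The dashed horizontal line at zero is my addition to make patterns easier to see.

```r
# Stand-in data (an assumption): mtcars
y <- mtcars$mpg
x <- mtcars$hp
z <- mtcars$wt
reg1 <- lm(y ~ x + z)
res <- resid(reg1)   # regression residuals

# Figure 3: residuals on the vertical axis against X
plot(x, res,
     main = "Figure 3: Regression residuals versus Horsepower",
     xlab = "Horsepower", ylab = "Residuals")
abline(h = 0, lty = 2)   # reference line at zero

# Figure 4: residuals on the vertical axis against Z
plot(z, res,
     main = "Figure 4: Regression residuals versus Weight",
     xlab = "Weight (1000 lbs)", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The same outlier limits used for Y, X and Z can be applied to the vector res.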

In R software, anova(reg1) prints the analysis of variance table
Figure 5 is an optional plot of ŷ (the predicted value of Y) on the vertical axis against the
original Y data itself on the horizontal axis. In R the command is yhat=fitted(reg1)
If this plot looks like a 45-degree line (assuming that the scale is comparable along the
vertical and horizontal axes) you have a good model.
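
The optional Figure 5 can be sketched as below, with the built-in mtcars data assumed as a stand-in. The 45-degree reference line drawn by abline() is my addition.

```r
# Stand-in data (an assumption): mtcars
y <- mtcars$mpg
x <- mtcars$hp
z <- mtcars$wt
reg1 <- lm(y ~ x + z)

yhat <- fitted(reg1)   # predicted values of Y

# Figure 5: predicted Y on the vertical axis against actual Y
plot(y, yhat,
     main = "Figure 5: Predicted versus actual Fuel Economy",
     xlab = "Actual Y", ylab = "Predicted Y")
abline(a = 0, b = 1, lty = 2)   # 45-degree line: intercept 0, slope 1

# The squared correlation between Y and its predictions equals R-square
cor(y, yhat)^2
```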
Conclusion section should have a punch line saying what one learns from the study.

Do not attach entire computer printouts. I need to see the output of summary(reg1)
anova(reg1) etc only. Discuss your output in few words. Shrink the figures on a Xerox
machine if necessary. All figures must have axes labeled. Be sure to use the scatter plot
figures. Do read class notes posted on the web in light of this paper. You will get some
ideas and learn the material better. Illustrative term papers can be followed for guidance.

4. Conclusion and Final Remarks
The final section has this title and discusses your conclusion; your third figure belongs here.
You indicate future work here. For example, if there are outliers you can say that in
future work you will try to get data without outliers, or delete the entire observation
containing the outlier and fit the regression over again. Similarly, if the plots suggest that
a non-linear model might have been better, say so here. You also give the punch line
here, stating what is interesting about the study and what evidence you found.

EXAMPLE: The effects of gasoline prices and average insurance rates on the purchasing
of sport automobiles by consumers over a five year period. X1 is the gasoline
price and X2 is the insurance rate in the same year as the gasoline
price. The obvious hypothesis is that when gasoline prices and insurance rates rise, the
number of cars purchased will drop. X1 = Gasoline Price, X2 = Insurance Rates, Y = Number
of Cars Purchased

EXAMPLE: The experience of one of your classmates might help you.
Mr. D wanted to study recessions and wanted to relate recession to unemployment and
interest rates. I told him that a recession is either present or absent. This means that it is a
categorical (0 or 1) variable. Your Y variable cannot be a categorical variable; it has
to be a bunch of numbers. I told him that the growth rate of GDP can be a numerical proxy
for recession. Growth rate = proportionate change = Y
Y = (GDP in the current quarter MINUS GDP in the previous quarter) / GDP in the previous quarter
X or X1 = interest rate
Z or X2 = unemployment rate
We went to and
We clicked on most frequently asked series.
Annualized growth in GDP is actually already calculated there.
You just need at least 25 data points
The data can be quarters, months, days, whatever.
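
If you ever need to compute the growth rate yourself, the formula above can be sketched in R as follows. The gdp vector holds made-up numbers for four consecutive quarters and is purely illustrative.

```r
# Made-up GDP levels for four consecutive quarters (an assumption, not real data)
gdp <- c(100, 102, 101, 105)

# (GDP in current quarter - GDP in previous quarter) / GDP in previous quarter
growth <- diff(gdp) / head(gdp, -1)
growth
```

Note that the growth series is one observation shorter than the GDP series, so X and Z must be trimmed to the same quarters.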

EXAMPLE: Another student used corruption data link on my web page to pick
Y=GNP per capita
X= adult literacy rate
Z=index of economic freedom
Not all countries have data on these things. She has to select countries that have data on
all three things.

EXAMPLE: from “” website pick 30 ticker symbols for any
companies of interest to you. Let y= earnings per share (EPS), x= market capitalization,
z=wall street beta as a measure of risk of the stock.

To get credit for all your work, attach this to the paper and claim all you have done. It
will be audited by one of you and then I will verify it.
3-part Checklist to be attached at the front of your paper and self-grading
For exact due dates see
PART 1 your name and title of the paper.
1) Is the title sufficiently descriptive? Is it typed centered, bold, properly capitalized, in large font and
along the top line?
0.5 points. I claim full credit for this [ ]
2) Is your name, affiliation and E-mail given along the second and third lines?
0.5 points. I claim full credit for this [ ]
3) Is there the (*) footnote with the date, the phrase “partial fulfillment,” etc.?
0.5 points. I claim full credit for this [ ]
4) Is the Abstract LESS THAN 125 words and descriptive of the model? (Please do give data sources in the
initial version due a couple of weeks before the Easter break, but delete them from the final version of the
abstract attached to the paper due just before the break, where the data information goes in the data
section.)
2.5 points. I claim full credit for this [ ]
5) Are there exactly four numbered sections with bold titles present? How many pages? (Cut a point if
raw data are NOT attached in the program and data appendix).
0.5 points. I claim full credit for this [ ]
Points earned in PART 1 so far are:          Out of 4.5

6) Are data sources mentioned?
0.5 points. I claim full credit for this [ ]
7) Are descriptive stats including correlation coefficients mentioned?
0.5 points. I claim full credit for this [ ]
8) Is outlier detection done for Y, X and Z variables?
1 point. I claim full credit for this [ ]
9) Are 4 figures present? Fig. 1 has Y vertical and X horizontal (Y vs. X), Fig. 2 has Y vs. Z,
Fig. 3 has residuals vs. X and Fig. 4 has residuals vs. Z. Note that if the residuals do not show any particular
pattern, the model as a whole is good.
0.5 points. I claim full credit for this [ ]
10) Are the figures labeled correctly?
0.5 points. I claim full credit for this [ ]
11) Are the correct variables represented on the vertical axis?
0.5 points. I claim full credit for this [ ]
12) Are all four figures explicitly discussed in the text?
0.5 points. I claim full credit for this [ ]
Points earned in PART 2 so far are:             Out of 4

13) Is the regression equation with equation numbers reported? e.g.:
              Y = 0.0123 + 2.3345 X − 0.3567 Z + residual (numbers will be different for each)
14) Is it centered? It should be centered.
0.5 points. I claim full credit for this [ ]
15) Are there two t tests? Did the student correctly look up critical t values from the t table in the textbook
or from R functions?
16) The NULL and ALTERNATIVE hypotheses should be correctly stated for each test.
1.5 points. I claim full credit for this [ ]
17) Is there one F test? Did the student correctly look up critical values for F from the F table in the
textbook?
1 point. I claim full credit for this [ ]
18) Are the t tests and F tests discussed somewhere in the text?
0.5 points. I claim full credit for this [ ]
19) You should mention possible future work to improve the model in the conclusion section.
Is there a punch line for the conclusion? Usually this will comprise the gist of the following 3 things:
The X (spell out the name) variable has (has not) a statistically significant effect on Y (spell out the name).
The Z (spell out the name) variable has (has not) a statistically significant effect on Y (spell out the name).
The overall model is (is not) statistically significant. Note that possible future work is not called the punch
line.
1 point. I claim full credit for this [ ]
Total Points earned in PART 3 are:                Out of 4.5
Grand Total of All points claimed are:        Out of 13. Auditor to state his itemized scores in red pen and
sign. Grand total of points given by the auditor are:     Out of 13
If you claim more points than you deserve after I audit your term paper, you will lose at
least twice as many points as the task is worth.
Audited by: write the Name in all caps and signature of the Auditor student. Careful
audit (finding errors or omissions) gets the auditor extra credit points. I believe that
critical reading of other people’s work is an important part of your training. So go
through the checklist carefully.
