# 9

```					d:\docstoc\working\pdf\70d22756-d22f-42f1-b844-a36306e897e5.doc DRAFT Page 1 of 9

9 Simple Linear Regression
9.1 Purposes
Regression is about several statistical operations:
1.      Drawing lines or curves through data to gain understanding
2.      Predicting the value of one variable from another
3.      Accounting for factors in complex situations

9.2 What do you need to do regression?
To run a regression, you need data on two or more variables that are related
through a common variable, so that each case supplies one value of each variable.

Example 9.1 Life Expectancy and GDP
 id  Country          Life Expectancy Index   Gross Domestic Product Index
  2  Norway                              89                             93
  3  United States                       86                             95
  4  Australia                           99                             90
  5  Iceland                             90                             92
  6  Sweden                              90                             89
  …
172  Burkina Faso                        33                             36
173  Niger                               40                             33
174  Sierra Leone                        22                             25

The two variables are Life Expectancy Index and Gross Domestic Product (GDP) Index.
The common variable through which they are related is Country.

9.3 Start simple, with simple linear regression
The basic model for regression is the simple linear regression. If the scatterplot
of the two variables that you believe to be related is more or less like a straight line, then
you can consider fitting a straight line to it. The data of Example 9.1 have a scatterplot
that looks a bit like a straight-line relationship (Figure 9.1).

Created 3/8/01 10:06 AM        DRAFT H & L Spirer              Last save 10/14/2011 8:46:00 PM

Figure 9.1 Scatterplot of the data of Example 9.1
[Scatterplot not reproduced. Vertical axis: Life Expectancy Index, 0 to 1.
Horizontal axis: GDP Index, 0 to 1.]

9.4 What is a straight line?
What a question to ask! You certainly know what a straight line looks like.
But to use regression, you have to know how to specify a particular straight line
on a scatterplot. To do so, you must state a simple algebraic formula, Y=B*X+A.
Y is the variable on the vertical axis (in this case Life Expectancy Index) and X is
the variable on the horizontal axis (in this case GDP Index).
B is called the coefficient of X (or of the variable, if its name is known). A is
called the intercept. The values of these two coefficients are easily obtained from
computing devices, which rapidly perform the computations for you. The computations
can get quite complex and time-consuming if the data set is of a practical size; no one in
their right mind tries to calculate these coefficients manually.
B is also called the slope, because it measures the rate at which Y changes for a
unit change in X. A is called the intercept because it is where the line intercepts the
vertical (Y) axis when X=0.
The variable on the vertical axis is called the dependent variable and the variable
on the horizontal axis is called the independent variable. These slightly misleading
names persist, but many people avoid confusion by talking about the Y variable and the
X variable.
When you have an equation that gives you a straight line that seems to represent
the relationship you see on the scatterplot, it is called a regression line.


9.5 What does the regression line look like for Figure 9.1?
Figure 9.2 Scatterplot with regression line
[Scatterplot not reproduced. Vertical axis: Life Expy Index, 0.00 to 1.00.
Horizontal axis: GDP, 0.00 to 1.00.]

Do you feel that this line gives you a good idea of the relationship? If a particular
country made a radical improvement in its GDP, would you be willing to use this line
to predict the new value of its Life Expectancy Index?

9.6 What is the algebraic equation for our line?
To obtain the values of A and B, it is necessary to perform a set of computations
on the data, which we do not discuss here (see Appendix F for instructions using
EXCEL).
If you have the computer perform the computations for you, it gives you values
of B=0.82 and A=0.17. Thus the algebraic equation for this line is:
Y=0.82X+0.17; or in words rather than symbols,
Life Expectancy Index=(0.82)*GDP Index+0.17
Now you can answer the two questions in the preceding section:
Do you feel that this line shows the general nature of the relationship between
these two variables? Most people would say “yes,” but you can draw your own
conclusion.
If your answer to the first question is “yes,” then you should be willing to use this
equation to predict a new value of the Life Expectancy Index. For example, if a country’s
GDP Index were 0.6, then the predicted value of the Life Expectancy Index is:
Predicted value of Index=(0.82)*0.6+0.17
=0.492+0.17
=0.66


9.7 Why not some other line?
You are certainly free to draw a line on the scatterplot that you like better. We call
such a line the eyeballed line, because you simply draw what looks to you like a good
line. You could then determine its equation using college algebra methods and use it
just as we used the regression equation.
The regression line that we have shown has several advantages:
First, it is computed according to statistical principles that make it “best” in
certain analytical ways: of all possible straight lines, it has the smallest sum of
squared vertical distances between the points and the line. This is why you will
sometimes see simple linear regression referred to as least-squares regression.
Second, everyone who calculates the simple linear regression line will get the
same equation. That is not true for eyeballed lines, where everyone will have his or
her own line!
Third, you can calculate and automatically plot this line using any of a variety
of statistical programs, business analysis programs, database programs and even pocket
calculators. If you enter the data correctly, you will get the same equation using any of
these tools.

9.8 Interpreting the output of computing devices
If you have a bit of experience, it is an easy matter to enter data and to call up a
program to carry out a linear regression. What is difficult for the beginner is interpreting
the outputs. For a complete regression analysis, a skilled analyst will use a great
many tools. Most computing programs produce a large number of outputs that are
potentially useful to such people, but may not be of interest to less skilled users.
We will show you typical outputs and how to interpret them to get the limited set
of information relevant to you at this time.


Figure 9.3 Regression output from EXCEL for Example 9.1

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.828589987
R Square             0.686561367
Standard Error       0.103256531
Observations                 174

ANOVA
              df            SS            MS             F   Significance F
Regression     1   4.016893786   4.016893786   376.7517545      3.41342E-45
Residual     172   1.833848743   0.010661911
Total        173   5.850742529

            Coefficients   Standard Error        t Stat       P-value     Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    0.167483287      0.027568919   6.075076237   7.73731E-09   0.113066276   0.221900299   0.113066276   0.221900299
GDP Index    0.823672922      0.042435287   19.41009414   3.41342E-45   0.739911877   0.907433967   0.739911877   0.907433967

RESIDUAL OUTPUT

Observation   Predicted Life Expy Index      Residuals
          1                 0.917025646   -0.017025646
          2                 0.933499105   -0.043499105
          3                 0.949972563   -0.089972563
          …                           …              …
        172                 0.464005539   -0.134005539
        173                 0.439295352   -0.039295352
        174                 0.373401518   -0.153401518


Figure 9.4 Annotated regression output of Figure 9.3 with relevant parts indicated.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.828589987   [This is the correlation coefficient by another name.]
R Square             0.686561367   [This is the square of the correlation coefficient.]
Adjusted R Square    0.684739049   [This is a modified form of the square of the correlation coefficient.]
Standard Error       0.103256531   [This is a measure of the variability of the Y variable for any given value of X.]
Observations                 174

ANOVA   [ANOVA=Analysis of Variance, a term that you need not remember]
              df            SS            MS             F   Significance F
Regression     1   4.016893786   4.016893786   376.7517545      3.41342E-45
Residual     172   1.833848743   0.010661911
Total        173   5.850742529
[F is a measure of how well the line fits the scatterplot. Higher is better.]
[Significance F tells how well the line fits the scatterplot. Lower is better.]

            Coefficients   Standard Error        t Stat       P-value     Lower 95%     Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    0.167483287      0.027568919   6.075076237   7.73731E-09   0.113066276   0.221900299   0.113066276   0.221900299
GDP Index    0.823672922      0.042435287   19.41009414   3.41342E-45   0.739911877   0.907433967   0.739911877   0.907433967
[Intercept is the intercept, A. For our purposes, it is just a number added in the equation. You want the number under "Coefficients" to use in your equation.]
[GDP Index is the row for the X variable. You want the number under "Coefficients" to use as B in your equation.]

RESIDUAL OUTPUT

Observation   Predicted Life Expy Index      Residuals
          1                 0.917025646   -0.017025646
          2                 0.933499105   -0.043499105
          3                 0.949972563   -0.089972563
          …                           …              …
        172                 0.464005539   -0.134005539
        173                 0.439295352   -0.039295352
        174                 0.373401518   -0.153401518


Figure 9.3 shows the EXCEL output for Example 9.1. This is much more information
than you need at this time. Figure 9.4 annotates that output to show you which parts
of it you can use now.

Figure 9.5 The relevant items, their values in this case, and their meanings.
Item                     Value in      Meaning
                         Figure 9.4
Multiple R               0.83          This is the correlation coefficient.
R Square                 0.69          The square of the Multiple R. We will explain its
                                       use later.
Adjusted R-Square        0.68          A special version of R-Square, that is important
                                       only for small samples. If it is more than 10% less
                                       than R-Square, use it in preference.
Observations             174           The number of cases entering into the regression.
F                        376.8         This is a number that measures how well the
                                       regression relationship fits the data of the
                                       scatterplot. High values are good. Usually, a value
                                       over 12-15 is a good sign.
Significance F           0.00000000    Another measure of the fit of the regression. Low
                                       values (below .05) are good.
Intercept Coefficient    0.17          The value of A, the intercept.
GDP Index Coefficient    0.82          The value of B, the multiplier of the value of GDP
                                       Index.

9.9 What’s the deal with R-Square?
As you saw in Chapter 7, the correlation coefficient is a number that gives you a
rough idea of how “tight” a correlation is, with values closer to +1 or –1 indicating a
tighter distribution. You also saw that the correlation coefficient does not have a strict
one-to-one relationship with patterns of correlation: the same correlation coefficient can
correspond to many different patterns when viewed as a scatterplot. Also, there is no
physical interpretation that you can make for this number. It is calculated by formulas
and is helpful, but it is impossible to relate directly to the patterns.
If you have a general understanding of regression analysis, then R-Square, the
square of the correlation coefficient, can be given a physical interpretation that will enable
you to make a general judgment about the relevance and utility of a particular line in
explaining a relationship.

Not all countries have the same value of Life Expectancy Index. You can
readily see that from the summary statistics (Figure 9.6).

Figure 9.6 Summary statistics for Life Expectancy Index.
Maximum                                           0.99
Third Quartile                                    0.81
Median                                            0.74
First Quartile                                    0.55
Minimum                                           0.22
Mean                                              0.68
Standard Deviation                                0.18
n                                                 172

The range of values is almost the whole of the possible range, 0.22 to 0.99. This is
a great variation. One way to look at the scatterplot is that we are trying to see if we can
explain some of this variability by the Index’s relationship with the GDP. Instead of
dealing with an index that ranges from 0.22 to 0.99, if we know the value of GDP, then
we can predict the value of the Life Expectancy Index.
If you look at the scatterplot, you can see that there is still variability around the
predictions that you make using the regression line. However, this variability is much
less. For example, if the GDP Index is 0.80, the predicted value is about 0.83 and the
points around it range from 0.77 to 0.85. That is a lot less variation!
We use R-Square to judge regressions because it is the proportion of the
variability in the Y variable that is explained by the regression relationship. In the
ANOVA table of Figure 9.3, it is the Regression SS divided by the Total SS:
4.017/5.851=0.69.
In this case, the correlation coefficient is 0.83, which sounds like an excellent
correlation, but the fact is, this line explains only 69% of the variability in the Life
Expectancy Index. That is actually quite good for socioeconomic data, but it is not
excellent.
Use this rule. Whenever anyone tells you that they have found a correlation of
such-and-so, square it and then decide what proportion of the total variability in the
dependent variable is explained by the independent variable.
You are now ready to make and evaluate elementary simple linear regressions.
In Appendix F, we show two approaches to obtaining regression outputs using
EXCEL:
1.      Quick and easy method. If you have a scatterplot, then you can quickly plot the
regression line on it and obtain the coefficients A and B and R-Square.
2.      The full Monty. Used to obtain the extensive outputs shown in Figure 9.3 and
Figure 9.4.


```
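The arithmetic behind Sections 9.6 to 9.9 can be checked with a short script. The sketch below is in Python rather than the EXCEL procedure of Appendix F, and all function and variable names are hypothetical. It copies the coefficients and sums of squares from Figure 9.3, reproduces the prediction for a GDP Index of 0.6, and recomputes R Square, Multiple R and F from the ANOVA rows. It also includes a small least-squares routine, run here only on the eight rows visible in Example 9.1, assuming the table values are the indexes multiplied by 100 (which is consistent with the residual output). Because those eight rows are a fraction of the 174 observations, that fit is not expected to reproduce B=0.82 and A=0.17.

```python
# Values copied from the EXCEL output in Figure 9.3.
INTERCEPT_A = 0.167483287
SLOPE_B = 0.823672922
SS_REGRESSION = 4.016893786
SS_RESIDUAL = 1.833848743
SS_TOTAL = 5.850742529
DF_REGRESSION = 1
DF_RESIDUAL = 172


def predict(gdp_index):
    """Predicted Life Expectancy Index from the fitted line Y = B*X + A."""
    return SLOPE_B * gdp_index + INTERCEPT_A


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the sum of
    squared vertical distances between the points and the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# R Square is the share of total variability explained by the regression.
r_square = SS_REGRESSION / SS_TOTAL
# Multiple R is the correlation coefficient, the square root of R Square
# (positive here because the slope is positive).
multiple_r = r_square ** 0.5
# F is the ratio of the two mean squares (each SS divided by its df).
f_stat = (SS_REGRESSION / DF_REGRESSION) / (SS_RESIDUAL / DF_RESIDUAL)

print("Prediction at GDP Index 0.6:", round(predict(0.6), 2))
print("R Square:", round(r_square, 4))
print("Multiple R:", round(multiple_r, 4))
print("F:", round(f_stat, 1))

# Residual = actual - predicted. Norway (observation 2) is assumed to have
# Life Expectancy Index 0.89 and GDP Index 0.93 (table values divided by 100).
print("Norway residual:", round(0.89 - predict(0.93), 4))

# Only the eight rows shown in Example 9.1, rescaled to 0-1 (an assumption).
gdp = [0.93, 0.95, 0.90, 0.92, 0.89, 0.36, 0.33, 0.25]
life = [0.89, 0.86, 0.99, 0.90, 0.90, 0.33, 0.40, 0.22]
slope8, intercept8 = least_squares_line(gdp, life)
print("Eight-row fit: B =", round(slope8, 2), "A =", round(intercept8, 2))
```

Recomputing R Square and F from the sums of squares is a quick way to confirm that you are reading the right cells of the output before you rely on them.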