Regression Analysis by jlranm

    Chapter 10
  Regression and Correlation
Techniques that are used to establish whether there is a
mathematical relationship between two or more variables, so that
the behavior of one variable can be used to predict the behavior of
others. Applicable to “Variables” data only.
   • “Regression” provides a functional relationship (Y=f(x))
   between the variables; the function represents the “average”
   relationship.
   • “Correlation” tells us the direction and the strength of the
   relationship.

   The analysis starts with a scatter plot of Y vs X.
     Simple Linear Regression
What is it?
Determines if Y depends on X and provides a math equation for the
relationship (continuous data).

Examples:
   • Process conditions and product properties
   • Sales and advertising budget

[Figure: scatter plot of y against x, captioned: Does Y depend on X? Which line is correct?]
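      Simple Linear Regression
A simple linear relationship can be described mathematically by
                                  Y = mX + b
   m = slope = rise / run
   b = Y intercept = the Y value at the point where the line intersects the Y axis

[Figure: a straight line on X-Y axes, with the rise and the run marked along the line and b marked on the Y axis.]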
   Simple Linear Regression
   slope = rise / run = (6 - 3) / (10 - 4) = 1/2
   intercept = 1

                        Y = 0.5X + 1

[Figure: the line plotted on axes with X running from 0 to 10, with the rise, the run, and the intercept marked.]
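
The rise-over-run arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `line_through` is made up for illustration, and the two points (4, 3) and (10, 6) are read off the slope calculation (6 - 3) / (10 - 4).

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    slope = (y2 - y1) / (x2 - x1)   # rise / run
    intercept = y1 - slope * x1     # value of Y where X = 0
    return slope, intercept


m, b = line_through((4, 3), (10, 6))
print(f"Y = {m}X + {b}")  # Y = 0.5X + 1.0
```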
Simple regression example
   An agent for a residential real estate
    company in a large city would like to
    predict the monthly rental cost for
    apartments based on the size of the
    apartment as defined by square
    footage. A sample of 25 apartments
    in a particular residential
    neighborhood was selected to gather
    the information.
The data on size and rent for the 25 apartments will be analyzed in EXCEL.

   Size (sq ft)    Rent ($ per month)
        850              950
       1450             1600
       1085             1200
       1232             1500
        718              950
       1485             1700
       1136             1650
        726              935
        700              875
        956             1150
       1100             1400
       1285             1650
       1985             2300
       1369             1800
       1175             1400
       1225             1450
       1245             1100
       1259             1700
       1150             1200
        896             1150
       1361             1600
       1040             1650
        755             1200
       1000              800
       1200             1750
       Scatter plot
[Figure: scatter plot of Rent (vertical axis, 500 to 2500) against Size (horizontal axis, 500 to 2100).]

       Scatter plot suggests that there is a ‘linear’ relationship between Rent and Size
  Interpreting EXCEL output
SUMMARY OUTPUT

        Regression Statistics
Multiple R                 0.85
R Square                   0.72
Adjusted R Square          0.71
Standard Error           194.60
Observations                25

ANOVA
                 df        SS             MS             F         Significance F
Regression        1    2268776.545    2268776.545    59.91376452    7.51833E-08
Residual         23     870949.4547     37867.3676
Total            24    3139726

             Coefficients   Standard Error   t Stat     P-value      Lower 95%   Upper 95%
Intercept      177.121         161.004        1.100    0.282669853   -155.942     510.184
Size             1.065           0.138        7.740    7.51833E-08      0.780       1.350

            Regression Equation
            Rent = 177.121 + 1.065*Size
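
For readers without EXCEL, the two coefficients can be reproduced from the 25 data pairs with the ordinary least-squares formulas (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of X). The sketch below is not how EXCEL computes its output internally; the names `DATA` and `least_squares` are invented for this illustration.

```python
# (size in sq ft, monthly rent in $) for the 25 sampled apartments
DATA = [
    (850, 950), (1450, 1600), (1085, 1200), (1232, 1500), (718, 950),
    (1485, 1700), (1136, 1650), (726, 935), (700, 875), (956, 1150),
    (1100, 1400), (1285, 1650), (1985, 2300), (1369, 1800), (1175, 1400),
    (1225, 1450), (1245, 1100), (1259, 1700), (1150, 1200), (896, 1150),
    (1361, 1600), (1040, 1650), (755, 1200), (1000, 800), (1200, 1750),
]


def least_squares(pairs):
    """Return (intercept, slope) of the least-squares line through the pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = least_squares(DATA)
print(f"Rent = {b0:.2f} + {b1:.3f} * Size")
```

The printed coefficients should agree with the Intercept and Size rows of the SUMMARY OUTPUT up to rounding.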
Interpretation of the
regression coefficient
   What does the coefficient of Size
    mean?

For every additional square foot,
predicted Rent goes up by $1.065, on average.




Using regression for
prediction
   Predict monthly rent when
    apartment size is 1000 square feet:


Regression Equation:
Rent = 177.121+1.065*Size
Thus, when Size=1000
Rent=177.121+1.065*1000=$1242 (rounded)
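
The same calculation as a small Python function, using the two coefficients from the Excel output. The function name `predict_rent` is hypothetical. The last line also illustrates the meaning of the slope: one extra square foot changes the predicted rent by $1.065.

```python
INTERCEPT = 177.121  # Intercept coefficient from the Excel output
SLOPE = 1.065        # Size coefficient: dollars of rent per square foot


def predict_rent(size_sqft):
    """Predicted monthly rent ($) from the fitted regression equation."""
    return INTERCEPT + SLOPE * size_sqft


print(round(predict_rent(1000)))                          # 1242
print(round(predict_rent(1001) - predict_rent(1000), 3))  # 1.065
```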


Using regression for
prediction – Caution!
   Regression equation is valid only over the range
    over which it was estimated!
       We should interpolate

   Do not use the equation in predicting Y when X
    values are not within the range of data used to
    develop the equation.
       Extrapolation can be risky

   Thus, we should not use the equation to predict
    rent for an apartment whose size is 500 square
    feet, since this value is below the smallest size
    in the sample (the sizes used to create the
    regression equation range from 700 to 1985 square feet).
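
One way to enforce this caution in code is to refuse predictions outside the sampled range. The sketch below assumes the limits 700 and 1985 square feet, which are the smallest and largest Size values in the data table; the function name `predict_rent_checked` is made up for illustration.

```python
SIZE_MIN, SIZE_MAX = 700, 1985  # smallest and largest Size in the sample


def predict_rent_checked(size_sqft):
    """Predict rent, but only inside the range of sizes used to fit the line."""
    if not SIZE_MIN <= size_sqft <= SIZE_MAX:
        raise ValueError(
            f"Size {size_sqft} is outside {SIZE_MIN}-{SIZE_MAX} sq ft; "
            "predicting here would be extrapolation"
        )
    return 177.121 + 1.065 * size_sqft


for size in (1000, 500):
    try:
        print(size, round(predict_rent_checked(size)))
    except ValueError as err:
        print(size, "not predicted:", err)
```

Raising an error is a deliberately strict choice; a softer variant could return the prediction together with a warning.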

Why extrapolation is risky
[Figure: the sample data cover only part of the X axis (marks at 2.5 and 4.0); outside that part, the extrapolated relationship departs from the true relationship.]

In this figure, we fit our regression model using sample data, but the
linear relation implicit in our regression model does not hold outside
our sample! By extrapolating, we are making erroneous estimates!
Correlation (r)
    “Correlation coefficient”, r, is a measure
    of the strength and the direction of the
    linear relationship between two variables.
    Values of r range from +1 (perfect
    direct relationship), through “0” (no
    linear relationship), to –1 (perfect inverse
    relationship). It measures the degree of
    scatter of the points around the “Least
    Squares” regression line.
 Coefficient of correlation
 from EXCEL
SUMMARY OUTPUT (Regression Statistics)
Multiple R                 0.85
     EXCEL reports Multiple R without a sign. The sign of r is the same as that of the
     coefficient of X (Size) in the regression equation (in our case the sign is positive).
     Also, if you look at the scatter plot, you will note that the sign should be positive.

     r = 0.85 suggests a fairly ‘strong’ correlation between size and rent.
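
As a cross-check on the Multiple R value, r can be computed directly from the data with the usual Pearson formula, r = Sxy / sqrt(Sxx * Syy). This is a sketch with invented names (`SIZES`, `RENTS`, `pearson_r`), not a reproduction of what EXCEL does internally.

```python
import math

SIZES = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285,
         1985, 1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755,
         1000, 1200]
RENTS = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650,
         2300, 1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200,
         800, 1750]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(SIZES, RENTS)
print(round(r, 2))  # 0.85
```

Unlike Multiple R, this formula carries its own sign, so a downward-sloping data set would give a negative r.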
Coefficient of determination
(r²)
   “Coefficient of Determination”, r-squared
    (sometimes R-squared), is the
    proportion of the variation in Y that is
    attributable to variation in X.
  Getting r² from EXCEL
SUMMARY OUTPUT (Regression Statistics)
R Square                   0.72

 It is important to remember that r-squared is never negative; it lies between 0 and 1. It is the
 square of the coefficient of correlation r. In our case, r² = 0.72 suggests that 72% of the
 variation in Rent is explained by the variation in Size. The higher the value of r²,
 the better the simple regression model fits the data.
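
The R Square figure can also be recovered from the ANOVA table, since r² is the regression sum of squares divided by the total sum of squares. A minimal sketch using the SS values from the SUMMARY OUTPUT (the variable names are illustrative):

```python
ss_regression = 2268776.545  # SS, Regression row of the ANOVA table
ss_total = 3139726.0         # SS, Total row of the ANOVA table

r_squared = ss_regression / ss_total
r = r_squared ** 0.5         # positive root, because the slope is positive

print(round(r_squared, 2), round(r, 2))  # 0.72 0.85
```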
Standard error (SE)
   Standard error measures the
    variability or scatter of the observed
    values around the regression line.

[Figure: scatter plot of Rent ($) against Size (square feet) with the fitted regression line.]
 Getting the standard error
 (SE) from EXCEL
SUMMARY OUTPUT (Regression Statistics)
Standard Error           194.60

    In our example, the standard error associated with estimating rent is $194.60.
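
The standard error is the square root of the residual mean square, that is, sqrt(SS residual / (n - 2)) for simple regression. The sketch below recomputes it from the ANOVA values in the SUMMARY OUTPUT; the variable names are illustrative.

```python
import math

ss_residual = 870949.4547  # SS, Residual row of the ANOVA table
n = 25                     # number of observations

# Two coefficients (intercept and slope) are estimated, leaving n - 2 = 23 df.
standard_error = math.sqrt(ss_residual / (n - 2))
print(round(standard_error, 2))  # 194.6
```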
Is the simple regression
model statistically valid?
    It is important to test whether the
     regression model developed from
     sample data is statistically valid.
    For simple regression, we can use
     two approaches to test whether the
     coefficient of X (the slope) is equal to zero:
    1.   using a t-test
    2.   using ANOVA
Is the coefficient of X equal
to zero?
   In both cases, the hypothesis we
    test is:

      H0: Slope = 0
      H1: Slope ≠ 0


What could we say about the linear relationship
between X and Y if the slope were zero?


Using coefficient information
for testing if slope=0
SUMMARY OUTPUT (Coefficients)
             Coefficients   Standard Error   t Stat     P-value
Size             1.065           0.138        7.740    7.51833E-08

P-value = 7.52E-08 = 7.52 × 10^-8 = 0.0000000752

t-stat=7.740 and P-value=7.52E-08. The P-value is very small. If it is smaller than
our α level, we reject the null hypothesis; otherwise we do not. If α=0.05, we reject
the null hypothesis and conclude that the slope is not zero. The same result holds at
α=0.01 because the P-value is smaller than 0.01. Thus, at the 0.05 (or 0.01) level, we
conclude that the slope is NOT zero, implying that our model is statistically valid.
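
The decision rule can be written out as a short check. In this sketch, the P-value and t Stat are copied from the Excel output, while the critical values 2.069 and 2.807 (two-sided, 23 degrees of freedom) are taken from a standard t table and are not part of the original slides; the function name `reject_null` is made up.

```python
def reject_null(p_value, alpha):
    """Decision rule: reject H0 (slope = 0) when the P-value is below alpha."""
    return p_value < alpha


p_value = 7.51833e-08  # P-value for Size in the Excel output
t_stat = 7.740         # t Stat for Size in the Excel output

# Two-sided critical values of t with 25 - 2 = 23 degrees of freedom,
# copied from a standard t table (an assumption, not from the slides).
T_CRITICAL = {0.05: 2.069, 0.01: 2.807}

for alpha, t_crit in T_CRITICAL.items():
    by_p_value = reject_null(p_value, alpha)
    by_t_table = abs(t_stat) > t_crit
    print(f"alpha={alpha}: reject by P-value: {by_p_value}, by t table: {by_t_table}")
```

Both routes give the same answer at both levels. Dividing the rounded coefficient by its rounded standard error (1.065 / 0.138) gives about 7.72 rather than 7.740; the small gap comes only from rounding in the displayed output.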
 Using ANOVA for testing if
 slope=0 in EXCEL
SUMMARY OUTPUT (ANOVA)
                 df        SS             MS             F         Significance F
Regression        1    2268776.545    2268776.545    59.91376452    7.51833E-08
Residual         23     870949.4547     37867.3676
Total            24    3139726

  F=59.91376 and P-value (Significance F)=7.51833E-08. The P-value is again very small. If it is
  smaller than our α level, we reject the null hypothesis; otherwise we do not. Thus, at the 0.05 (or
  0.01) level, the slope is NOT zero, implying that our model is statistically valid. This
  is the same conclusion we reached using the t-test.
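
The two tests agree because, in simple regression, the F statistic equals the square of the slope's t statistic. The sketch below recomputes F as MS regression divided by MS residual from the ANOVA values and compares it with t squared; the variable names are illustrative.

```python
ms_regression = 2268776.545 / 1  # SS regression / df regression
ms_residual = 870949.4547 / 23   # SS residual / df residual

f_stat = ms_regression / ms_residual
t_stat = 7.740                   # t Stat for Size in the Excel output

print(round(f_stat, 2))       # 59.91
print(round(t_stat ** 2, 2))  # 59.91
```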
 Confidence interval for the
 slope of Size
SUMMARY OUTPUT (Coefficients)
             Coefficients   Standard Error   Lower 95%   Upper 95%
Size             1.065           0.138         0.780       1.350

 The 95% CI tells us that we can be 95% confident that, for every 1 square foot increase
 in apartment Size, Rent increases on average by between $0.78 and $1.35.
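
The interval is the slope estimate plus or minus a t critical value times the slope's standard error. In this sketch the coefficient and standard error come from the Excel output, while the critical value 2.0687 (two-sided 95%, 23 degrees of freedom) is taken from a t table and is an assumption not shown on the slides.

```python
slope = 1.065     # Size coefficient from the Excel output
se_slope = 0.138  # its standard error from the Excel output
t_crit = 2.0687   # t table value for 95% confidence, 23 df (assumed)

margin = t_crit * se_slope
lower, upper = slope - margin, slope + margin
print(round(lower, 3), round(upper, 3))  # 0.78 1.35
```

Because the interval does not contain zero, it leads to the same conclusion as the t-test at the 0.05 level.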
    Summary
   Simple regression is a statistical tool that attempts to fit
    a straight line relationship between X (independent
    variable) and Y (dependent variable)

   The scatter plot gives us a visual clue about the nature of
    the relationship between X and Y

   EXCEL or other statistical software is used to ‘fit’ the
    model; a good model will be statistically valid, and will
    have a reasonably high R-squared value

   A good model is then used to make predictions; when
    making predictions, be sure to confine them within the
    domain of X’s used to fit the model (i.e. interpolate); we
    should avoid extrapolation
